is predicted."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-121",
"text": "To ensure causality in answer decoding, we mask the attention weights in the self-attention layers of the transformer architecture [48] such that question words, detected objects and OCR tokens cannot attend to any decoding steps, and all decoding steps can only attend to previous decoding steps in addition to question words, detected objects and OCR tokens."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-122",
"text": "This is similar to prefix LM in [40] ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-124",
"text": "**TRAINING**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-125",
"text": "During training, we supervise our multimodal transformer at each decoding step."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-126",
"text": "Similar to sequence prediction tasks such as machine translation, we use teacherforcing [28] (i.e. using ground-truth inputs to the decoder) to train our multi-step answer decoder, where each groundtruth answer is tokenized into a sequence of words."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-127",
"text": "Given that an answer word can appear in both fixed answer vocabulary and OCR tokens, we apply multi-label sigmoid loss (instead of softmax loss) over the concatenated scores y all t ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-129",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-130",
"text": "We evaluate our model on three challenging datasets for the TextVQA task, including the TextVQA dataset [44] , the ST-VQA dataset [8] , and the OCR-VQA dataset [37] ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-131",
"text": "Our model outperforms previous work by a significant margin on all the three datasets."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-133",
"text": "**EVALUATION ON THE TEXTVQA DATASET**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-134",
"text": "The TextVQA dataset [44] contains 28,408 images from the Open Images dataset [27] , with human-written questions asking to reason about text in the image."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-135",
"text": "Similar to VQAv2 [17] , each question in the TextVQA dataset has 10 human annotated answers, and the final accuracy is measured via soft voting of the 10 answers."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-136",
"text": "2 We use d = 768 as the dimensionality of the joint embedding space and extract question word features with BERT-BASE using the 768-dimensional outputs from its first three layers, which are fine-tuned during training."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-137",
"text": "For visual objects, following Pythia [43] and LoRRA [44] , we detect objects with a Faster R-CNN detector [41] pretrained on the Visual Genome dataset [26] , and keeps 100 top-scoring objects per image."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-138",
"text": "Then, the fc6 feature vector is extracted from each detected object."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-139",
"text": "We apply the Faster R-CNN fc7 weights on the extracted fc6 features to output 2048-dimensional fc7 appearance features and finetune fc7 weights during training."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-140",
"text": "However, we do not use the ResNet-152 convolutional features [19] as in LoRRA."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-141",
"text": "Finally, we extract text tokens on each image using the Rosetta OCR system [10] ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-142",
"text": "Unlike the prior work LoRRA [44] that uses a multilingual Rosetta version, in our model we use an English-only version of Rosetta that we find has higher recall."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-143",
"text": "We refer to these two versions as Rosettaml and Rosetta-en, respectively."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-144",
"text": "As mentioned in Sec. 3.1, from each OCR token we extract FastText [9] feature, appearance feature from Faster R-CNN (FRCN), PHOC [2] feature, and bounding box (bbox) feature."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-145",
"text": "In our multimodal transformer, we use L = 4 layers of multimodal transformer with 12 attention heads."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-146",
"text": "Other hyper-parameters (such as dropout ratio) follow BERT-BASE [13] ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-147",
"text": "However, we note that the multimodal trans-former parameters are initialized from scratch rather than from a pretrained BERT model."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-148",
"text": "We use T = 12 maximum decoding step in answer prediction unless otherwise specified, which is sufficient to cover almost all answers."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-149",
"text": "We collect the top 5000 frequent words from the answers in the training set as our answer vocabulary."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-150",
"text": "During training, we use a batch size of 128, and train for a maximum of 24,000 iterations."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-151",
"text": "Our model is trained using the Adam optimizer, with a learning rate of 1e-4 and a staircase learning rate schedule, where we multiply the learning rate by 0.1 at 14000 and at 19000 iterations."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-152",
"text": "The best snapshot is selected using the validation set accuracy."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-153",
"text": "The entire training takes approximately 10 hours on 4 Nvidia Tesla V100 GPUs."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-154",
"text": "As a notable prior work on this dataset, we show a stepby-step comparison with the LoRRA model [44] ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-155",
"text": "Ablations on pretrained question encoding and OCR systems."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-156",
"text": "We first experiment with a restricted version of our model using the multimodal transformer architecture but without iterative decoding in answer prediction, i.e. M4C (w/o dec.) in Table 1 ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-157",
"text": "In this setting, we only decode for one step, and either select a frequent answer 3 from the training set or copy a single OCR token in the image as the answer."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-158",
"text": "As a step-by-step comparison with LoRRA, we start with extracting OCR tokens from Rosetta-ml, representing OCR tokens only with FastText vectors, and initializing question encoding parameters in Sec. 3.1 from scratch (rather than from a pretrained BERT-BASE model)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-159",
"text": "The result is shown in line 2 of Table 1 ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-160",
"text": "Compared with LoRRA in line 1, this restricted version of our model already outperforms LoRRA by around 3% (absolute) on TextVQA validation set."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-161",
"text": "Given that LoRRA uses pretrained GloVe [39] for question encoding while we learn question encoding from scratch in line 2, this result shows that our multimodal transformer architecture is more efficient for jointly modeling the three input modalities."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-162",
"text": "We then switch to a pretrained BERT for question encoding in line 3, and Rosettaen for OCR extraction in line 4."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-163",
"text": "Comparing line 2 to 4, we see that a pretrained BERT leads to around 0.6% higher accuracy, and Rosetta-en gives another 1% improvement."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-164",
"text": "Ablations on OCR feature representation."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-165",
"text": "We analyze the impact of our rich OCR representation in Sec. 3.1 through ablations in Table 1 Figure 3 ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-166",
"text": "Accuracy under different maximum decoding steps T on the validation set of TextVQA, ST-VQA, and OCR-VQA."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-167",
"text": "There is a major gap between single-step (T = 1) and multi-step (T > 1) answer prediction."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-168",
"text": "We use 12 steps by default in our experiments."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-169",
"text": "location (bbox) features and the RoI-pooled appearance features (FRCN) both improve the performance by a noticeable margin."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-170",
"text": "In addition, we find that PHOC is also helpful as a character-level representation of the OCR token."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-171",
"text": "Our rich OCR representation gives around 4% (absolute) accuracy improvement compare with using only FastText features as in LoRRA (line 7 vs 4)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-172",
"text": "We note that our extra OCR features do not require more pretrained models, as we apply exactly the same Faster R-CNN model use in object detection for OCR appearance features, and PHOC is a manuallydesigned feature that does not need pretraining."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-173",
"text": "Iterative answer decoding."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-174",
"text": "We then apply our full M4C model with iterative answer decoding to the TextVQA dataset."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-175",
"text": "The results are shown in Table 1 line 10, which is around 4% (absolute) higher than its counterpart in line 7 using a single-step classifier and 13% (absolute) higher than LoRRA in line 1."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-176",
"text": "In addition, we ablate our model using Rosetta-ml and randomly initialized question encoding parameters in line 8 and 9."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-177",
"text": "Here, we see that our model in line 8 still outperforms LoRRA (line 1) by as much as 9.5% (absolute) when using the same OCR system as LoRRA and even fewer pretrained components."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-178",
"text": "We also analyze the performance of our model with respect to the maximum decoding steps, shown in Figure 3 , where decoding for multiple steps greatly improves the performance compared with a single step."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-179",
"text": "Figure 4 shows qualitative examples (more examples in appendix) of our M4C model on the TextVQA dataset in comparison to LoRRA [44] , where our model is capable of selecting multiple OCR tokens and combining them with its fixed vocabulary in predicted answers."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-180",
"text": "Qualitative insights."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-181",
"text": "When inspecting the errors, we find that a major source of errors is OCR failure (e.g. in the last example in Figure 4 , we find that the digits on the watch are not detected)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-182",
"text": "This suggests that the accuracy of our model could be improved with better OCR systems, as supported by the comparison between line 9 and 10 in Table 1 ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-183",
"text": "Another possible future direction is to dynamically recognize text in the image based on the question (e.g. if the question asks about the price of a product brand, one may want to directly localize the brand name in the image)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-184",
"text": "Some other errors of our model include resolving relations between objects and text or understanding large chunks of text in images (such as book pages)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-185",
"text": "However, our model is able to correct a large number of mistakes in previous work where copying multiple text tokens is required to form an answer."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-186",
"text": "TextVQA Challenge 2019."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-187",
"text": "We also compare to the winning entries in the TextVQA Challenge 2019."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-188",
"text": "We compare our method to DCD [32] (the challenge winner, based on an ensemble) and MSFT VTI [46] (the top entry after the challenge), both relying on one-step prediction."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-189",
"text": "Figure 4: Compared to the previous work LoRRA [44], which selects one answer from the training set or copies only a single OCR token, our model can copy multiple OCR tokens and combine them with its fixed vocabulary through iterative decoding."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-191",
"text": "We show that our single model (line 10) significantly outperforms these challenge winning entries on the TextVQA test set by a large margin."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-192",
"text": "We also experiment with using the ST-VQA dataset [8] as additional training data (a practice used by some of the previous challenge participants), which gives another 1% improvement and 40.46% final test accuracya new state-of-the-art on the TextVQA dataset."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-193",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-194",
"text": "**EVALUATION ON THE ST-VQA DATASET**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-195",
"text": "The ST-VQA dataset [8] contains natural images from multiple sources including ICDAR 2013 [24] , ICDAR 2015 [23] , ImageNet [12] , VizWiz [18] , IIIT STR [36] , Visual Genome [26] , and COCO-Text [49] ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-196",
"text": "5 The format of the ST-VQA dataset is similar to the TextVQA dataset in Sec. 4.1."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-197",
"text": "However, each question is accompanied by only one or two ground-truth answers provided by the question writer."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-198",
"text": "The dataset involves three tasks, and its Task 3 -Open Dictionary (containing 18,921 training-validation images and test 2,971 images) corresponds to our general TextVQA setting where no answer candidates are provided at test time."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-199",
"text": "The ST-VQA dataset adopts Average Normalized Levenshtein Similarity (ANLS) 6 as its official evaluation metric, defined as scores 1 \u2212 d L (a pred , a gt )/ max(|a pred |, |a gt |) (where a pred and a gt are prediction and ground-truth answers and d L is edit distance) averaged over all questions."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-200",
"text": "Also, all scores below the threshold 0.5 are truncated to 0 before averaging."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-201",
"text": "To facilitate comparison, we report both accuracy and ANLS in our experiments."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-202",
"text": "As the ST-VQA dataset does not have an official split for training and validation, we randomly select 17,028 im- 5 We notice that many images from COCO-Text [49] in the downloaded ST-VQA data (around 1/3 of all images) are resized to 256\u00d7256 for unknown reasons, which degrades the image quality and distorts their aspect ratios."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-203",
"text": "In our experiments, we replace these images with their original versions from COCO-Text as inputs to object detection and OCR systems."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-205",
"text": "Table 2: Ablations of our model."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-206",
"text": "We train two versions of our model, one restricted version (M4C w/o dec."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-207",
"text": "in Table 2 ) with a fixed one-step classifier as output module (similar to line 7 in Table 1 ) and one full version (M4C) with iterative answer decoding."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-208",
"text": "Comparing the results of these two models, it can be seen that there is a large improvement from our iterative answer prediction mechanism."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-209",
"text": "Comparison to previous work."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-210",
"text": "We compare with two previous methods on this dataset: 1) SAN+STR [8] , which combines SAN for VQA [51] and Scene Text Retrieval [16] for answer vocabulary retrieval, and 2) VTA [7] , the ICDAR 2019 ST-VQA Challenge 6 winner, based on BERT [13] for question encoding and BUTD [3] for VQA."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-211",
"text": "From Table 2 , it can be seen that our restricted model (M4C w/o dec.) already achieves higher ANLS than these two models, and our full model achieves as much as +0.18 (absolute) ANLS boost over the best previous work."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-212",
"text": "We also ablate the maximum copying number in our model in Figure 3, showing that it is beneficial to decode for multiple (as opposed to one) steps."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-216",
"text": "Qualitative examples: 'What is the name of the street on which the Stop sign appears?' (prediction: 45th parallel dr; GT: 45th parallel dr); 'What does the white sign say?' (prediction: tokyo station; GT: tokyo station); 'How many cents per pound are the bananas?' (prediction: 99; GT: 99); 'What kind of stop sign is in the image?' (prediction: stop all way; GT: all way)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-217",
"text": "Figure 5 shows qualitative examples of our model on the ST-VQA dataset."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-218",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-219",
"text": "**EVALUATION ON THE OCR-VQA DATASET**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-220",
"text": "The OCR-VQA dataset [37] contains 207,572 images of book covers, with template-based questions asking about the title, author, edition, genre, year or other information about the book."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-221",
"text": "Each question is has a single ground-truth answer, and the dataset assumes that the answers to these questions can be inferred from the book cover images."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-222",
"text": "We train our model using the same hyper-parameters as in Sec. 4.1 and 4.2, but use 2\u00d7 the total iterations and adapted learning rate schedule since the OCR-VQA dataset contains more images."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-223",
"text": "The results are shown in Table 3 ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-224",
"text": "Compared to using a one-step classifier (M4C w/o dec.), our full model with iterative decoding achieves significantly better accuracy, which coincides with Figure 3 that having multiple decoding steps is greatly beneficial on this dataset."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-225",
"text": "This is likely because the OCR-VQA dataset often contains multi-word answers such as book titles and author names."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-226",
"text": "We compare to four baseline approaches from [37] , which are VQA systems based on 1) visual features from a convolutional network (CNN), 2) grouping OCR tokens into text blocks (BLOCK) with manually defined rules, 3) an averaged word2vec (W2V) feature over all the OCR tokens in the image, and 4) their combinations."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-227",
"text": "Note that while the BLOCK baseline can also select multiple OCR tokens, it relies on manually defined rules to merge tokens into groups and can only select one group as answer, while our method learns from data how to copy OCR tokens to compose answers."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-228",
"text": "Compare to these baselines, our M4C has over 15% (absolute) higher test accuracy."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-229",
"text": "Figure 6 shows qualitative examples of our model on this dataset."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-230",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-231",
"text": "**CONCLUSION**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-232",
"text": "In this paper, we present Multimodal Multi-Copy Mesh (M4C) for visual question answering based on understanding and reasoning about text in images."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-233",
"text": "M4C adopts rich representations for text in the images, jointly models all modalities through a pointer-augmented multimodal transformer architecture over a joint embedding space, and predicts the answer through iterative decoding, outperforming previous work by a large margin on three challenging datasets for the TextVQA task."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-234",
"text": "Our results suggest that it is efficient to handle multiple modalities through domain-specific embedding followed by homogeneous selfattention and to generate complex answers as multi-step decoding instead of one-step classification."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-235",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-236",
"text": "**ITERATIVE ANSWER PREDICTION WITH POINTER-AUGMENTED**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-237",
"text": "Multimodal Transformers for TextVQA ( Supplementary Material) A. Hyper-parameters in M4C"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-238",
"text": "We summarize the hyper-parameters in our M4C model in Table 4 ."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-239",
"text": "Most hyper-parameters are the same across all the three datasets (TextVQA, ST-VQA, and OCR-VQA), except that we use 2\u00d7 the total iterations and adapted learning rate schedule on the OCR-VQA dataset since it contains more images."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-242",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-243",
"text": "**B. ADDITIONAL ABLATION ANALYSIS**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-244",
"text": "During the iterative answer decoding process, at each step our M4C model can decode an answer word either from the model's fixed vocabulary, or from the OCR tokens extracted from the image."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-245",
"text": "We find in our experiments that it is necessary to have both the fixed vocabulary space and the OCR tokens."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-246",
"text": "Table 5 shows our ablation study where we remove the fixed answer vocabulary or the dynamic pointer network for OCR copying from our M4C."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-247",
"text": "Both these two ablated versions have a large accuracy drop compared to our full model."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-248",
"text": "However, we note that even without fixed answer vocabulary, our restricted model (M4C w/o fixed vocabulary in Table 5 ) still outperforms the previous work LoRRA [44] , suggesting that it is particularly important to learn to copy multiple OCR tokens to form an answer (a key feature in our model but not in LoRRA)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-249",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-251",
"text": "Table 5 (columns: #, Method, TextVQA Val Accuracy)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-252",
"text": "We ablate our M4C model by removing its fixed answer vocabulary (M4C w/o fixed vocabulary) or its dynamic pointer network for OCR copying (M4C w/o OCR copying) on the TextVQA dataset."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-253",
"text": "We see that our full model has significantly higher accuracy than these ablations, showing that it is important to have both a fixed and a dynamic vocabulary (i.e. OCR tokens)."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-254",
"text": "----------------------------------"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-255",
"text": "**C. ADDITIONAL QUALITATIVE EXAMPLES**"
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-256",
"text": "As mentioned in Sec. 4.1 in the main paper, we find that OCR failure is a major source of error for our M4C model's predictions."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-257",
"text": "Figure 7 shows cases on the TextVQA dataset where the OCR system fails to precisely localize the corresponding text tokens in the image, suggesting that our model's accuracy can be improved with better OCR systems."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-258",
"text": "Figure 8 , 9, and 10 shows additional qualitative examples from our M4C model on the TextVQA dataset, ST-VQA, and OCR-VQA datasets, respectively."
},
{
"sent_id": "5e0b1b085a7a10b1e1c17286f7048e-C001-259",
"text": "While our model occasionally fails when reading a large piece of text or resolving the relation between text and objects as in Figure 8 (f) and (h), in most cases it learns to identify and copy text tokens from the image and combine them with its fixed vocabulary to predict an answer."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-16"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-21",
"5e0b1b085a7a10b1e1c17286f7048e-C001-22"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-40"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-41"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-58"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-134"
]
],
"cite_sentences": [
"5e0b1b085a7a10b1e1c17286f7048e-C001-16",
"5e0b1b085a7a10b1e1c17286f7048e-C001-21",
"5e0b1b085a7a10b1e1c17286f7048e-C001-22",
"5e0b1b085a7a10b1e1c17286f7048e-C001-40",
"5e0b1b085a7a10b1e1c17286f7048e-C001-41",
"5e0b1b085a7a10b1e1c17286f7048e-C001-58",
"5e0b1b085a7a10b1e1c17286f7048e-C001-134"
]
},
"@DIF@": {
"gold_contexts": [
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-18"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-36"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-90"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-142"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-189"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-248"
]
],
"cite_sentences": [
"5e0b1b085a7a10b1e1c17286f7048e-C001-18",
"5e0b1b085a7a10b1e1c17286f7048e-C001-36",
"5e0b1b085a7a10b1e1c17286f7048e-C001-90",
"5e0b1b085a7a10b1e1c17286f7048e-C001-142",
"5e0b1b085a7a10b1e1c17286f7048e-C001-189",
"5e0b1b085a7a10b1e1c17286f7048e-C001-248"
]
},
"@MOT@": {
"gold_contexts": [
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-22",
"5e0b1b085a7a10b1e1c17286f7048e-C001-23",
"5e0b1b085a7a10b1e1c17286f7048e-C001-24",
"5e0b1b085a7a10b1e1c17286f7048e-C001-25",
"5e0b1b085a7a10b1e1c17286f7048e-C001-26",
"5e0b1b085a7a10b1e1c17286f7048e-C001-27",
"5e0b1b085a7a10b1e1c17286f7048e-C001-28"
]
],
"cite_sentences": [
"5e0b1b085a7a10b1e1c17286f7048e-C001-22"
]
},
"@EXT@": {
"gold_contexts": [
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-58",
"5e0b1b085a7a10b1e1c17286f7048e-C001-59",
"5e0b1b085a7a10b1e1c17286f7048e-C001-60"
]
],
"cite_sentences": [
"5e0b1b085a7a10b1e1c17286f7048e-C001-58"
]
},
"@USE@": {
"gold_contexts": [
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-81"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-130"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-137"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-154"
],
[
"5e0b1b085a7a10b1e1c17286f7048e-C001-179"
]
],
"cite_sentences": [
"5e0b1b085a7a10b1e1c17286f7048e-C001-81",
"5e0b1b085a7a10b1e1c17286f7048e-C001-130",
"5e0b1b085a7a10b1e1c17286f7048e-C001-137",
"5e0b1b085a7a10b1e1c17286f7048e-C001-154",
"5e0b1b085a7a10b1e1c17286f7048e-C001-179"
]
}
}
},
"ABC_64b344bf8ec9b6a113bf6b3f638528_2": {
"x": [
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-94",
"text": "**IMPLEMENTATION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-95",
"text": "We implement the neural network using the torch7 library (Collobert et al., 2011a) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-2",
"text": "Named entity recognition is a challenging task that has traditionally required large amounts of knowledge in the form of feature engineering and lexicons to achieve high performance."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-3",
"text": "In this paper, we present a novel neural network architecture that automatically detects word-and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-4",
"text": "We also propose a novel method of encoding partial lexicon matches in neural networks and compare it to existing approaches."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-5",
"text": "Extensive evaluation shows that, given only tokenized text and publicly available word embeddings, our system is competitive on the CoNLL-2003 dataset and surpasses the previously reported state of the art performance on the OntoNotes 5.0 dataset by 2.13 F1 points."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-6",
"text": "By using two lexicons constructed from publicly-available sources, we establish new state of the art performance with an F1 score of 91.62 on CoNLL-2003 and 86.28 on OntoNotes, surpassing systems that employ heavy feature engineering, proprietary lexicons, and rich entity linking information."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-9",
"text": "Named entity recognition is an important task in NLP."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-10",
"text": "High performance approaches have been dominated by applying CRF, SVM, or perceptron models to hand-crafted features (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-11",
"text": "However, Collobert et al. (2011b) proposed an effective neural network model that requires little feature engineering and instead learns important features from word embeddings trained on large quantities of unlabelled text -an approach made possible by recent advancements in unsupervised learning of word embeddings on massive amounts of data (Collobert and Weston, 2008; Mikolov et al., 2013) and neural network training algorithms permitting deep architectures (Rumelhart et al., 1986) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-12",
"text": "Unfortunately there are many limitations to the model proposed by Collobert et al. (2011b) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-13",
"text": "First, it uses a simple feed-forward neural network, which restricts the use of context to a fixed sized window around each word -an approach that discards useful long-distance relations between words."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-14",
"text": "Second, by depending solely on word embeddings, it is unable to exploit explicit character level features such as prefix and suffix, which could be useful especially with rare words where word embeddings are poorly trained."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-15",
"text": "We seek to address these issues by proposing a more powerful neural network model."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-16",
"text": "A well-studied solution for a neural network to process variable length input and have long term memory is the recurrent neural network (RNN) (Goller and Kuchler, 1996) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-17",
"text": "Recently, RNNs have shown great success in diverse NLP tasks such as speech recognition (Graves et al., 2013) , machine translation (Cho et al., 2014) , and language modeling (Mikolov et al., 2011) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-18",
"text": "The long-short term memory (LSTM) unit with the forget gate allows highly non-trivial long-distance dependencies to be easily learned (Gers et al., 2000) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-19",
"text": "For sequential labelling tasks such as NER and speech recognition, a bi-directional LSTM model can take into account an effectively infinite amount of context on both sides of a word and eliminates the problem of limited context that applies to any feed-forward model (Graves et al., 2013) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-20",
"text": "While LSTMs have been studied in the past for the NER task by Hammerton (2003) , the lack of computational power (which led to the use of very small models) and quality word embeddings limited their effectiveness."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-21",
"text": "Convolutional neural networks (CNN) have also been investigated for modeling character-level information, among other NLP tasks."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-22",
"text": "Santos et al. (2015) and Labeau et al. (2015) successfully employed CNNs to extract character-level features for use in NER and POS-tagging respectively."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-23",
"text": "Collobert et al. (2011b) also applied CNNs to semantic role labeling, and variants of the architecture have been applied to parsing and other tasks requiring tree structures (Blunsom et al., 2014) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-24",
"text": "However, the effectiveness of character-level CNNs has not been evaluated for English NER."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-25",
"text": "While we considered using character-level bi-directional LSTMs, which was recently proposed by Ling et al. (2015) for POStagging, preliminary evaluation shows that it does not perform significantly better than CNNs while being more computationally expensive to train."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-26",
"text": "Our main contribution lies in combining these neural network models for the NER task."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-27",
"text": "We present a hybrid model of bi-directional LSTMs and CNNs that learns both character-and word-level features, presenting the first evaluation of such an architecture on well-established English language evaluation datasets."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-28",
"text": "Furthermore, as lexicons are crucial to NER performance, we propose a new lexicon encoding scheme and matching algorithm that can make use of partial matches, and we compare it to the simpler approach of Collobert et al. (2011b) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-29",
"text": "Extensive evaluation shows that our proposed method establishes a new state of the art on both the CoNLL-2003 NER shared task and the OntoNotes 5.0 datasets."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-31",
"text": "**MODEL**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-32",
"text": "Our neural network is inspired by the work of Collobert et al. (2011b) , where lookup tables transform discrete features such as words and characters into continuous vector representations, which are then concatenated and fed into a neural network."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-33",
"text": "Instead of a feed-forward network, we use the bi-directional long-short term memory (BLSTM) network."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-34",
"text": "To induce character-level features, we use a convolutional neural network, which has been successfully applied to Spanish and Portuguese NER (Santos et al., 2015) and German POS-tagging (Labeau et al., 2015) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-36",
"text": "**SEQUENCE-LABELLING WITH BLSTM**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-37",
"text": "Following the speech-recognition framework outlined by Graves et al. (2013) , we employed a stacked 1 bi-directional recurrent neural network with long short-term memory units to transform word features into named entity tag scores."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-38",
"text": "Figures 1, 2, and 3 illustrate the network in detail."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-39",
"text": "The extracted features of each word are fed into a forward LSTM network and a backward LSTM network."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-40",
"text": "The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-41",
"text": "These two vectors are then simply added together to produce the final output."
},
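The decoding step above can be sketched as follows; all sizes and weights here are hypothetical stand-ins, not the paper's actual hyper-parameters, and the LSTM outputs are assumed given:

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

# Hypothetical sizes: T time steps, H hidden units, K tag categories.
T, H, K = 5, 8, 4
rng = np.random.default_rng(0)
h_fwd = rng.normal(size=(T, H))   # forward-LSTM outputs (assumed given)
h_bwd = rng.normal(size=(T, H))   # backward-LSTM outputs (assumed given)
W_f, b_f = rng.normal(size=(H, K)), np.zeros(K)   # forward decoding layer
W_b, b_b = rng.normal(size=(H, K)), np.zeros(K)   # backward decoding layer

# Decode each direction into per-tag log-probabilities, then add the two
# vectors together to produce the final output at each time step.
logp_fwd = log_softmax(h_fwd @ W_f + b_f)
logp_bwd = log_softmax(h_bwd @ W_b + b_b)
final_scores = logp_fwd + logp_bwd   # shape (T, K)
```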
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-42",
"text": "We tried minor variants of output layer architecture and selected the one that performed the best in preliminary experiments."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-44",
"text": "**EXTRACTING CHARACTER FEATURES USING A CONVOLUTIONAL NEURAL NETWORK**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-70",
"text": "**LEXICONS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-45",
"text": "For each word we employ a convolution and a max layer to extract a new feature vector from the percharacter feature vectors such as character embeddings (Section 2.3.2) and (optionally) character type (Section 2.5)."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-46",
"text": "Words are padded with a number of special PADDING characters on both sides depending on the window size of the CNN."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-47",
"text": "The hyper-parameters of the CNN are the window size and the output vector size."
},
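A minimal sketch of this character-level CNN, assuming made-up values for the window size and output size (the actual tuned hyper-parameters appear in the paper's Table 3) and a lowercase alphabet for the character vocabulary:

```python
import numpy as np

rng = np.random.default_rng(1)

# 25-dim character embeddings (Section 2.3.2); window w and output size k
# are hypothetical hyper-parameter values.
d, w, k = 25, 3, 10
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
PAD, UNK = len(char_vocab), len(char_vocab) + 1
emb = rng.uniform(-0.5, 0.5, size=(len(char_vocab) + 2, d))  # char lookup table
W = rng.normal(size=(w * d, k))  # convolution filter, one k-vector per window
b = np.zeros(k)

def char_cnn_features(word):
    # Pad with w-1 PADDING characters on each side, look up embeddings,
    # slide a width-w window, and max-pool over all window positions.
    ids = [PAD] * (w - 1) + [char_vocab.get(c, UNK) for c in word.lower()] + [PAD] * (w - 1)
    X = emb[ids]
    windows = np.stack([X[i:i + w].reshape(-1) for i in range(len(ids) - w + 1)])
    conv = windows @ W + b        # (num_positions, k)
    return conv.max(axis=0)       # fixed-size vector regardless of word length

feat = char_cnn_features("Obama")
```

The max-pooling step is what makes the output size independent of word length.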
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-48",
"text": "1 For each direction (forward and backward), the input is fed into multiple layers of LSTM units connected in sequence (i.e. LSTM units in the second layer take in the output of the first layer, and so on); the number of layers is a tuned hyperparameter."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-49",
"text": "Figure 1 shows only one unit for simplicity."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-51",
"text": "**WORD EMBEDDINGS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-52",
"text": "Our best model uses the publicly available 50dimensional word embeddings released by Collobert et al. (2011b) 2 , which were trained on Wikipedia and the Reuters RCV-1 corpus."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-53",
"text": "We also experimented with two other sets of published embeddings, namely Stanford's GloVe embeddings 3 trained on 6 billion words from Wikipedia and Web text (Pennington et al., 2014) and Google's word2vec embeddings 4 trained on 100 billion words from Google News (Mikolov et al., 2013) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-54",
"text": "In addition, as we hypothesized that word embeddings trained on in-domain text may perform better, we also used the publicly available GloVe (Pennington et al., 2014) program and an in-house re-implementation 5 of the word2vec (Mikolov et al., 2013) program to train word embeddings on Wikipedia and Reuters RCV1 datasets as well."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-55",
"text": "6 Following Collobert et al. (2011b) , all words are lower-cased before passing through the lookup table Text Hayao Tada , commander of the Japanese North China Area Army to convert to their corresponding embeddings."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-56",
"text": "The pre-trained embeddings are allowed to be modified during training."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-57",
"text": "7"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-59",
"text": "**CHARACTER EMBEDDINGS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-60",
"text": "We randomly initialized a lookup table with values drawn from a uniform distribution with range [\u22120.5, 0.5] to output a character embedding of 25 dimensions."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-61",
"text": "The character set includes all unique characters in the CoNLL-2003 dataset 8 plus the special tokens PADDING and UNKNOWN."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-62",
"text": "The PADDING token is used for the CNN, and the UNKNOWN token is used for all other characters (which appear in OntoNotes)."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-63",
"text": "The same set of random embeddings was used for all experiments."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-64",
"text": "9 2.4 Additional Word-level Features"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-66",
"text": "**CAPITALIZATION FEATURE**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-67",
"text": "As capitalization information is erased during lookup of the word embedding, we evaluate Collobert's method of using a separate lookup table to add a capitalization feature with the following options: allCaps, upperInitial, lowercase, mixedCaps, noinfo (Collobert et al., 2011b) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-68",
"text": "This method is compared with the character type feature (Section 2.5) and character-level CNNs."
},
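Collobert's capitalization classes can be sketched as a simple mapping; the treatment of digits-only tokens and of single capital letters below is an assumption, not taken from the paper:

```python
def cap_feature(word):
    # Map a word to one of Collobert's capitalization classes; "noinfo"
    # covers tokens with no alphabetic characters (an assumption here).
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "noinfo"
    if all(c.isupper() for c in letters):
        return "allCaps"
    if all(c.islower() for c in letters):
        return "lowercase"
    if word[0].isupper():
        return "upperInitial"
    return "mixedCaps"
```

The class index would then feed a separate lookup table, restoring the case information erased by lower-casing before the word-embedding lookup.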
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-71",
"text": "Most state of the art NER systems make use of lexicons as a form of external knowledge (Ratinov and Roth, 2009; Passos et al., 2014) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-72",
"text": "For each of the four categories (Person, Organization, Location, Misc) defined by the CoNLL 2003 NER shared task, we compiled a list of known named entities from DBpedia (Auer et al., 2007) , by extracting all descendants of DBpedia types corresponding to the CoNLL categories."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-73",
"text": "14 We did not construct separate lexicons for the OntoNotes tagset because correspondences between DBpedia categories and its tags could not be found in many instances."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-74",
"text": "In addition, for each entry we first removed parentheses and all text contained within, then stripped trailing punctuation, 15 and finally tokenized it with the Penn Treebank tokenization script for the purpose of partial matching."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-75",
"text": "Table 1 shows the size of each category in our lexicon compared to Collobert's lexicon, which we extracted from their SENNA system."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-76",
"text": "Figure 4 shows an example of how the lexicon features are applied."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-77",
"text": "16 For each lexicon category, we match every n-gram (up to the length of the longest lexicon entry) against entries in the lexicon."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-78",
"text": "A match is successful when the n-gram matches the prefix or suffix of an entry and is at least half the length of the entry."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-79",
"text": "Because of the high potential for spurious matches, for all categories except Person, we discard partial matches less than 2 tokens in length."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-80",
"text": "When there are multiple overlapping matches within the same category, we prefer exact matches over partial matches, and then longer matches over shorter matches, and finally earlier matches in the sentence over later matches."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-81",
"text": "All matches are case insensitive."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-82",
"text": "For each token in the match, the feature is en- coded in BIOES annotation (Begin, Inside, Outside, End, Single), indicating the position of the token in the matched entry."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-83",
"text": "In other words, B will not appear in a suffix-only partial match, and E will not appear in a prefix-only partial match."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-84",
"text": "As we will see in Section 4.5, we found that this more sophisticated method outperforms the method presented by Collobert et al. (2011b) , which treats partial and exact matches equally, allows prefix but not suffix matches, allows very short partial matches, and marks tokens with YES/ NO."
},
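A simplified sketch of this matching scheme for a single lexicon category, under stated assumptions: overlap resolution is reduced to a greedy assignment (exact before partial, longer before shorter, earlier before later), and the Person-specific exemption is modeled by the `min_partial` parameter rather than per-category logic:

```python
def match_type(ngram, entry):
    # A match succeeds when the n-gram equals the entry, or is a prefix or
    # suffix of it covering at least half of its tokens.
    if ngram == entry:
        return "exact"
    n = len(ngram)
    if 2 * n >= len(entry):
        if entry[:n] == ngram:
            return "prefix"
        if entry[-n:] == ngram:
            return "suffix"
    return None

def bioes_labels(n, kind):
    # Exact matches get full B/I/E (or S) labels; a prefix-only partial match
    # never contains E, and a suffix-only partial match never contains B.
    if n == 1:
        return {"exact": ["S"], "prefix": ["B"], "suffix": ["E"]}[kind]
    labels = ["I"] * n
    if kind in ("exact", "prefix"):
        labels[0] = "B"
    if kind in ("exact", "suffix"):
        labels[-1] = "E"
    return labels

def lexicon_features(tokens, lexicon, min_partial=2):
    # Case-insensitive matching; partial matches shorter than min_partial
    # tokens are discarded (the paper keeps short ones only for Person).
    toks = [t.lower() for t in tokens]
    lex = [e.lower().split() for e in lexicon]
    max_n = min(max(len(e) for e in lex), len(toks))
    cands = []
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            for entry in lex:
                kind = match_type(toks[i:i + n], entry)
                if kind and (kind == "exact" or n >= min_partial):
                    cands.append((kind != "exact", -n, i, kind))
    # Prefer exact over partial, then longer, then earlier matches.
    out = ["O"] * len(toks)
    for is_partial, neg_n, i, kind in sorted(cands):
        n = -neg_n
        if all(out[j] == "O" for j in range(i, i + n)):
            out[i:i + n] = bioes_labels(n, kind)
    return out
```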
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-85",
"text": "In addition, since Collobert et al. (2011b) released their lexicon with their SENNA system, we also applied their lexicon to our model for comparison and investigated using both lexicons simultaneously as distinct features."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-86",
"text": "We found that the two lexicons complement each other and improve performance on the CoNLL-2003 dataset."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-87",
"text": "Our best model uses the SENNA lexicon with exact matching and our DBpedia lexicon with partial matching, with BIOES annotation in both cases."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-89",
"text": "**ADDITIONAL CHARACTER-LEVEL FEATURES**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-90",
"text": "A lookup table was used to output a 4-dimensional vector representing the type of the character (upper case, lower case, punctuation, other)."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-92",
"text": "**TRAINING AND INFERENCE**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-96",
"text": "Training and inference are done on a per-sentence level."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-97",
"text": "The initial states of the LSTM are zero vectors."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-98",
"text": "Except for the character and word embeddings whose initialization has been described previously, all lookup tables are randomly initialized with values drawn from the standard normal distribution."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-100",
"text": "**OBJECTIVE FUNCTION AND INFERENCE**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-101",
"text": "We train our network to maximize the sentencelevel log-likelihood from Collobert et al. (2011b) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-102",
"text": "17 First, we define a tag-transition matrix A where A i,j represents the score of jumping from tag i to tag j in successive tokens, and A 0,i as the score for starting with tag i. This matrix of parameters are also learned."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-103",
"text": "Define \u03b8 as the set of parameters for the neural network, and \u03b8 = \u03b8 \u222a {A i,j \u2200i, j} as the set of all parameters to be trained."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-104",
"text": "Given an example sentence, [x] T 1 , of length T , and define [f \u03b8 ] i,t as the score outputted by the neural network for the t th word and i th tag given parameters \u03b8, then the score of a sequence of tags [i] T 1 is given as the sum of network and transition scores: Then, letting [y] T 1 be the true tag sequence, the sentence-level log-likelihood is obtained by normalizing the above score over all possible tag-sequences [j] T 1 using a softmax:"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-105",
"text": "This objective function and its gradients can be efficiently computed by dynamic programming (Collobert et al., 2011b) ."
},
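This dynamic program can be sketched in log space as follows; this is a toy illustration of the forward recursion, not the authors' implementation, and all scores are assumed given:

```python
import numpy as np

def sentence_log_likelihood(f, A, A0, tags):
    # f[i, t]: network score for tag i at position t (shape K x T);
    # A[i, j]: learned transition score from tag i to tag j;
    # A0[i]: score for starting with tag i; tags: gold tag sequence.
    K, T = f.shape
    # Score of the gold path: network scores plus transition scores.
    gold = A0[tags[0]] + f[tags[0], 0]
    for t in range(1, T):
        gold += A[tags[t - 1], tags[t]] + f[tags[t], t]
    # Log-partition over all K^T tag sequences via the forward recursion.
    alpha = A0 + f[:, 0]
    for t in range(1, T):
        alpha = np.logaddexp.reduce(alpha[:, None] + A + f[:, t][None, :], axis=0)
    return gold - np.logaddexp.reduce(alpha)

# With all scores zero, every one of the 2^2 = 4 tag sequences is equally
# likely, so the log-likelihood of any sequence is -log(4).
ll = sentence_log_likelihood(np.zeros((2, 2)), np.zeros((2, 2)), np.zeros(2), [0, 1])
```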
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-106",
"text": "At inference time, given neural network outputs [f \u03b8 ] i,t we use the Viterbi algorithm to find the tag sequence [i] T 1 that maximizes the score"
},
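Viterbi decoding over these scores can be sketched as below; a minimal version assuming the network scores f and the learned transition/start scores are given, without the end-of-sentence transition some formulations add:

```python
import numpy as np

def viterbi(f, A, A0):
    # f[i, t]: network score for tag i at position t; A and A0 are the
    # learned transition and start scores from the objective above.
    K, T = f.shape
    delta = A0 + f[:, 0]                 # best score ending in each tag at t=0
    back = np.zeros((T, K), dtype=int)   # argmax backpointers
    for t in range(1, T):
        scores = delta[:, None] + A + f[:, t][None, :]   # (prev tag, cur tag)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    # Follow backpointers from the best final tag.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())

# Toy example: tag 0 scores highest at step 0 and tag 1 at step 1,
# with zero transition scores, so the best path is [0, 1] with score 2.
best_path, best_score = viterbi(np.array([[1.0, 0.0], [0.0, 1.0]]),
                                np.zeros((2, 2)), np.zeros(2))
```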
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-107",
"text": "The output tags are annotated with BIOES (which stand for Begin, Inside, Outside, End, Single, indicating the position of the token in the 18 OntoNotes results taken from (Durrett and Klein, 2014) 19 Evaluation on OntoNotes 5.0 done by Pradhan et al. (2013) 20 Not directly comparable as they evaluated on an earlier version of the corpus with a different data split."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-108",
"text": "21 Numbers taken from the original paper (Luo et al., 2015) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-109",
"text": "While the precision, recall, and F1 scores are clearly inconsistent, it is unclear in which way they are incorrect."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-110",
"text": "entity) as this scheme has been reported to outperform others such as BIO (Ratinov and Roth, 2009 )."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-112",
"text": "**LEARNING ALGORITHM**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-113",
"text": "Training is done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-114",
"text": "Each mini-batch consists of multiple sentences with the same number of tokens."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-115",
"text": "We found applying dropout to the output nodes 22 of each LSTM layer (Pham et al., 2014) was quite effective in reducing overfitting (Section 4.4)."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-116",
"text": "We explored other more sophisticated optimization algorithms such as momentum (Nesterov, 1983) , AdaDelta (Zeiler, 2012), and RM-SProp (Hinton et al., 2012) , and in preliminary experiments they did not improve upon plain SGD."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-118",
"text": "**EVALUATION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-119",
"text": "Evaluation was performed on the well-established CoNLL-2003 NER shared task dataset (Tjong Kim Sang and De Meulder, 2003) and the much larger but less-studied OntoNotes 5.0 dataset (Hovy et al., 2006; Pradhan et al., 2013) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-120",
"text": "Table 2 gives an overview of these two different datasets."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-121",
"text": "For each experiment, we report the average and standard deviation of 10 successful trials."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-123",
"text": "**DATASET PREPROCESSING**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-124",
"text": "For all datasets, we performed the following preprocessing:"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-125",
"text": "\u2022 All digit sequences are replaced by a single \"0\"."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-126",
"text": "\u2022 Before training, we group sentences by word length into mini-batches and shuffle them."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-127",
"text": "In addition, for the OntoNotes dataset, in order to handle the Date, Time, Money, Percent, Quantity, Ordinal, and Cardinal named entity tags, we split tokens before and after every digit."
},
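These preprocessing rules can be sketched with two small helpers; the single-digit splitting in `split_on_digits` is a literal reading of "split tokens before and after every digit", which is an assumption about the exact granularity:

```python
import re

def normalize_digits(token):
    # Replace every maximal run of digits with a single "0".
    return re.sub(r"[0-9]+", "0", token)

def split_on_digits(token):
    # OntoNotes only: split the token before and after every digit,
    # keeping each digit as its own token (assumed granularity).
    return [piece for piece in re.split(r"([0-9])", token) if piece]
```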
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-129",
"text": "**CONLL 2003 DATASET**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-130",
"text": "The CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four types of named entities: location, organization, person, and miscellaneous."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-131",
"text": "As the dataset is small compared to OntoNotes, we trained the model on both the training and development sets after performing hyperparameter optimization on the development set."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-133",
"text": "**ONTONOTES 5.0 DATASET**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-134",
"text": "Pradhan et al. (2013) compiled a core portion of the OntoNotes 5.0 dataset for the CoNLL-2012 shared task and described a standard train/dev/test split, which we use for our evaluation."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-135",
"text": "Following Durrett and Klein (2014) , we applied our model to the portion of the dataset with gold-standard named entity annotations; the New Testaments portion was excluded for lacking gold-standard annotations."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-136",
"text": "This dataset is much larger than CoNLL-2003 and consists of text from a wide variety of sources, such as broadcast conversation, broadcast news, newswire, magazine, telephone conversation, and Web text."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-138",
"text": "**HYPER-PARAMETER OPTIMIZATION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-139",
"text": "We performed two rounds of hyper-parameter optimization and selected the best settings based on development set performance 23 ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-140",
"text": "Table 3 shows the final hyper-parameters, and Table 4 shows the dev set performance of the best models in each round."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-141",
"text": "In the first round, we performed random search and selected the best hyper-parameters over the development set of the CoNLL-2003 data."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-142",
"text": "We evaluated around 500 hyper-parameter settings."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-143",
"text": "Then, we took the same settings and tuned the learning rate and epochs on the OntoNotes development set."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-144",
"text": "24 For the second round, we performed independent hyper-parameter searches on each dataset using Optunity's implementation of particle swarm (Claesen et al., ) , as there is some evidence that it is more efficient than random search (Clerc and Kennedy, 2002) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-145",
"text": "We evaluated 500 hyper-parameter settings this round as well."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-146",
"text": "As we later found out that training fails occasionally (Section 3.5) as well as large variation from run to run, we ran the top 5 settings from each dataset for 10 trials each and selected the best one based on averaged dev set performance."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-147",
"text": "For CoNLL-2003, we found that particle swarm produced better hyper-parameters than random search."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-148",
"text": "However, surprisingly for OntoNotes particle swarm was unable to produce better hyperparameters than those from the ad-hoc approach in round 1."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-149",
"text": "We also tried tuning the CoNLL-2003 hyper-parameters from round 2 for OntoNotes and that was not any better 25 either."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-150",
"text": "We trained CoNLL-2003 models for a large num- 23 Hyper-parameter optimization was done with the BLSTM-CNN + emb + lex feature set, as it had the best performance."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-151",
"text": "24 Selected based on dev set performance of a few runs."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-152",
"text": "25 The result is 84.41 (\u00b1 0.33) on the OntoNotes dev set."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-153",
"text": "Table 7 : F1 scores when the Collobert word vectors are replaced."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-154",
"text": "We tried 50-and 300-dimensional random vectors (Random 50d, Random 300d); GloVe's released vectors trained on 6 billion words (GloVe 6B 50d, GloVe 6B 300d); Google's released 300-dimensional vectors trained on 100 billion words from Google News (Google 100B 300d); and 50-dimensional GloVe and word2vec skip-gram vectors that we trained on Wikipedia and Reuters RCV-1 (Our GloVe 50d, Our Skip-gram 50d)."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-155",
"text": "ber of epochs because we observed that the models did not exhibit overtraining and instead continued to slowly improve on the development set long after reaching near 100% accuracy on the training set."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-156",
"text": "In contrast, despite OntoNotes being much larger than CoNLL-2003, training for more than about 18 epochs causes performance on the development set to decline steadily due to overfitting."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-157",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-158",
"text": "**EXCLUDING FAILED TRIALS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-159",
"text": "On the CoNLL-2003 dataset, while BLSTM models completed training without difficulty, the BLSTM-CNN models fail to converge around 5\u223c10% of the time depending on feature set."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-160",
"text": "Similarly, on OntoNotes, 1.5% of trials fail."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-161",
"text": "We found that using a lower learning rate reduces failure rate."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-162",
"text": "We also tried clipping gradients and using AdaDelta and both of them were effective at eliminating such failures by themselves."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-163",
"text": "AdaDelta, however, made training more expensive with no gain in model performance."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-164",
"text": "In any case, for all experiments we excluded trials where the final F1 score on a subset of training data falls below a certain threshold, and continued to run trials until we obtained 10 successful ones."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-165",
"text": "For CoNLL-2003, we excluded trials where the final F1 score on the development set was less than 95; there was no ambiguity in selecting the threshold as every trial scored either above 98 or below 90."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-166",
"text": "For OntoNotes, the threshold was a F1 score of 80 on the last 5,000 sentences of the training set; every trial scored either above 80 or below 75."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-168",
"text": "**TRAINING AND TAGGING SPEED**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-169",
"text": "On an Intel Xeon E5-2697 processor, training takes about 6 hours while tagging the test set takes about 12 seconds for CoNLL-2003."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-170",
"text": "The times are 10 hours and 60 seconds respectively for OntoNotes."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-171",
"text": "Table 5 shows the results for all datasets."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-172",
"text": "To the best of our knowledge, our best models have surpassed the previous highest reported F1 scores for both CoNLL-2003 and OntoNotes."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-173",
"text": "In particular, with no external knowledge other than word embeddings, our model is competitive on the CoNLL-2003 dataset and establishes a new state of the art for OntoNotes, suggesting that given enough data, the neural network automatically learns the relevant features for NER without feature engineering."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-175",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-177",
"text": "**COMPARISON WITH FFNNS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-178",
"text": "We re-implemented the FFNN model of Collobert et al. (2011b) as a baseline for comparison."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-179",
"text": "Table 5 shows that while performing reasonably well on CoNLL-2003, FFNNs are clearly inadequate for OntoNotes, which has a larger domain, showing that LSTM models are essential for NER."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-181",
"text": "**CHARACTER-LEVEL CNNS VS. CHARACTER TYPE AND CAPITALIZATION FEATURES**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-182",
"text": "The comparison of models in Table 6 Turian et al. (2010) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-184",
"text": "**WORD EMBEDDINGS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-185",
"text": "One possible reason that Collobert embeddings perform better than other publicly available embeddings on CoNLL-2003 is that they are trained on the Reuters RCV-1 corpus, the source of the CoNLL-2003 dataset, whereas the other embeddings are not 28 ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-186",
"text": "On the other hand, we suspect that Google's embeddings perform poorly because of vocabulary mismatch -in particular, Google's embeddings were trained in a case-sensitive manner, and embeddings for many common punctuations and 27 Wilcoxon rank sum test, p < 0.001 28 To make a direct comparison to Collobert et al. (2011b) , we do not exclude the CoNLL-2003 NER task test data from the word vector training data."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-187",
"text": "While it is possible that this difference could be responsible for the disparate performance of word vectors, the CoNLL-2003 training data comprises only 20k out of 800 million words, or 0.00002% of the total data; in an unsupervised training scheme, the effects are likely negligible."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-188",
"text": "symbols were not provided."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-189",
"text": "To test these hypotheses, we performed experiments with new word embeddings trained using GloVe and word2vec, with vocabulary list and corpus similar to Collobert et."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-190",
"text": "al. (2011b) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-191",
"text": "As shown in Table 7 , our GloVe embeddings improved significantly 29 over publicly available embeddings on CoNLL-2003, and our word2vec skip-gram embeddings improved significantly 30 over Google's embeddings on OntoNotes."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-192",
"text": "Due to time constraints we did not perform new hyper-parameter searches with any of the word embeddings."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-193",
"text": "As word embedding quality depends on hyper-parameter choice during their training (Pennington et al., 2014) , and also, in our NER neural network, hyper-parameter choice is likely sensitive to the type of word embeddings used, optimizing them all will likely produce better results and provide a fairer comparison of word embedding quality."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-194",
"text": "value may not be the best-performing in Table 8 ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-195",
"text": "Table 6 shows that on the CoNLL-2003 dataset, using features from both the SENNA lexicon and our proposed DBpedia lexicon provides a significant 32 improvement and allows our model to clearly surpass the previous state of the art."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-196",
"text": "Unfortunately the difference is minuscule for OntoNotes, most likely because our lexicon does not match DBpedia categories well."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-197",
"text": "Figure 5 shows that on CoNLL-2003, lexicon coverage is reasonable and matches the tags set for everything except the catchall MISC category."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-198",
"text": "For example, LOC entries in lexicon match mostly LOC named entities and vice versa."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-199",
"text": "However, on OntoNotes, the matches are noisy and correspondence between lexicon match and tag category is quite ambiguous."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-200",
"text": "For example, all lexicon categories have spurious matches in unrelated named entities like CARDINAL, and LOC, GPE, and LANGUAGE entities all get a lot of matches from the LOC category in the lexicon."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-201",
"text": "In addition, named entities in categories like NORP, ORG, LAW, PRODUCT receive little coverage."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-202",
"text": "The lower coverage, noise, and ambiguity all contribute to the disappointing performance."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-203",
"text": "This suggests that the DBpedia lexicon construction method needs to be improved."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-204",
"text": "A reasonable place to start would be the DBpedia category to OntoNotes NE tag mappings."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-206",
"text": "**EFFECT OF DROPOUT**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-207",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-208",
"text": "**LEXICON FEATURES**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-209",
"text": "In order to isolate the contribution of each lexicon and matching method, we compare different sources and matching methods on a BLSTM-CNN model with randomly initialized word embeddings and no 32 Wilcoxon rank sum test, p < 0.001."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-210",
"text": "other features or sources of external knowledge."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-211",
"text": "Table 9 shows the results."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-212",
"text": "In this weakened model, both lexicons contribute significant 33 improvements over the baseline."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-213",
"text": "Compared to the SENNA lexicon, our DBpedia lexicon is noisier but has broader coverage, which explains why when applying it using the same method as Collobert et al. (2011b) , it performs worse on CoNLL-2003 but better on OntoNotesa dataset containing many more obscure named entities."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-214",
"text": "However, we suspect that the method of Collobert et al. (2011b) is not noise resistant and therefore unsuitable for our lexicon because it fails to distinguish exact and partial matches 34 and does not set a minimum length for partial matching."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-215",
"text": "35 Instead, when we apply our superior partial matching algorithm and BIOES encoding with our DBpedia lexicon, we gain a significant 36 improvement, allowing our lexicon to perform similarly to the SENNA lexicon."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-216",
"text": "Unfortunately, as we could not reliably remove partial entries from the SENNA lexicon, we were unable to investigate whether or not our lexicon matching method would help in that lexicon."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-217",
"text": "In addition, using both lexicons together as distinct features provides a further improvement 37 on CoNLL-2003, which we suspect is because the lexi- 33 Wilcoxon rank sum test, p < 0.05 for SENNA-Exact-BIOES, p < 0.005 for all others."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-218",
"text": "34 We achieve this by using BIOES encoding and prioritizing exact matches over partial matches."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-219",
"text": "35 Matching only the first word of a long entry is not very useful; this is not a problem in the SENNA lexicon because 99% of its entries contain only 3 tokens or less."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-220",
"text": "36 Wilcoxon rank sum test, p < 0.001."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-221",
"text": "37 Wilcoxon rank sum test, p < 0.001."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-222",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-223",
"text": "**LEXICON**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-224",
"text": "Matching Encoding CoNLL- Table 9 : Comparison of lexicon and matching/encoding methods over the BLSTM-CNN model employing random embeddings and no other features."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-225",
"text": "When using both lexicons, the best combination of matching and encoding is Exact-BIOES for SENNA and Partial-BIOES for DBpedia."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-226",
"text": "Note that the SENNA lexicon already contains \"partial entries\" so exact matching in that case is really just a more primitive form of partial matching."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-227",
"text": "cons are complementary; the SENNA lexicon is relatively clean and tailored to newswire, whereas the DBpedia lexicon is noisier but has high coverage."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-228",
"text": "(Finkel and Manning, 2009; Durrett and Klein, 2014) , likely because we apply a completely different machine learning method."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-229",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-230",
"text": "**ANALYSIS OF ONTONOTES PERFORMANCE**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-231",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-232",
"text": "**RELATED RESEARCH**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-233",
"text": "Named entity recognition is a task with a long history."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-234",
"text": "In this section, we summarize the works we compare with and that influenced our approach."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-235",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-236",
"text": "**NAMED ENTITY RECOGNITION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-237",
"text": "Most recent approaches to NER have been characterized by the use of CRF, SVM, and perceptron models, where performance is heavily dependent on feature engineering."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-238",
"text": "Ratinov and Roth (2009) used non-local features, a gazetteer extracted from 38 We downloaded their publicly released software and model to perform the per-genre evaluation."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-239",
"text": "Wikipedia, and Brown-cluster-like word representations, and achieved an F1 score of 90.80 on CoNLL-2003."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-240",
"text": "Lin and Wu (2009) Training an NER system together with related tasks such as entity linking has recently been shown to improve the state of the art."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-241",
"text": "Durrett and Klein (2014) combined coreference resolution, entity linking, and NER into a single CRF model and added cross-task interaction factors."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-242",
"text": "Their system achieved state of the art results on the OntoNotes dataset, but they did not evaluate on the CoNLL-2003 dataset due to lack of coreference annotations."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-243",
"text": "Luo et al. (2015) achieved state of the art results on CoNLL-2003 by training a joint model over the NER and entity linking tasks, the pair of tasks whose interdependencies contributed the most to the work of Durrett and Klein (2014) ."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-244",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-245",
"text": "**NER WITH NEURAL NETWORKS**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-246",
"text": "While many approaches involve CRF models, there has also been a long history of research involving neural networks."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-247",
"text": "Early attempts were hindered by Table 10 : Per genre F1 scores on OntoNotes."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-248",
"text": "BC = broadcast conversation, BN = broadcast news, MZ = magazine, NW = newswire, TC = telephone conversation, WB = blogs and newsgroups lack of computational power, scalable learning algorithms, and high quality word embeddings."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-249",
"text": "Petasis et al. (2000) used a feed-forward neural network with one hidden layer on NER and achieved state-of-the-art results on the MUC6 dataset."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-250",
"text": "Their approach used only POS tag and gazetteer tags for each word, with no word embeddings."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-251",
"text": "Hammerton (2003) attempted NER with a singledirection LSTM network and a combination of word vectors trained using self-organizing maps and context vectors obtained using principle component analysis."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-252",
"text": "However, while our method optimizes loglikelihood and uses softmax, they used a different output encoding and optimized an unspecified objective function."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-253",
"text": "Hammerton's (2003) reported results were only slightly above baseline models."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-254",
"text": "Much later, with the advent of neural word embeddings, Collobert et al. (2011b) presented SENNA, which employs a deep FFNN and word embeddings to achieve near state of the art results on POS tagging, chunking, NER, and SRL."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-255",
"text": "We build on their approach, sharing the word embeddings, feature encoding method, and objective functions."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-256",
"text": "Recently, Santos et al. (2015) presented their CharWNN network, which augments the neural network of Collobert et al. (2011b) with character level CNNs, and they reported improved performance on Spanish and Portuguese NER."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-257",
"text": "We have successfully incorporated character-level CNNs into our model."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-258",
"text": "There have been various other similar architecture proposed for various sequential labeling NLP tasks."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-259",
"text": "used a BLSTM for the POS-tagging, chunking, and NER tasks, but they employed heavy feature engineering instead of using a CNN to automatically extract characterlevel features."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-260",
"text": "Labeau et al. (2015) used a BRNN with character-level CNNs to perform German POStagging; our model differs in that we use the more powerful LSTM unit, which we found to perform better than RNNs in preliminary experiments, and that we employ word embeddings, which is much more important in NER than in POS tagging."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-261",
"text": "Ling et al. (2015) used both word-and character-level BLSTMs to establish the current state of the art for English POS tagging."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-262",
"text": "While using BLSTMs instead of CNNs allows extraction of more sophisticated character-level features, we found in preliminary experiments that for NER it did not perform significantly better than CNNs and was substantially more computationally expensive to train."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-263",
"text": "----------------------------------"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-264",
"text": "**CONCLUSION**"
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-265",
"text": "We have shown that our neural network model, which incorporates a bidirectional LSTM and a character-level CNN and which benefits from robust training through dropout, achieves state-of-the-art results in named entity recognition with little feature engineering."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-266",
"text": "Our model improves over previous best reported results on two major datasets for NER, suggesting that the model is capable of learning complex relationships from large amounts of data."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-267",
"text": "Preliminary evaluation of our partial matching lexicon algorithm suggests that performance could be further improved through more flexible application of existing lexicons."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-268",
"text": "Evaluation of existing word embeddings suggests that the domain of training data is as important as the training algorithm."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-269",
"text": "More effective construction and application of lexicons and word embeddings are areas that require more research."
},
{
"sent_id": "64b344bf8ec9b6a113bf6b3f638528-C001-270",
"text": "In the future, we would also like to extend our model to perform similar tasks such as extended tagset NER and entity linking."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-11"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-12"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-105"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-254"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-256"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-11",
"64b344bf8ec9b6a113bf6b3f638528-C001-12",
"64b344bf8ec9b6a113bf6b3f638528-C001-105",
"64b344bf8ec9b6a113bf6b3f638528-C001-254",
"64b344bf8ec9b6a113bf6b3f638528-C001-256"
]
},
"@MOT@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-12",
"64b344bf8ec9b6a113bf6b3f638528-C001-13",
"64b344bf8ec9b6a113bf6b3f638528-C001-14",
"64b344bf8ec9b6a113bf6b3f638528-C001-15"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-32"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-12",
"64b344bf8ec9b6a113bf6b3f638528-C001-32"
]
},
"@USE@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-28"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-52"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-55"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-67"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-85"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-101"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-178"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-186"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-213"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-28",
"64b344bf8ec9b6a113bf6b3f638528-C001-52",
"64b344bf8ec9b6a113bf6b3f638528-C001-55",
"64b344bf8ec9b6a113bf6b3f638528-C001-67",
"64b344bf8ec9b6a113bf6b3f638528-C001-85",
"64b344bf8ec9b6a113bf6b3f638528-C001-101",
"64b344bf8ec9b6a113bf6b3f638528-C001-178",
"64b344bf8ec9b6a113bf6b3f638528-C001-186",
"64b344bf8ec9b6a113bf6b3f638528-C001-213"
]
},
"@DIF@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-28",
"64b344bf8ec9b6a113bf6b3f638528-C001-29"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-84"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-28",
"64b344bf8ec9b6a113bf6b3f638528-C001-84"
]
},
"@SIM@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-85",
"64b344bf8ec9b6a113bf6b3f638528-C001-86"
],
[
"64b344bf8ec9b6a113bf6b3f638528-C001-256",
"64b344bf8ec9b6a113bf6b3f638528-C001-257"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-85",
"64b344bf8ec9b6a113bf6b3f638528-C001-256"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-214"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-214"
]
},
"@EXT@": {
"gold_contexts": [
[
"64b344bf8ec9b6a113bf6b3f638528-C001-254",
"64b344bf8ec9b6a113bf6b3f638528-C001-255"
]
],
"cite_sentences": [
"64b344bf8ec9b6a113bf6b3f638528-C001-254"
]
}
}
},
"ABC_7bdb51a3ca6c322ef6e04d18ba8483_2": {
"x": [
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-39",
"text": "**RELATED WORK**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-151",
"text": "VQA-CP v2."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-2",
"text": "Abstract."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-3",
"text": "Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing which would be prohibitive to perform on all sensory inputs."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-4",
"text": "In computer vision, however, there has been relatively little exploration of hard attention, where some information is selectively ignored, in spite of the success of soft attention, where information is re-weighted and aggregated, but never filtered out."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-5",
"text": "Here, we introduce a new approach for hard attention and find it achieves very competitive performance on a recently-released visual question answering datasets, equalling and in some cases surpassing similar soft attention architectures while entirely ignoring some features."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-6",
"text": "Even though the hard attention mechanism is thought to be non-differentiable, we found that the feature magnitudes correlate with semantic relevance, and provide a useful signal for our mechanism's attentional selection criterion."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-7",
"text": "Because hard attention selects important features of the input information, it can also be more efficient than analogous soft attention mechanisms."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-8",
"text": "This is especially important for recent approaches that use non-local pairwise operations, whereby computational and memory costs are quadratic in the size of the set of features."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-11",
"text": "Visual attention is instrumental to many aspects of complex visual reasoning in humans [1, 2] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-12",
"text": "For example, when asked to identify a dog's owner among a group of people, the human visual system adaptively allocates greater computational resources to processing visual information associated with the dog and potential owners, versus other aspects of the scene."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-13",
"text": "The perceptual effects can be so dramatic that prominent entities may not even rise to the level of awareness when the viewer is attending to other things in the scene [3, 4, 5] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-14",
"text": "Yet attention has not been a transformative force in computer vision, possibly because many standard computer vision tasks like detection, segmentation, and classification do not involve the sort of complex reasoning which attention is thought to facilitate."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-15",
"text": "Answering detailed questions about an image is a type of task which requires more sophisticated patterns of reasoning, and there has been a rapid recent proliferation of computer vision approaches for tackling the visual question answering (Visual QA) task [6, 7] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-16",
"text": "Successful Visual QA architectures must be able Given a natural image and a textual question as input, our Visual QA architecture outputs an answer."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-17",
"text": "It uses a hard attention mechanism that selects only the important visual features for the task for further processing."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-18",
"text": "We base our architecture on the premise that the norm of the visual features correlates with their relevance, and that those feature vectors with high magnitudes correspond to image regions which contain important semantic content."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-19",
"text": "to handle many objects and their complex relations while also integrating rich background knowledge, and attention has emerged as a promising strategy for achieving good performance [7, 8, 9, 10, 11, 12, 13, 14] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-20",
"text": "We recognize a broad distinction between types of attention in computer vision and machine learning -soft versus hard attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-21",
"text": "Existing attention models [7, 8, 9, 10] are predominantly based on soft attention, in which all information is adaptively re-weighted before being aggregated."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-22",
"text": "This can improve accuracy by isolating important information and avoiding interference from unimportant information."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-23",
"text": "Learning becomes more data efficient as the complexity of the interactions among different pieces of information reduces; this, loosely speaking, allows for more unambiguous credit assignment."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-24",
"text": "By contrast, hard attention, in which only a subset of information is selected for further processing, is much less widely used."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-25",
"text": "Like soft attention, it has the potential to improve accuracy and learning efficiency by focusing computation on the important parts of an image. But beyond this, it offers better computational efficiency because it only fully processes the information deemed most relevant."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-26",
"text": "However, there is a key downside of hard attention within a gradientbased learning framework, such as deep learning: because the choice of which information to process is discrete and thus non-differentiable, gradients cannot be backpropagated into the selection mechanism to support gradient-based optimization."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-27",
"text": "There have been various efforts to address this shortcoming in visual attention [15] , attention to text [16] , and more general machine learning domains [17, 18, 19] , but this is still a very active area of research."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-28",
"text": "Here we explore a simple approach to hard attention that bootstraps on an interesting phenomenon [20] in the feature representations of convolutional neural networks (CNNs): learned features often carry an easily accessible signal for hard attentional selection."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-29",
"text": "In particular, selecting those feature vectors with the greatest L2-norm values proves to be a heuristic that can facilitate hard attention -and provide the performance and efficiency benefits associated with -without requiring specialized learning procedures (see Figure 1 )."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-30",
"text": "This attentional signal results indirectly from a standard supervised task loss, and does not require explicit supervision to incentivize norms to be proportional to object presence, salience, or other potentially meaningful measures [20, 21] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-31",
"text": "We rely on a canonical Visual QA pipeline [7, 9, 22, 23, 24, 25] augmented with a hard attention mechanism that uses the L2-norms of the feature vectors to select subsets of the information for further processing."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-32",
"text": "The first version, called the Hard Attention Network (HAN), selects a fixed number of feature vectors by choosing those with the top norms."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-33",
"text": "The second version, called the Adaptive Hard Attention Network (AdaHAN), selects a variable number of feature vectors that depends on the input."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-34",
"text": "Our results show that our algorithm can actually outperform comparable soft attention architectures on a challenging Visual QA task."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-35",
"text": "This approach also produces interpretable hard attention masks, where the image regions which correspond to the selected features often contain semantically meaningful information, such as coherent objects."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-36",
"text": "We also show strong performance when combined with a form of non-local pairwise model [26, 25, 27, 28] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-37",
"text": "This algorithm computes features over pairs of input features and thus scale quadratically with number of vectors in the feature map, highlighting the importance of feature selection."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-40",
"text": "Visual question answering, or more broadly the Visual Turing Test, asks \"Can machines understand a visual scene only from answering questions?\" [6, 23, 29, 30, 31, 32] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-41",
"text": "Creating a good Visual QA dataset has proved non-trivial: biases in the early datasets [6, 22, 23, 33] rewarded algorithms for exploiting spurious correlations, rather than tackling the reasoning problem head-on [7, 34, 35] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-42",
"text": "Thus, we focus on the recently-introduced VQA-CP [7] and CLEVR [34] datasets, which aim to reduce the dataset biases, providing a more difficult challenge for rich visual reasoning."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-43",
"text": "One of the core challenges of Visual QA is the problem of grounding language: that is, associating the meaning of a language term with a specific perceptual input [36] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-44",
"text": "Many works have tackled this problem [37, 38, 39, 40] , enforcing that language terms be grounded in the image."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-45",
"text": "In contrast, our algorithm does not directly use correspondence between modalities to enforce such grounding but instead relies on learning to find a discrete representation that captures the required information from the raw visual input, and question-answer pairs."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-46",
"text": "The most successful Visual QA architectures build multimodal representations with a combined CNN+LSTM architecture [22, 33, 41] , and recently have begun including attention mechanisms inspired by soft and hard attention for image captioning [42] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-47",
"text": "However, only soft attention is used in the majority of Visual QA works [7, 8, 9, 10, 11, 12, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-48",
"text": "In these architectures, a full-frame CNN representation is used to compute a spatial weighting (attention) over the CNN grid cells."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-49",
"text": "The visual representation is then the weighted-sum of the input tensor across space."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-50",
"text": "The alternative is to select CNN grid cells in a discrete way, but due to many challenges in training non-differentiable architectures, such hard attention alternatives are severely under-explored."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-51",
"text": "Notable exceptions include [6, 13, 14, 53, 54, 55] , but these run state-of-the-art object detectors or proposals to compute the hard attention maps."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-52",
"text": "We argue that relying on such external tools is fundamentally limited: it requires costly annotations and cannot easily adapt to new visual concepts that were not labeled during detector training."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-53",
"text": "Outside Visual QA and captioning, some prior work in vision has explored limited forms of hard attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-54",
"text": "One line of work on discriminative patches builds a representation by selecting some patches and ignoring others, which has proved useful for object detection and classification [56, 57, 58] , and especially visualization [59] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-55",
"text": "However, such methods have recently been largely supplanted by end-to-end feature learning for practical vision problems."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-56",
"text": "In deep learning, spatial transformers [60] are one method for selecting an image region while ignoring the rest, although these have proved challenging to train in practice."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-57",
"text": "Recent work on compressing neural networks (e.g. [61] ) uses magnitudes to remove weights of neural networks."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-58",
"text": "However, it prunes permanently based on weight magnitudes rather than dynamically based on activation norms, and has no direct connection to hard attention or Visual QA."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-59",
"text": "Attention has also been studied outside of vision."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-60",
"text": "While soft attention predominates in these works as well, there are a few examples of hard attention mechanisms and other forms of discrete gating [15, 16, 17, 18, 19] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-61",
"text": "In such works, the decision of where to look is treated as a discrete variable, optimized either with a REINFORCE-style loss or with various other approximations (e.g. straight-through)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-62",
"text": "However, due to the high variance of these gradients, learning can be inefficient, and soft attention mechanisms usually perform better."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-64",
"text": "**METHOD**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-65",
"text": "Answering questions about images is often formulated in terms of predictive models [24] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-66",
"text": "These architectures maximize a conditional distribution over answers a, given questions q and images x: \u00e2 := argmax_{a \u2208 A} p(a | x, q),"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-67",
"text": "where A is a countable set of all possible answers."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-68",
"text": "As is common in question answering [7, 9, 22, 23, 24] , the question is a sequence of words q = [q_1, ..., q_n], while the output is reduced to a classification problem over a set of common answers (this is limited compared to approaches that generate answers [41] , but works better in practice)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-69",
"text": "Our architecture for learning a mapping from image and question, to answer, is shown in Figure 2 ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-70",
"text": "We encode the image with a CNN [62] (in our case, a pre-trained ResNet-101 [63] , or a small CNN trained from scratch), and encode the question to a fixed-length vector representation with an LSTM [64] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-71",
"text": "We compute a combined representation by copying the question representation to every spatial location in the CNN, and concatenating it with (or simply adding it to) the visual features, like previous work [7, 9, 22, 23, 24, 25] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-72",
"text": "Figure 2 caption: Otherwise, we follow the canonical Visual QA pipeline [7, 9, 22, 23, 24, 25] . Questions and images are encoded into their vector representations. Next, the spatial encoding of the visual features is unraveled, and the question embedding is broadcast and concatenated (or added) accordingly to form a multimodal representation of the inputs. Our attention mechanism selectively chooses a subset of the multimodal vectors, which are then aggregated and processed by the answering module."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-76",
"text": "After a few layers of combined processing, we apply attention over spatial locations, following previous works which often apply soft attention mechanisms [7, 8, 9, 10] at this point in the architecture."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-77",
"text": "Finally, we aggregate features, using either sum-pooling, or relational [25, 27, 65] modules."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-78",
"text": "We train the whole network end-to-end with a standard logistic regression loss over answer categories."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-80",
"text": "**ATTENTION MECHANISMS**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-81",
"text": "Here, we describe prior work on soft attention, and our approach to hard attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-82",
"text": "Soft Attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-83",
"text": "In most prior work, soft attention is implemented as a weighted mask over the spatial cells of the CNN representation."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-84",
"text": "Let x\u0302 := CNN(x), q\u0302 := LSTM(q) for image x and question q. We compute a weight w_ij for every cell x\u0302_ij (where i and j index spatial locations), using a neural network that takes x\u0302_ij and q\u0302 as input."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-85",
"text": "Intuitively, the weight w_ij measures the \"relevance\" of the cell to the input question."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-86",
"text": "w is nonnegative and normalized to sum to 1 across the image (generally with softmax)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-87",
"text": "Thus, w is applied to the visual input via \u0125_ij := w_ij x\u0302_ij to build the multimodal representation."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-88",
"text": "This approach has some advantages, including conceptual simplicity and differentiability."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-89",
"text": "The disadvantage is that the weights, in practice, are never 0."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-90",
"text": "Irrelevant background can affect the output, no features can be dropped from potential further processing, and credit assignment is still challenging."
},
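The soft-attention pooling described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the exact networks from the cited papers: the scoring function is reduced to a dot product standing in for the small neural network that scores each cell against the question, and all shapes are toy sizes.

```python
import numpy as np

def soft_attention_pool(x, q, score_fn):
    """x: (h, w, d) CNN grid; q: (d_q,) question vector.
    score_fn maps (cell, question) -> scalar relevance."""
    h, w, _ = x.shape
    scores = np.array([[score_fn(x[i, j], q) for j in range(w)]
                       for i in range(h)])
    # softmax over the whole grid: nonnegative weights summing to 1
    e = np.exp(scores - scores.max())
    att = e / e.sum()
    # weighted sum of cells across space (h_ij := w_ij * x_ij, then summed)
    return np.tensordot(att, x, axes=([0, 1], [0, 1]))

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 10, 8))   # toy CNN feature map
q = rng.normal(size=(8,))          # toy question embedding
pooled = soft_attention_pool(x, q, lambda cell, qv: float(cell @ qv))
assert pooled.shape == (8,)
```

Note how no weight is ever exactly zero: every cell, relevant or not, contributes something to the pooled vector, which is exactly the drawback discussed above.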
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-91",
"text": "Hard Attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-92",
"text": "Our main contribution is a new mechanism for hard attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-93",
"text": "It produces a binary mask over spatial locations, which determines which features are passed on to further processing."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-94",
"text": "We call our method the Hard Attention Network (HAN)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-95",
"text": "The key idea is to use the L2-norm of the activations at each spatial location as a proxy for relevance at that location."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-96",
"text": "The correlation between L2-norm and relevance is an emergent property of the trained CNN features, which requires no additional constraints or objectives."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-97",
"text": "[20] recently found something related: in an ImageNet-pretrained representation of an image of a cat and a dog, the largest feature norms appear above the cat and dog face, even though the representation was trained purely for classification."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-98",
"text": "Our architecture bootstraps on this phenomenon without explicitly training the network to have it."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-99",
"text": "As above, let x_ij and q be a CNN cell at spatial position (i, j) and a question representation, respectively."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-100",
"text": "We first embed q \u2208 R^{d_q} and x_ij \u2208 R^{d_x} into two feature spaces that share the same dimensionality d, i.e., x\u0302_ij := CNN_{1\u00d71}(x_ij) and q\u0302 := MLP(q),"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-101",
"text": "where CNN_{1\u00d71} stands for a 1 \u00d7 1 convolutional network and MLP stands for a multilayer perceptron."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-102",
"text": "We then combine the convolutional image features with the question features into a shared multimodal embedding by first broadcasting the question features to match the w \u00d7 h \u00d7 d shape of the image feature map, and then performing element-wise addition (1x1 conv net/MLP in Figure 2 ):"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-103",
"text": "Element-wise addition keeps the dimensionality of each input, as opposed to concatenation, yet is still effective [12, 24] ."
},
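A minimal sketch of this fusion step, under stated simplifications: layer sizes are illustrative (not the paper's), the 1×1 convolution is written as the per-cell matrix multiplication it is equivalent to, and the question MLP is collapsed to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d_x, d_q, d = 10, 10, 64, 32, 16   # illustrative sizes, not the paper's
x = rng.normal(size=(h, w, d_x))         # CNN feature map
q = rng.normal(size=(d_q,))              # LSTM question embedding

# A 1x1 convolution is a shared linear map applied independently at every
# cell; the question MLP is collapsed to one linear layer for brevity.
W_img = rng.normal(size=(d_x, d)) / np.sqrt(d_x)
W_q = rng.normal(size=(d_q, d)) / np.sqrt(d_q)

x_emb = x @ W_img                        # (h, w, d)
q_emb = q @ W_q                          # (d,)

# broadcast the question over all spatial cells and add element-wise
m = x_emb + q_emb
assert m.shape == (h, w, d)
```

The broadcast-and-add keeps the per-cell dimensionality at d, whereas concatenation would double it.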
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-104",
"text": "Next, we compute the presence vector p \u2208 R^{w\u00d7h}, which measures the relevance of each entity given the question: p_ij := ||m_ij||_2,"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-106",
"text": "where ||\u00b7||_2 denotes the L2-norm."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-107",
"text": "To select k entities from m for further processing, we take the indices l of the top k entries in p and gather the corresponding features m^l \u2208 R^{k\u00d7d} ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-109",
"text": "This set of features is passed to the decoder module, and gradients flow back to the weights of the CNN/MLP only through the selected features."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-110",
"text": "Our assumption is that important outputs of the CNN/MLP will tend to grow in norm, and therefore are likely to be selected."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-111",
"text": "Intuitively, if less useful features are selected, the gradients will push the norm of these features down, making them less likely to be selected again; however, nothing in our framework explicitly incorporates this behavior into a loss."
},
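The selection rule itself reduces to a few lines: compute the L2-norm of every multimodal cell and keep the top k. A minimal NumPy sketch (shapes are illustrative, and the surrounding training loop is omitted):

```python
import numpy as np

def han_select(m, k):
    """m: (h, w, d) multimodal map -> (k, d) hard-attended features."""
    h, w, d = m.shape
    flat = m.reshape(h * w, d)
    p = np.linalg.norm(flat, axis=1)   # presence vector: one L2-norm per cell
    idx = np.argsort(-p)[:k]           # indices of the k largest norms
    return flat[idx]                   # only these features flow onward

rng = np.random.default_rng(0)
m = rng.normal(size=(10, 10, 16))
sel = han_select(m, 16)
assert sel.shape == (16, 16)
```

Although the top-k selection is non-differentiable, the selected features themselves are ordinary tensor entries, so gradients reach the upstream network through them without any REINFORCE-style estimator.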
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-112",
"text": "Despite its simplicity, our experiments (Section 4) show the HAN is very competitive with canonical soft attention [9] while also offering interpretability and efficiency."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-113",
"text": "Thus far, we have assumed that we can fix the number of features k that are passed through the attention mechanism."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-114",
"text": "However, it is likely that different questions require different spatial support within the image."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-115",
"text": "Thus, we also introduce a second approach which adaptively chooses the number of entities to attend to (termed Adaptive-HAN, or AdaHAN) as a function of the inputs, rather than using a fixed k. The key idea is to make the presence vector p (the norm of the embedding at each spatial location) \"compete\" against a threshold \u03c4 ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-116",
"text": "However, since the norm is unbounded from above, to avoid trivial solutions in which the network sets the presence vector very high and selects all entities, we apply a softmax operator to p. We put both parts into competition by selecting only those elements of m whose presence values exceed the threshold, i.e. those (i, j) with softmax(p)_ij > \u03c4."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-117",
"text": "Note that, due to the properties of softmax, the competition is encouraged not only between both sides of the inequality, but also between the spatially distributed elements of the presence vector p. Although \u03c4 could be chosen through hyper-parameter selection, we decide to use \u03c4 := 1/(w\u00b7h), where w and h are the spatial dimensions of the input feature map."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-118",
"text": "Such value for \u03c4 has an interesting interpretation."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-119",
"text": "If each spatial location of the input were equally important, we would sample the locations from a uniform probability distribution p(\u00b7) := \u03c4 = 1/(w\u00b7h)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-120",
"text": "This is equivalent to the probability distribution induced by the presence vector of a network with a uniformly distributed spatial representation, i.e. \u03c4 = softmax(p_uniform); hence the trained network's presence vector p has to \"win\" against the p_uniform of the random network in order to select the right input features, by shifting the probability mass accordingly."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-121",
"text": "It also naturally encourages higher selectivity, as an increase in the probability mass at one location results in a decrease at another."
},
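The adaptive rule can be sketched just as compactly. As above, this is an illustrative sketch with toy shapes: the softmax bounds the presence scores, and every cell whose mass beats the uniform threshold τ = 1/(w·h) survives, so the number of selected cells varies with the input.

```python
import numpy as np

def adahan_select(m):
    """Keep every cell whose softmax presence mass exceeds tau = 1/(w*h)."""
    h, w, d = m.shape
    flat = m.reshape(h * w, d)
    p = np.linalg.norm(flat, axis=1)   # presence vector (unbounded norms)
    e = np.exp(p - p.max())
    probs = e / e.sum()                # bounded scores that must compete
    tau = 1.0 / (h * w)                # the mass each cell gets if all are equal
    return flat[probs > tau]

rng = np.random.default_rng(0)
m = rng.normal(size=(10, 10, 8))
sel = adahan_select(m)
assert 0 < sel.shape[0] < 100          # k is now chosen by the input itself
```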
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-122",
"text": "In contrast to the commonly used soft-attention mechanism, our approaches do not require extra learnable parameters."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-123",
"text": "HAN requires a single extra but interpretable hyper-parameter: a fraction of input cells to use, which trades off speed for accuracy."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-124",
"text": "AdaHAN requires no extra hyper-parameters."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-126",
"text": "**FEATURE AGGREGATION**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-127",
"text": "Sum Pooling."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-128",
"text": "A simple way to reduce the set of feature vectors after attention is to sum pool them into a constant length vector."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-129",
"text": "In the case of a soft attention module with an attention weight vector w, it is straightforward to compute a pooled vector as \u03a3_ij w_ij x\u0302_ij ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-130",
"text": "Given features selected with hard attention, an analogous pooling can be written as \u03a3_{\u03ba=1}^{k} m^{l_\u03ba} ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-131",
"text": "Non-local Pairwise Operator."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-132",
"text": "To improve on sum pooling, we explore an approach which performs reasoning through non-local and pairwise computations, one of a family of similar architectures which have shown promising results for question answering and video understanding [25, 26, 27 ]."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-133",
"text": "An important aspect of these non-local pairwise methods is that the computation is quadratic in the number of features, and thus hard attention can provide significant computational savings."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-134",
"text": "Given some set of embedding vectors (such as the spatial cells of the output of a convolutional layer) x_ij , one can use three simple linear projections to produce, at each spatial location, queries q_ij := W_q x_ij , keys k_ij := W_k x_ij , and values v_ij := W_v x_ij ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-135",
"text": "Then, for each spatial location i, j, we compare the query q ij with the keys at all other locations, and sum the values v weighted by the similarity."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-136",
"text": "Mathematically, we compute x\u0303_ij := \u03a3_{l,k} softmax(q_ij \u00b7 k_lk) v_lk ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-137",
"text": "Here, the softmax operates over all i, j locations."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-138",
"text": "The final representation of the input is computed by summarizing all x\u0303_lk representations; e.g. we use sum-pooling to achieve this goal."
},
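A minimal sketch of this operator over the k hard-attended features, with illustrative sizes and single-head projections (the paper's exact configuration is not reproduced here). The (k, k) similarity matrix makes the quadratic cost, and hence the saving from a small k, explicit.

```python
import numpy as np

def nonlocal_pairwise_pool(feats, W_q, W_k, W_v):
    """feats: (k, d) hard-attended cells -> (d_v,) pooled relational summary."""
    q, k_mat, v = feats @ W_q, feats @ W_k, feats @ W_v
    logits = q @ k_mat.T                            # (k, k) all-pairs scores
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)           # softmax over locations
    tilde_x = att @ v                               # similarity-weighted values
    return tilde_x.sum(axis=0)                      # sum-pool the results

rng = np.random.default_rng(0)
k, d, d_v = 16, 32, 8                               # illustrative sizes
feats = rng.normal(size=(k, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_v)) / np.sqrt(d) for _ in range(3))
pooled = nonlocal_pairwise_pool(feats, W_q, W_k, W_v)
assert pooled.shape == (d_v,)
```

With hard attention selecting k cells out of the full w·h grid, the pairwise comparison shrinks from (w·h)² to k² terms.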
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-139",
"text": "Thus, the mechanism computes non-local [26] pairwise relations between embeddings, independent of spatial or temporal proximity."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-140",
"text": "The separation between keys, queries, and values allows semantic information about each object to remain separated from the information that binds objects together across space."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-141",
"text": "The result is an effective, if somewhat expensive, spatial reasoning mechanism."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-142",
"text": "Although expensive, a similar mechanism has been shown useful in various tasks, from synthetic visual question answering [25] , to machine translation [27] , to video recognition [26] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-143",
"text": "Hard attention can help to reduce the set of comparisons that must be considered, and thus we aim to test whether the features selected by hard attention are compatible with this operator."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-145",
"text": "**RESULTS**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-146",
"text": "To show the importance of hard attention for Visual QA, we first compare HAN to existing soft attention (SAN) architectures on VQA-CP v2, exploring the effect of varying degrees of hard attention by directly controlling the number of attended spatial cells in the convolutional map."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-147",
"text": "We then examine AdaHAN, which adaptively chooses the number of attended cells, and briefly investigate the effect of network depth and pretraining."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-148",
"text": "Finally, we present qualitative results, and also provide results on CLEVR to show the method's generality."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-150",
"text": "**DATASETS**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-152",
"text": "This dataset [7] consists of about 121K (98K) images, 438K (220K) questions, and 4.4M (2.2M) answers in the train (test) set; it is created so that the distribution of answers differs between the train and test splits, and hence models cannot excessively rely on the language prior [7] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-153",
"text": "As expected, [7] show that the performance of all Visual QA approaches they tested drops significantly from the train to the test set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-154",
"text": "The dataset provides a standard train-test split, and also breaks questions into different question types: those where the answer is yes/no, those where the answer is a number, and those where the answer is something else."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-155",
"text": "Thus, we report accuracy on each question type as well as the overall accuracy for each network architecture."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-156",
"text": "CLEVR."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-157",
"text": "This synthetic dataset [34] consists of 100K images of 3D-rendered objects like spheres and cylinders, and roughly 1M questions that were automatically generated with a procedural engine."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-158",
"text": "While the visual task is relatively simple, solving this dataset requires reasoning over complex relationships between many objects."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-160",
"text": "**EFFECT OF HARD ATTENTION**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-161",
"text": "We begin with the most basic hard attention architecture, which applies hard attention and then does sum pooling over the attended cells, followed by a small MLP."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-162",
"text": "For each experiment, we take the top k cells, out of 100, according to our L2-norm criterion, where k ranges from 16 to 100 (with 100, there is no attention, and the whole image is summed)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-163",
"text": "Results are shown in the top of Table 1 ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-164",
"text": "Considering that the hard attention selects only a subset of the input cells, we might expect that the algorithm would lose important information and be unable to recover."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-165",
"text": "In fact, however, the performance is almost the same with less than half of the units attended."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-166",
"text": "Even with just 16 units, the performance loss is less than 1%, suggesting that hard attention is quite capable of capturing the important parts of the image."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-167",
"text": "Table 1 : Comparison between different numbers of attended cells (percentage of the whole input) and aggregation operations."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-168",
"text": "We consider a simple summation, and non-local pairwise computations as the aggregation tool."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-169",
"text": "The fact that hard attention can work is interesting itself, but it should be especially useful for models that devote significant processing to each attended cell."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-170",
"text": "We therefore repeat the above experiment with the non-local pairwise aggregation mechanism described in section 3, which computes activations for every pair of attended cells, and therefore scales quadratically with the number of at-tended cells."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-171",
"text": "These results are shown in the middle of Table 1 , where we can see that hard attention (48 entities) actually boosts performance over an analogous model without hard attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-172",
"text": "Finally, we compare standard soft attention baselines in the bottom of Table 1."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-219",
"text": "In Figure 3 , we show results with our different hard-attention mechanisms (HAN or AdaHAN), and different aggregation operations (summation or pairwise)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-173",
"text": "In particular, we include previous results using a basic soft attention network [7, 9] , as well as our own re-implementation of the soft attention pooling algorithm presented in [7, 9] with the same features used in other experiments."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-174",
"text": "Surprisingly, soft attention does not outperform basic sum pooling, even with a careful implementation that outperforms the previously reported results for the same method on this dataset; in fact, it performs slightly worse."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-175",
"text": "The non-local pairwise aggregation performs better than SAN on its own, although the best result includes hard attention."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-176",
"text": "Our results overall are somewhat worse than the state-of-the-art [7] , but this is likely due to several architectural decisions not included here, such as a split pathway for different kinds of questions, special question embeddings, and the use of the question extractor."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-177",
"text": "Table 2 : Comparison between different adaptive hard-attention techniques with average number of attended parts, and aggregation operation."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-178",
"text": "We consider a simple summation, and the non-local pairwise aggregation."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-179",
"text": "Since AdaHAN adaptively selects relevant features based on the fixed threshold 1/(w\u00b7h), we report here the average number of attended parts."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-181",
"text": "**ADAPTIVE HARD ATTENTION**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-182",
"text": "Thus far, our experiments have dealt with networks that have a fixed threshold for all images."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-183",
"text": "However, some images and questions may require reasoning about more entities than others."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-184",
"text": "Therefore, we explore a simple adaptive method, where the network chooses how many cells to attend to for each image."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-185",
"text": "Table 2 shows results, where AdaHAN refers to our adaptive mechanism."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-186",
"text": "We can see that, on average, the adaptive mechanism uses surprisingly few cells: 25.66 out of 100 when sum pooling is used, and 32.63 when the non-local pairwise aggregation mechanism is used."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-187",
"text": "For sum pooling, this is on-par with a non-adaptive network that uses more cells on average (HAN+sum 32); for the non-local pairwise aggregation mechanism, just 32.63 cells are enough to outperform our best non-adaptive model, which uses roughly 50% more cells."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-188",
"text": "This shows that even very simple methods of adapting hard attention to the image and the question can lead to both computation and performance gains, suggesting that more sophisticated methods will be an important direction for future work."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-190",
"text": "**EFFECTS OF NETWORK DEPTH**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-191",
"text": "In this section, we briefly analyze an important architectural choice: the number of layers used on top of the pretrained embeddings."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-192",
"text": "That is, before the question and image representations are combined, we perform a small amount of processing to \"align\" the information, so that the embedding can easily tell the relevance of the visual information to the question."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-193",
"text": "Table 3 shows the results of removing the two layers which perform this function."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-194",
"text": "We consistently see a drop of about 1% without the layers, suggesting that deciding which cells to attend to requires different information than the classification-tuned ResNet is designed to provide."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-196",
"text": "**IMPLEMENTATION DETAILS.**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-197",
"text": "All our models use the same LSTM of size 512 for question embeddings, and the last convolutional layer of the ImageNet-pretrained ResNet-101 [63] (yielding a 10-by-10 spatial representation with 2048-dimensional cells) for the image embedding."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-198",
"text": "We also use an MLP with 3 layers of sizes 1024, 2048, and 1000 as the classification module."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-199",
"text": "We use ADAM for optimization [66] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-200",
"text": "We use a distributed setting with two workers computing a gradient over a batch of 128 elements each."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-201",
"text": "We normalize images by dividing them by their norm."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-202",
"text": "We do not perform a hyper-parameter search, as there is no separate validation set available."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-203",
"text": "Instead, we choose default hyper-parameters based on our prior experience with Visual QA datasets."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-204",
"text": "We train our models until we observe saturation on the training set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-205",
"text": "Then we evaluate these models on the test set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-206",
"text": "Our tables report the performance of all methods rounded to two decimal digits."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-208",
"text": "Table 1 shows SAN's [9] results reported by [7] together with our in-house implementation (denoted as \"ours\")."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-209",
"text": "Our implementation has 2 attention hops, 1024 dimensional multimodal embedding size, a fixed learning rate 0.0001, and ResNet-101."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-210",
"text": "In these experiments we pool the attended representations by weighted average with the attention weights."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-211",
"text": "Our in-house implementation of the non-local pairwise mechanism strongly resembles the implementations of [26] and [27] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-212",
"text": "We use 2 heads, with embedding size 512."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-213",
"text": "In Equation 2 and Equation 3, we use d := 2048 (the same dimensionality as the image encoding) and two linear layers, each followed by a ReLU."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-214",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-215",
"text": "**QUALITATIVE RESULTS.**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-216",
"text": "One advantage of our formulation is that it is straightforward to visualize the masks of attended cells given questions and images, which we show in Figure 3 and Figure 4 ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-217",
"text": "In general, relevant objects are usually attended, and significant portions of the irrelevant background are suppressed."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-218",
"text": "Although some background might be kept, we hypothesize that context matters in answering some questions."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-222",
"text": "For instance, as row 2, column 3 suggests, the network may answer the question correctly but likely for the wrong reasons."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-223",
"text": "We can also see broad differences between the network architectures."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-224",
"text": "For instance, the sum pooling method (row 2) is much more spatially constrained than the pairwise pooling version (row 1), even though the adaptive attention can select an arbitrarily large region."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-225",
"text": "We hypothesize that more visual features may unnecessarily interfere during the summation, and hence a more spatially sparse representation is preferred, or that sum pooling struggles to integrate across complex scenes."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-226",
"text": "The support is also not always contiguous: non-adaptive hard attention with 16 entities (row 4) in particular distributes its attention widely."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-227",
"text": "In Figure 4 , we show results with our best-performing model on VQA-CP: adaptive hard attention mechanism tied with a non-local, pairwise aggregation mechanism (AdaHAN+pairwise)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-228",
"text": "The qualitative behaviour of this mechanism subsumes various fixed hard-attention variants, and with a variable spatial support tends to be better qualitatively and quantitatively than others."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-229",
"text": "Interestingly, for the question in the 1st row, 1st column, the two attended regions are separated and quite localized."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-230",
"text": "However, for \"Is that an airplane in image?\" (1st row, 2nd column), the attended regions are contiguous and cover almost the whole image."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-231",
"text": "The train in the image (1st row, 3rd column), despite its elongated shape, is quite well captured by our method."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-232",
"text": "Similarly, we can observe that the attended regions overlap with the shape of a boat (1st row, 4th column), even though the method ultimately gets the question wrong."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-233",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-234",
"text": "**END-TO-END TRAINING.**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-235",
"text": "Since our network uses hard attention, which has zero gradients almost everywhere, one might suspect that it will become more difficult to train the lower-level features, or worse, that untrained features might prevent us from bootstrapping the attention mechanism."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-236",
"text": "Therefore, we also trained HAN+sum (with 16% of the input cells) end-to-end together with a relatively small convolutional neural network initialized from scratch."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-237",
"text": "We compare our method against our implementation of the SAN method trained using the same simple convolutional neural network."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-238",
"text": "We call these models simple-SAN and simple-HAN."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-239",
"text": "Analysis."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-240",
"text": "In our experiments, simple-SAN achieves about 21% performance on the test set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-241",
"text": "Surprisingly, simple-HAN+sum achieves about 24% performance on the same split, on par with the performance of the normal SAN that uses a more complex and deeper visual architecture [67] ; the results are reported by [7] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-242",
"text": "This result shows that the hard attention mechanism can indeed be tightly coupled within the training process, and that the whole procedure does not rely heavily on the properties of the ImageNet pre-trained networks."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-243",
"text": "In a sense, we see that a discrete notion of entities also \"emerges\" through the learning process, leading to efficient training."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-245",
"text": "In our experiments we use a simple CNN built of 1 layer with 64 filters and a 7-by-7 filter size, followed by 2 layers with 256 filters and 2 layers with 512 filters, all with a 3-by-3 filter size."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-246",
"text": "We use a stride of 2 for all the layers."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-247",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-248",
"text": "**CLEVR**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-249",
"text": "To demonstrate the generality of our hard attention method, particularly in domains that are visually different from the VQA images, we experiment with a synthetic Visual QA dataset termed CLEVR [34] , using a setup similar to the one used for VQA-CP and [25] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-250",
"text": "Due to the visual simplicity of CLEVR, we follow the work of [25] , and instead of relying on ImageNet pre-trained features, we train our HAN+sum and HAN+RN (hard attention with relation network) architectures end-to-end together with a relatively small CNN (following [25] )."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-251",
"text": "Analysis."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-252",
"text": "As reported in prior work [25, 34] , the soft attention mechanism used in SAN does not perform well on the CLEVR dataset, and achieves only 68.5% [34] (or 76.6% [25] ) performance."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-253",
"text": "In the figure, the second row shows AdaHAN+sum, the third row shows HAN+pairwise with fixed 32 entities, and the last row shows HAN+pairwise with fixed 16 entities, covering 32% and 16% of the input respectively."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-254",
"text": "In the images, attended regions are highlighted while unattended are darkened."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-255",
"text": "Green denotes correct answers, red incorrect ones, and orange partial consensus among the human answers."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-256",
"text": "This figure illustrates various strengths of the proposed methods."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-257",
"text": "Best viewed on a display."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-259",
"text": "In contrast, relation network, which also realizes a non-local and pairwise computational model, essentially solves this task, achieving 95.5% performance on the test set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-260",
"text": "Surprisingly, our HAN+sum achieves 89.7% performance even without a relation network, and HAN+RN (i.e., relation network is used as an aggregation mechanism) achieves 93.9% on the test set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-261",
"text": "These results show that the mechanism can readily be used with other architectures on another dataset with different visuals."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-262",
"text": "Training with HAN requires far less computation than the original relation network [25] , although performance is slightly below relation network's 95.5%."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-263",
"text": "Figure 5a compares computation time: HAN+RN and relation network are trained for 12 hours under the same hyper-parameter set-up."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-264",
"text": "Here, HAN+RN achieves around 90% validation accuracy, whereas relation network reaches only 70%."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-265",
"text": "Owing to hard-attention, we are able to train larger models, which we call HAN+sum + , HAN+RN + , and HAN+RN ++ ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-266",
"text": "These models use larger CNN and LSTM, and HAN+RN ++ also uses higher resolution of the input (see Implementation Details below)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-267",
"text": "The models achieve 94.7%, 96.9% and 98.8% respectively."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-312",
"text": "SAN* denotes the SAN implementation of [25] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-268",
"text": "The relation network with hard attention operates on the k selected cells. Table 4 gives more context regarding the results on the CLEVR dataset that we are aware of, and compares our method with other approaches to answering questions about CLEVR images."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-269",
"text": "Our best-performing method, denoted by HAN+RN ++ , which uses a deeper model and operates on a larger input tensor than the original RN [25] , is very competitive with alternative approaches such as FiLM [11] , TbD [50] , and MAC [49] ; and, as [25] and [50] (TbD+hres) have noted, increasing the spatial resolution definitely helps in achieving better performance."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-270",
"text": "As we can see in Table 4 , all the approaches seem to struggle with difficult counting questions, and RN is significantly worse on the Compare Numbers questions."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-271",
"text": "In the remaining question types, HAN+RN ++ is either on par with or better than TbD+hres, which uses a larger spatial resolution, a deep pre-trained image CNN, more specialized modules, and requires an \"expert layout\" [52] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-272",
"text": "Here, we keep the conceptual simplicity of the original RN [25] coupled with our simple mechanism for selecting important features, and we train the whole architecture end-to-end from scratch."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-273",
"text": "Finally, through visual inspection, we have observed that the fraction of input cells that we have experimented with (k = 16 for an 8x8 spatial tensor, and k = 64 for a 14x14 spatial tensor) is sufficient to cover all the important objects in the image, and thus the mechanism more closely resembles a saliency mechanism."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-274",
"text": "It is worth noting that the hard-attention mechanism often selects only a few cells corresponding to an object, as this is sufficient to recognize the object's properties such as size, material, color, type, and spatial location."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-275",
"text": "Straight-Through Estimator."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-276",
"text": "As an alternative to our hard attention, we have also implemented a few variants of the straight-through estimator [17] , which is a method introduced to deal with non-differentiable neural modules."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-277",
"text": "In a nutshell, during the forward pass we employ steps that are non-differentiable or have gradients that are zero almost everywhere (e.g., hard thresholding), but in the backward pass, we introduce skip-connections that the back-propagation mechanism uses to bypass these steps."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-278",
"text": "For the purpose of implementing this mechanism gracefully in TensorFlow, we define the estimator as follows. Let x \u2208 R^{n\u00d7d} be the spatial input, with n spatial cells, each d-dimensional."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-279",
"text": "All our estimators have the form x \u00b7 (stop(1 {g(x) \u2265 t(g(x), k)} \u2212 g(x)) + g(x)), so the forward pass applies the hard top-k mask while gradients flow through g."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-280",
"text": "Here, \u00b7 is element-wise multiplication, stop(y) prevents the gradient from propagating through y, t(y, k) returns the k-th largest element of the vector y, 1 {P} outputs 1 if the predicate P is true and 0 otherwise, and g produces a spatial mask similar to the soft attention mask, i.e. g(y) \u2208 R^{n\u00d71}."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-281",
"text": "In all our experiments, g = \u00b5 \u2022 f is the composition of the normalization function (e.g. softmax) \u00b5 and an MLP f with one hidden layer of dimension d 2 , and one ReLU between the hidden and the output layers."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-282",
"text": "For \u00b5, we investigate identity, sigmoid or softmax."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-283",
"text": "Only the latter two yield results significantly better than 60%, but we still find the results either underperforming our hard-attention approach or very unstable."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-284",
"text": "For instance, Figure 5b shows our best results (accuracy-and stability-wise) with straight-through."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-285",
"text": "Moreover, our formulation of the straight-through estimator still requires gradients to be back-propagated through all the cells, even though they are ignored in the forward pass, and hence the method lacks the computational benefits of our hard-attention mechanism."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-286",
"text": "Implementation Details."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-287",
"text": "In the experiments with HAN+Sum, and HAN+RN we follow the same setup as [25] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-288",
"text": "However, we have made slight changes for our larger models: HAN+Sum + , HAN+RN + , and HAN+RN ++ ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-289",
"text": "HAN+Sum + , HAN+RN + , and HAN+RN ++ use an LSTM with 256 hidden units and 64 dimensional word embedding (jointly trained from scratch together with the whole architecture) for language."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-290",
"text": "For the image, we use a CNN with 4 layers, each with stride 2, 3x3 kernel size, ReLU non-linearities, and 128 features at each spatial cell."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-291",
"text": "Our classifier is an MLP with a single hidden layer (1024 dimensional), drop-out 50%, and a single ReLU."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-292",
"text": "Function g \u03b8 defined in [25] is an MLP with four hidden layers (each 256 dimensional) and ReLUs."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-293",
"text": "We also find that, before the sum-pooling in HAN+Sum + , and before the pairwise aggregation in HAN+RN +/++ it is worthwhile to process the multimodal embedding with a 1-by-1 convolution (we use 4 layers, with ReLUs, and 256 features)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-294",
"text": "We use an l2-norm penalty on all the weights as regularization."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-295",
"text": "For hard-attention, we have also found batch-normalization in the image CNN to be crucial to achieve a good performance."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-296",
"text": "Moreover, batch-normalization before 1-by-1 convolutions is also helpful, but not critical."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-297",
"text": "The other hyper-parameters are identical to the ones presented in [25] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-298",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-299",
"text": "**SUMMARY**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-300",
"text": "We have introduced a new approach for hard attention in computer vision that selects a subset of the feature vectors for further processing based on their magnitudes."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-301",
"text": "We explored two models: one that selects subsets with a pre-specified number of vectors (HAN), and one that adaptively chooses the subset size as a function of the inputs (AdaHAN)."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-302",
"text": "Hard attention is often avoided in the literature because it poses a challenge for gradient-based methods due to non-differentiability."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-303",
"text": "However, since we found our feature vectors' magnitudes correlate with relevant information, our hard attention mechanism exploits this property to perform the selection."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-304",
"text": "Our results showed our HAN and AdaHAN gave competitive performance on challenging Visual QA datasets."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-305",
"text": "Our approaches seem to be at least as good as a more commonly used soft attention mechanism while providing additional computational efficiency benefits."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-306",
"text": "This is especially important for the increasingly popular class of non-local approaches, which often require computations and memory which are quadratic in the number of the input vectors."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-307",
"text": "Finally, our approach also provides interpretable representations, as the spatial locations of the selected features correspond to the parts of the image that contributed most strongly to the answer."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-308",
"text": "----------------------------------"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-309",
"text": "**MODEL**"
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-310",
"text": "Table 4: Results, in %, on CLEVR."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-311",
"text": "SAN denotes the SAN [9] implementation of [34] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-313",
"text": "Object RN** [55] and Stack-NMNs** [52] report the results only on the validation set, whereas others report on the test set."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-314",
"text": "Overall performance of Stack-NMNs** [52] is measured with the \"expert layout\" (similar to N2NMN), yielding 96.6, and without it, yielding 93.0."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-315",
"text": "DDRprog \u2212 [71] , PG+EE (700k) \u2212 [70] , TbD \u2212 , and TbD+hres \u2212 [50] are trained with a privileged state-description, while others are trained directly from images-questions-answers."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-316",
"text": "TbD+hres [50] uses high-resolution (28x28) spatial tensor, while majority uses either 8x8 or 14x14."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-317",
"text": "HAN+Sum/RN + denotes a larger relational model, or a different hyper-parameter setup, than the model of [25] ."
},
{
"sent_id": "7bdb51a3ca6c322ef6e04d18ba8483-C001-318",
"text": "HAN+RN ++ denotes HAN+RN + with larger input images with spatial dimensions 224x224 as opposed to 128x128, and larger image tensors with spatial dimension 14x14 as opposed to 8x8."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-15"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-19"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-21"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-41"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-47"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-68"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-152",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-153"
]
],
"cite_sentences": [
"7bdb51a3ca6c322ef6e04d18ba8483-C001-15",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-19",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-21",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-41",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-47",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-68",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-152",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-153"
]
},
"@USE@": {
"gold_contexts": [
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-31"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-42"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-71"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-76"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-173"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-208"
]
],
"cite_sentences": [
"7bdb51a3ca6c322ef6e04d18ba8483-C001-31",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-42",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-71",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-76",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-173",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-208"
]
},
"@MOT@": {
"gold_contexts": [
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-47"
],
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-152",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-153",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-154",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-155"
]
],
"cite_sentences": [
"7bdb51a3ca6c322ef6e04d18ba8483-C001-47",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-152",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-153"
]
},
"@DIF@": {
"gold_contexts": [
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-176"
]
],
"cite_sentences": [
"7bdb51a3ca6c322ef6e04d18ba8483-C001-176"
]
},
"@SIM@": {
"gold_contexts": [
[
"7bdb51a3ca6c322ef6e04d18ba8483-C001-240",
"7bdb51a3ca6c322ef6e04d18ba8483-C001-241"
]
],
"cite_sentences": [
"7bdb51a3ca6c322ef6e04d18ba8483-C001-241"
]
}
}
},
"ABC_27be8a173136e48a15f637278fd831_2": {
"x": [
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-2",
"text": "Context is crucial for identifying argumentative relations in text, but many argument mining methods make little use of contextual features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-3",
"text": "This paper presents contextaware argumentative relation mining that uses features extracted from writing topics as well as from windows of context sentences."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-26",
"text": "The first hypothesis investigated in this paper is that the discourse relations of argument components with adjacent sentences (called context windows in this study, a formal definition is given in \u00a75.3) can help characterize the argumentative relations that connect pairs of argument components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-27",
"text": "Reconsidering the example in Figure 1 , without knowing the content \"horrendous images are displayed on the cigarette boxes\" in sentence 3, one cannot easily tell that \"reduction in the number of smokers\" in sentence 4 supports the \"pictures can influence\" claim in sentence 2."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-28",
"text": "We expect that such content relatedness can be revealed from a discourse analysis, e.g., the appearance of a discourse connective \"As a result\"."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-29",
"text": "While topic information in many writing genres (e.g., scientific publications, Wikipedia articles, student essays) has been used to create features for argument component mining (Teufel and Moens, 2002; Levy et al., 2014; Nguyen and Litman, 2015) , topic-based features have been less explored for argumentative relation mining."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-30",
"text": "The second hypothesis investigated in this paper is that features based on topic context also provide useful information for improving argumentative relation mining."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-31",
"text": "In the excerpt below, knowing that 'online game' and 'computer' are topically related might help a model decide that the claim in sentence 1 supports the claim in sentence 2:"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-32",
"text": "(1) People who are addicted to games, especially online games, can eventually bear dangerous consequences [Claim] ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-33",
"text": "(2) Although it is undeniable that computer is a crucial part of human life [P remise] , it still has its bad side [M ajorClaim] ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-34",
"text": "2 Motivated by the discussion above, we propose context-aware argumentative relation mining -a novel approach that makes use of contextual fea-2 In this excerpt, the Premise was annotated as an attack to the MajorClaim in sentence 2."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-35",
"text": "Figure 2: Structure of the argumentation in the excerpt."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-36",
"text": "Relations are illustrated accordingly to the annotation provided in the corpus."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-37",
"text": "Premises 3 and 4 were annotated for separate relations to Claim 2."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-38",
"text": "Our visualization should not mislead that the two premises are linked or convergent."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-39",
"text": "tures that are extracted by exploiting context sentence windows and writing topic to improve relation prediction."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-40",
"text": "In particular, we derive features using discourse relations between argument components and windows of their surrounding sentences."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-41",
"text": "We also derive features using an argument and domain word lexicon automatically created by post-processing an essay's topic model."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-42",
"text": "Experimental results show that our proposed contextual features help significantly improve performance in two argumentative relation classification tasks."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-44",
"text": "**RELATED WORK**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-45",
"text": "Unlike argument component identification where textual inputs are typically sentences or clauses (Moens et al., 2007; Stab and Gurevych, 2014b; Levy et al., 2014; Lippi and Torroni, 2015) , textual inputs of argumentative relation mining vary from clauses (Stab and Gurevych, 2014b; Peldszus, 2014 ) to multiple-sentences (Biran and Rambow, 2011; Cabrio and Villata, 2012; Boltu\u017ei\u0107 an\u010f Snajder, 2014) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-46",
"text": "Studying claim justification between user comments, Biran and Rambow (2011) proposed that the argumentation in justification of a claim can be characterized with discourse structure in the justification."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-47",
"text": "They however only considered discourse markers but not discourse relations."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-48",
"text": "Cabrio et al. (2013) conducted a corpus analysis and found certain similarity between Penn Discourse TreeBank relations (Prasad et al., 2008) and argumentation schemes (Walton et al., 2008) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-49",
"text": "However they did not discuss how such similarity could be applied to argument mining."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-50",
"text": "Motivated by these findings, we propose to use features extracted from discourse relations be-tween sentences for argumentative relation mining."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-51",
"text": "Moreover, to enable discourse relation features when the textual inputs are only sentences/clauses, we group the inputs with their context sentences."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-52",
"text": "Qazvinian and Radev (2010) used the term \"context sentence\" to refer to sentences surrounding a citation that contained information about the cited source but did not explicitly cite it."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-53",
"text": "In our study, we only require that the context sentences of an argument component must be in the same paragraph and adjacent to the component."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-54",
"text": "Prior work in argumentative relation mining has used argument component labels to provide constraints during relation identification."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-55",
"text": "For example, when an annotation scheme (e.g., (Peldszus and Stede, 2013; Stab and Gurevych, 2014a) ) does not allow relations from claim to premise, no relations are inferred during relation mining for any argument component pair where the source is a claim and the target is a premise."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-56",
"text": "In our work, we follow Stab and Gurevych (2014b) and use the predicted labels of argument components as features during argumentative relation mining."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-57",
"text": "We, however, take advantage of an enhanced argument component model (Nguyen and Litman, 2016 ) to obtain more reliable argument component labels than in (Stab and Gurevych, 2014b) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-58",
"text": "Argument mining research has studied different data-driven approaches for separating organizational content (shell) from topical content to improve argument component identification, e.g., supervised sequence model (Madnani et al., 2012) , unsupervised probabilistic topic models (S\u00e9aghdha and Teufel, 2014; Du et al., 2014) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-59",
"text": "Nguyen and Litman (2015) post-processed LDA (Blei et al., 2003) output to extract a lexicon of argument and domain words from development data."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-60",
"text": "Their semi-supervised approach exploits the topic context through essay titles to guide the extraction."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-61",
"text": "Finally, prior research has explored predicting different argumentative relationship labels between pairs of argument components, e.g., attachment (Peldszus and Stede, 2015a) , support vs. non-support (Biran and Rambow, 2011; Cabrio and Villata, 2012; Stab and Gurevych, 2014b) , {implicit, explicit}\u00d7{support, attack} (Boltu\u017ei\u0107 and\u0160najder, 2014) , verifiability of support (Park and Cardie, 2014) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-62",
"text": "Our experiments use two such argumentative relation classification tasks (Support vs. Non-support, Support vs. Attack) to evaluate the effectiveness of our proposed features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-64",
"text": "**PERSUASIVE ESSAY CORPUS**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-65",
"text": "Stab and Gurevych (2014a) compiled the Persuasive Essay Corpus consisting of 90 student argumentative essays and made it publicly available."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-66",
"text": "3 Because the corpus has been utilized for different argument mining tasks (Stab and Gurevych, 2014b; Nguyen and Litman, 2015; Nguyen and Litman, 2016) , we use this corpus to demonstrate our context-aware argumentative relation mining approach, and adapt the model developed by Stab and Gurevych (2014b) to serve as the baseline for evaluating our proposed approach."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-67",
"text": "Three experts identified possible argument components of three types within each sentence in the corpus (MajorClaim -writer's stance toward the writing topic, Claim -controversial statements that support or attack MajorClaim, and Premiseevidence used to underpin the validity of Claim), and also connected the argument components using two argumentative relations (Support and Attack)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-68",
"text": "According to the annotation manual, each essay has exactly one MajorClaim."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-69",
"text": "A sentence can have one or more argument components (Argumentative sentences)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-70",
"text": "Sentences that do not contain any argument component are labeled Nonargumentative."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-71",
"text": "Figure 1 shows an example essay with components annotated, and Figure 2 illustrates relations between those components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-72",
"text": "Argumentative relations are directed and can hold between a Premise and another Premise, a Premise and a (Major-) Claim, or a Claim and a MajorClaim."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-73",
"text": "Except for the relation from Claim to MajorClaim, an argumentative relation does not cross paragraph boundaries."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-74",
"text": "The three experts achieved inter-rater accuracy 0.88 for component labels and Krippendorff's \u03b1 U 0.72 for component boundaries."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-75",
"text": "Given the annotated argument components, the three experts obtained Krippendorff's \u03b1 0.81 for relation labels."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-76",
"text": "The number of relations are shown in Table 1."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-77",
"text": "4 Argumentative Relation Tasks 4.1 Task 1: Support vs. Non-support Our first task follows (Stab and Gurevych, 2014b) : given a pair of source and target argument components, identify whether the source argumentatively supports the target or not."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-78",
"text": "Note that when a support relation does not hold, the source may attack or has no relation with the target compo- Stab and Gurevych (2014b) split the corpus into an 80% training set and a 20% test set which have similar label distributions."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-79",
"text": "We use this split to train and test our proposed models, and directly compare our models' performance to the reported performance in (Stab and Gurevych, 2014b) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-81",
"text": "**TASK 2: SUPPORT VS. ATTACK**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-82",
"text": "To further evaluate the effectiveness of our approach, we conduct an additional task that classifies an argumentative relation as Support or Attack."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-83",
"text": "For this task, we assume that the relation (i.e., attachment (Peldszus, 2014) ) between two components is given, and aim at identifying the argumentative function of the relation."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-84",
"text": "Because we remove the paragraph constraint in this task, we obtain more Support relations than in Task 1."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-85",
"text": "As shown in Table 1 , of the total 1473 relations, we have 1312 (89%) Support and 161 (11%) Attack relations."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-86",
"text": "Because this task was not studied in (Stab and Gurevych, 2014b) , we adapt Stab and Gurevych's model to use as the baseline."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-88",
"text": "**ARGUMENTATIVE RELATION MODELS**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-90",
"text": "**BASELINE**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-91",
"text": "We adapt (Stab and Gurevych, 2014b ) to use as a baseline for evaluating our approach."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-92",
"text": "Given a pair of argument components, we follow (Stab and Gurevych, 2014b) by first extracting 3 feature sets: structural (e.g., word counts, sentence position), lexical (e.g., word pairs, first words), and grammatical production rules (e.g., S\u2192NP,VP)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-93",
"text": "Because a sentence may have more than one argument component, the relative component positions might provide useful information (Peldszus, 2014) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-94",
"text": "Thus, we also include 8 new component position features: whether the source and target components are the whole sentences or the beginning/end components of the sentences; if the source is before or after the target component; and the absolute difference of their positions."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-95",
"text": "Stab and Gurevych (2014b) used a 55-discourse marker set to extract indicator features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-96",
"text": "We expand their discourse maker set by combining them with a 298-discourse marker set developed in (Biran and Rambow, 2011)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-97",
"text": "We expect the expanded set of discourse markers will represent better possible discourse relations in the texts."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-98",
"text": "Stab and Gurevych (2014b) used predicted label of argument components as features for both training and testing their argumentation structure identification model."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-99",
"text": "5 As their predicted labels are not available to us, we adapt this feature set by using the argument component model in (Nguyen and Litman, 2016) which was shown to outperform the corresponding model of Stab and Gurevych."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-100",
"text": "For later presentation purposes, we name the set of all features from this section except word pairs and production rules as the common features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-101",
"text": "While word pairs and grammatical production rules were the most predictive features in (Stab and Gurevych, 2014b) , we hypothesize that this large and sparse feature space may have negative impact on model robustness (Nguyen and Litman, 2015) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-102",
"text": "Most of our proposed models replace word pairs and production rules with different combinations of new contextual features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-104",
"text": "**TOPIC-CONTEXT MODEL**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-105",
"text": "Our first proposed model (TOPIC) makes use of Topic-context features derived from a lexicon of argument and domain words for persuasive essays (Nguyen and Litman, 2015) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-106",
"text": "Argument words (e.g., 'believe', 'opinion') signal the argumentative content and are commonly used across different topics."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-107",
"text": "In contrast, domain words are specific terminologies commonly used within the topic (e.g., 'art', 'education')."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-108",
"text": "The authors first use topic prompts in development data of unannotated persuasive essays to semi-automatically collect argument and domain seed words."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-109",
"text": "In particular, they used 10 argument seed words: agree, disagree, reason, support, advantage, disadvantage, think, conclusion, result, opinion."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-110",
"text": "Domain seed words are those in the topic prompts but not argument seed words or stop words."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-111",
"text": "The seeds words are then used to supervise an automated extraction of argument and domain words from output of LDA topic model (Blei et al., 2003) on the development data."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-112",
"text": "The extracted lexicon consists of 263 (stemmed) argument words and 1806 (stemmed) domain words mapped to 36 LDA topics."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-113",
"text": "6 All argument words are from a single LDA topic while a domain word can map to multiple LDA topics (except the topic of argument words)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-114",
"text": "Using the lexicon, we extract the following Topic-context features:"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-115",
"text": "Argument word: from all word pairs extracted from the source and target components, we remove those that have at least one word not in the argument word list."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-116",
"text": "Each argument word pair defines a boolean feature indicating its presence in the argument component pair."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-117",
"text": "We also include each argument word of the source and target components as a boolean feature which is true if the word is present in the corresponding component."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-118",
"text": "We count number of common argument words, the absolute difference in number of argument words between source and target components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-119",
"text": "Domain word count: to measure the topic similarity between the source and target components, we calculate number of common domain words, number of pairs of two domain words that share an LDA topic, number of pairs that share no LDA topic, and the absolute difference in number of domain words between the two components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-120",
"text": "Non-domain MainVerb-Subject dependency: we extract MainVerb-Subject dependency triples, e.g., nsubj(belive, I), from the source and target components, and filter out triples that involve domain words."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-121",
"text": "We model each extracted triple as a boolean feature which is true if the corresponding argument component has the triple."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-122",
"text": "Finally, we include the common feature set."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-123",
"text": "To illustrate the Topic-context features, consider the following source and target components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-124",
"text": "Argument words are in boldface, and domain words are in italic."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-125",
"text": "Essay 54."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-126",
"text": "Topic: museum and art gallery will disappear soon?"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-127",
"text": "Source: more and more people can watch exhibitions through television or internet at home due to modern technology [P remise] Target: some people think museums and art galleries will disappear soon [Claim] An argument word pair is people-think."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-128",
"text": "There are 35 pairs of domain words."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-129",
"text": "A pair of two domain words that share an LDA topic is exhibitionsart."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-130",
"text": "A pair of two domain words that do not share any LDA topic is internet-galleries."
},
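The Topic-context counting features above can be sketched as follows. This is a minimal illustration, not the authors' code: the tiny argument/domain lexicon here is an invented placeholder for the 263 argument words and 1806 topic-mapped domain words of Nguyen and Litman (2015), and the function name is hypothetical.

```python
from itertools import product

# Hypothetical mini-lexicon standing in for the extracted lexicon of
# Nguyen and Litman (2015): argument words, and domain words -> LDA topic ids.
ARGUMENT_WORDS = {"people", "think", "believe", "opinion"}
DOMAIN_TOPICS = {
    "exhibitions": {3}, "art": {3}, "galleries": {3}, "museums": {3},
    "television": {7}, "internet": {7},
}

def topic_context_counts(source_words, target_words):
    """Compute the counting-style Topic-context features for a component pair."""
    src_arg = [w for w in source_words if w in ARGUMENT_WORDS]
    tgt_arg = [w for w in target_words if w in ARGUMENT_WORDS]
    # Argument word pairs: keep only pairs in which BOTH words are argument words.
    arg_pairs = set(product(src_arg, tgt_arg))

    src_dom = [w for w in source_words if w in DOMAIN_TOPICS]
    tgt_dom = [w for w in target_words if w in DOMAIN_TOPICS]
    shared = disjoint = 0
    for s, t in product(src_dom, tgt_dom):
        # A domain-word pair "shares a topic" when the LDA topic sets intersect.
        if DOMAIN_TOPICS[s] & DOMAIN_TOPICS[t]:
            shared += 1
        else:
            disjoint += 1
    return {
        "arg_pairs": arg_pairs,
        "common_arg_words": len(set(src_arg) & set(tgt_arg)),
        "arg_count_diff": abs(len(src_arg) - len(tgt_arg)),
        "common_domain_words": len(set(src_dom) & set(tgt_dom)),
        "domain_pairs_shared_topic": shared,
        "domain_pairs_no_shared_topic": disjoint,
        "domain_count_diff": abs(len(src_dom) - len(tgt_dom)),
    }
```

On the Essay 54 pair above, this toy lexicon would recover people-think as an argument word pair and exhibitions-art as a domain pair sharing an LDA topic.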
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-132",
"text": "**WINDOW-CONTEXT MODEL**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-133",
"text": "Our second proposed model (WINDOW) extracts features from discourse relations and common words between context sentences in the context windows of the source and target components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-134",
"text": "Definition."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-135",
"text": "Context window of an argument component is a text segment formed by neighboring sentences and the covering sentence of the component."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-136",
"text": "The neighboring sentences are called context sentences, and must be in the same paragraph with the component."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-137",
"text": "In this study, context windows are determined using window-size heuristics."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-138",
"text": "7 Given a windowsize n, we form a context window by grouping the covering sentence with at most n adjacently preceding and n adjacently following sentences that must be in the same paragraph."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-139",
"text": "To minimize noise in feature space, we require that context windows of the source and target components must be mutually exclusive."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-140",
"text": "Biran and Rambow (2011) observed that the relation between a source argument and a target argument is usually instantiated by some elaboration/justification provided in a support of the source argument."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-141",
"text": "Therefore we prioritize the context window of source component when it overlaps with the target context window."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-142",
"text": "Particularly, we keep overlapping context sentences in the source window, and remove them from the target window."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-143",
"text": "For example, with window-size 1, context windows of the Claim in sentence 2 in Figure 1 and the Premise in sentence 4 overlap at sentence 3. When the Claim is set as source component, its context window includes sentences {2, 3}, and the Premise as a target has context window with only sentence 4."
},
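The window-size heuristic and overlap handling can be sketched as follows, assuming sentences are indexed 0..para_len-1 within a paragraph; the function names are hypothetical and the target's own covering sentence is always retained, a simplifying assumption on our part.

```python
def context_window(idx, para_len, n):
    """Sentence indices in the window: the covering sentence plus up to n
    preceding and n following sentences, clipped to the paragraph [0, para_len)."""
    return set(range(max(0, idx - n), min(para_len, idx + n + 1)))

def paired_windows(src_idx, tgt_idx, para_len, n):
    """Build mutually exclusive windows for a (source, target) pair.
    Overlapping context sentences stay with the source window (following the
    observation that elaboration of the source matters most) and are removed
    from the target window."""
    src = context_window(src_idx, para_len, n)
    tgt = context_window(tgt_idx, para_len, n) - src
    tgt.add(tgt_idx)  # the target's covering sentence is always kept
    return src, tgt
```

For example, with window-size 1 and a source at sentence 2 and a target at sentence 4 of a six-sentence paragraph, the shared neighbor (sentence 3) is assigned to the source window only.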
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-144",
"text": "We extract three Window-context feature sets from the context windows:"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-145",
"text": "Common word: as common word counts between adjacent sentences were shown useful for argument mining (Nguyen and Litman, 2016) , we count common words between the covering sentence with preceding context sentences, and with following context sentences, for source and target components."
},
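As a rough sketch of the common-word counts, here is a minimal word-type overlap on whitespace tokens; the actual feature extraction may tokenize and normalize differently, and the function name is our own.

```python
def common_word_counts(covering, preceding, following):
    """Common-word features for one component: overlap (in word types) between
    the covering sentence and its preceding / following context sentences."""
    cov = set(covering.lower().split())
    before = sum(len(cov & set(s.lower().split())) for s in preceding)
    after = sum(len(cov & set(s.lower().split())) for s in following)
    return before, after
```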
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-146",
"text": "Discourse relation: for both source and target components, we extract discourse relations between context sentences, and within the covering sentence."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-147",
"text": "We also extract discourse relations between each pair of source context sentence and target context sentence."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-148",
"text": "Each relation defines a boolean feature."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-149",
"text": "We extract both Penn Discourse Treebank (PDTB) relations (Prasad et al., 2008) and Rhetorical Structure Theory Discourse Treebank (RST-DTB) relations (Carlson et al., 2001 ) using publicly available discourse parsers (Ji and Eisenstein, 2014; Wang and Lan, 2015) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-150",
"text": "Each PDTB relation has sense label defined in a 3-layered (class, type, subtype), e.g., CONTINGENCY.Cause.result."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-151",
"text": "While there are only four semantic class labels at the class-level which may not cover well different aspects of argumentative relation, subtype-level output is not available given the discourse parser we use."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-152",
"text": "Thus, we use relations at type-level as features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-153",
"text": "For RST-DTB relations, we use only relation labels, but ignore the nucleus and satellite labels of components as they do not provide more information given the component order in the pair."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-154",
"text": "Because temporal relations were shown not helpful for argument mining tasks (Biran and Rambow, 2011; Stab and Gurevych, 2014b) , we exclude them here."
},
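The type-level truncation and temporal filtering described above amount to a small string transformation on the parser's sense labels; a sketch under the assumption that senses arrive as dot-separated class.type.subtype strings (function names are ours):

```python
def pdtb_type_level(sense):
    """Truncate a PDTB sense label (class.type.subtype) to the type level."""
    return ".".join(sense.split(".")[:2])

def discourse_relation_features(senses):
    """Boolean discourse-relation features from parsed PDTB sense labels,
    truncated to type level and excluding TEMPORAL relations."""
    feats = set()
    for sense in senses:
        if sense.split(".")[0].upper() == "TEMPORAL":
            continue  # temporal relations were shown unhelpful, so drop them
        feats.add(pdtb_type_level(sense))
    return feats
```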
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-155",
"text": "Discourse marker: while the baseline model only considers discourse markers within the argument components, we define a boolean feature for each discourse marker classifying whether the marker is present before the covering sentence of the source and target components or not."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-156",
"text": "This implementation aims to characterize the discourse of the preceding and following text segments of each argument component separately."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-157",
"text": "Finally, we include the common feature set."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-159",
"text": "**COMBINED MODEL**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-160",
"text": "While Window-context features are extracted from surrounding text of the argument components, which exploits the local context, the Topic-context features are an abstraction of topicdependent information, e.g., domain words are defined within the context of topic domain (Nguyen and Litman, 2015) , and thus make use of the global context of the topic domain."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-161",
"text": "We believe that local and global context information represent complementary aspects of the relation between argument components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-162",
"text": "Thus, we expect to achieve the best performance by combining Window-context and Topic-context models."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-164",
"text": "**FULL MODEL**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-165",
"text": "Finally, the FULL model includes all features in BASELINE and COMBINED models."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-166",
"text": "That is, the FULL model is the COMBINED model plus word pairs and production rules."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-167",
"text": "A summary of all models is shown in Figure 3 ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-168",
"text": "window-size in range [0, 8] 8 that yields the best F1 score in 10-fold cross validation."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-169",
"text": "We use the training set as determined in (Stab and Gurevych, 2014b) to train/test 9 the models using LibLINEAR algorithm (Fan et al., 2008) without parameter or feature optimization."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-170",
"text": "Cross-validations are conducted using Weka (Hall et al., 2009) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-171",
"text": "We use Stanford parser (Klein and Manning, 2003) to perform text processing."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-172",
"text": "As shown in Figure 4 , while increasing the window-size from 2 to 3 improves performance (significantly), using window-sizes greater than 3 does not gain further improvement."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-173",
"text": "We hypothesize that after a certain limit, larger context windows will produce more noise than helpful information for the prediction."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-174",
"text": "Therefore, we set the window-size to 3 in all of our experiments involving Window-context model (all with a separate test set)."
},
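The window-size tuning loop is a simple grid search; a minimal sketch where cv_f1 is a hypothetical callback that re-extracts features for a given window-size, runs 10-fold cross-validation, and returns the mean F1:

```python
def tune_window_size(cv_f1, max_n=8):
    """Select the window-size in [0, max_n] with the highest cross-validated F1.
    cv_f1(n) is assumed to featurize with window-size n and return a mean F1."""
    scores = {n: cv_f1(n) for n in range(max_n + 1)}
    # max over dict keys; on ties the smallest window-size wins (insertion order)
    return max(scores, key=scores.get)
```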
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-176",
"text": "**PERFORMANCE ON TEST SET**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-177",
"text": "We train all models using the training set and report their performances on the test set in Table 2 ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-178",
"text": "We also compare our baseline to the reported performance (REPORT) for Support vs. Non-support classification in (Stab and Gurevych, 2014b) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-179",
"text": "The learning algorithm with parameters are kept the same as in the window-size tuning experiment."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-180",
"text": "Given the skewed class distribution of this data, Accuracy and F1 of Non-support (the major class) are less important than Kappa, F1, and F1 of Support (the minor class)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-181",
"text": "To conduct T-tests for performance significance, we split the test data into subsets by essays' ID, and record prediction performance for individual essays."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-182",
"text": "We first notice that the performances of our baseline model are better than (or equal to) RE-PORTED, except the Macro Recall."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-183",
"text": "We reason that these performance disparities may be due to the differences in feature extractions between our implementation and Stab and Gurevych's, and also due to the minor set of new features (e.g., new predicted labels, expanded marker set, component position) that we added in our baseline."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-184",
"text": "Comparing proposed models with BASELINE, we see that WINDOW, COMBINED, and FULL models outperform BASELINE in important metrics: Kappa, F1, Recall, but TOPIC yields worse performances than BASELINE."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-185",
"text": "However, the fact that COMBINED outperforms BASELINE, especially with significantly higher Kappa, F1, Recall, and F1:Support, has Overall, by combining TOPIC and WINDOW models, we obtain the best performance."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-186",
"text": "Most notably, we obtain the highest improvement in F1:Support, and have the best balance between Precision and Recall values among all models."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-187",
"text": "These reveal that our contextual features not only dominate generic features like word pairs and production rules, but also are effective to predict minor positive class (i.e., Support)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-188",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-189",
"text": "**TASK 2: SUPPORT VS. ATTACK**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-190",
"text": "To evaluate the robustness of our proposed models, we conduct an argumentative relation classification experiment that classifies a relation as Support or Attack."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-191",
"text": "Because this task was not studied in (Stab and Gurevych, 2014b ) and the training/test split for Support vs. Not task is not applicable here, we conduct 5\u00d710-fold cross validation."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-192",
"text": "We do not optimize the window-size parameter of the WINDOW model, and use the value 3 as set up before."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-193",
"text": "Average prediction performance of all models are reported in Table 3 ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-194",
"text": "Comparing our proposed models with the baseline shows that all of our proposed models significantly outperform the baseline in important metrics: Kappa, F1, F1:Attack."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-195",
"text": "More notably than in the Support vs. Non-support classification, all of our proposed models predict the minor class (Attack) significantly more effectively than the baseline."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-196",
"text": "The baseline achieves significantly higher F1:Support than WINDOW model."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-197",
"text": "However, F1:Support of the baseline is in a tie with TOPIC, COMBINED, and FULL."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-198",
"text": "Comparing our proposed models, we see that TOPIC and WINDOW models reveal different behaviors."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-199",
"text": "TOPIC model has significantly higher Precision and F1:Support, and significantly lower Recall and F1:Attack than WINDOW."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-200",
"text": "Moreover, WINDOW model has slightly higher Kappa, F1, but significantly lower Accuracy."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-201",
"text": "These comparisons indicate that Topic-context and Windowcontext features are equally effective but impact differently to the prediction."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-202",
"text": "The different nature between these two feature sets is clearer than in the prior experiment, as now the classification involves classes that are more semantically different, i.e., Support vs. Attack."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-203",
"text": "We recall that TOPIC model performs worse than WINDOW model in Support vs. Non-support task."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-204",
"text": "Our FULL model performs significantly worse than all of TOPIC, WINDOW, and COMBINED in Kappa, F1, Recall, and F1:Attack."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-205",
"text": "Along with results from Support vs. Non-support task, this further suggests that word pairs and production rules are less effective and cannot be combined well with our contextual features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-206",
"text": "Despite the fact that the Support vs. Attack task (Task 2) has smaller and more imbalanced data than the Support vs. Non-support (Task 1), our proposed contextual features seem to add even more value in Task 2 compared to Task 1."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-207",
"text": "Using Kappa to roughly compare prediction performance across the two tasks, we observe a greater performance improvement from Baseline to Combined model in Task 2 than in Task 1."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-208",
"text": "This is an evidence that our proposed context-aware features work well even in a more imbalanced with smaller data classification task."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-209",
"text": "The lower performance values of all models in Support vs. Attack than in Support vs. Non-support indirectly suggest that Support vs. Attack classification is a more difficult task."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-210",
"text": "We hypothesize that the difference between support and attack exposes a deeper semantic relation than that between support and no-relation."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-211",
"text": "We plan to extract textual text similarity and textual entailment features (Cabrio and Villata, 2012; Boltu\u017ei\u0107 and\u0160najder, 2014) to investigate this hypothesis in our future work."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-213",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-214",
"text": "In this paper, we have presented context-aware argumentative relation mining that makes use of contextual features by exploiting information from topic context and context sentences."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-215",
"text": "We have explored different ways to incorporate our proposed features with baseline features used in a prior study, and obtained insightful results about feature effectiveness."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-216",
"text": "Experimental results show that Topic-context and Window-context features are both effective but impact predictive performance measures differently."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-217",
"text": "In addition, predicting an argumentative relation will benefit most from combining these two set of features as they capture complementary aspects of context to better characterize the argumentation in justification."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-218",
"text": "The results obtained in this preliminary study are promising and encourage us to explore more directions to enable contextual features."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-219",
"text": "Our next step will investigate uses of topic segmentation to identify context sentences and compare this linguistically-motivated approach to our current window-size heuristic."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-220",
"text": "We plan to follow prior research on graph optimization to refine the argumentation structure and improve argumentative relation prediction."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-221",
"text": "Also, we will apply our contextaware argumentative relation mining to different argument mining corpora to further evaluate its generality."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-4",
"text": "Experiments on student essays demonstrate that the proposed features improve predictive performance in two argumentative relation classification tasks."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-7",
"text": "By supporting tasks such as automatically identifying argument components 1 (e.g., premises, claims) in text, and the argumentative relations (e.g., support, attack) between components, argument (argumentation) mining has been studied for applications in different research fields such as document summarization (Teufel and Moens, 2002) , opinion mining (Boltu\u017ei\u0107 and\u0160najder, 2014) , automated essay evaluation (Burstein et al., 2003) , legal information systems (Palau and Moens, 2009) , and policy modeling platforms (Florou et al., 2013) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-8",
"text": "Given a pair of argument components with one component as the source and the other as the target, argumentative relation mining involves determining whether a relation holds from the source to the target, and classifying the argumentative function of the relation (e.g., support vs. attack)."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-9",
"text": "Ar-1 There is no consensus yet on an annotation scheme for argument components, or on the minimal textual units to be annotated."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-10",
"text": "We follow Peldszus and Stede (2013) and consider \"argument mining as the automatic discovery of an argumentative text portion, and the identification of the relevant components of the argument presented there.\" We also borrow their term \"argumentative discourse unit\" to refer to the textual units (e.g., text segment, sentence, clause) which are considered as argument components."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-11",
"text": "Essay 73."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-12",
"text": "Topic: Is image more powerful than the written word?"
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-13",
"text": "... (1) Hence, I agree only to certain degree that in today's world, image serves as a more effective means of communication [M ajorClaim] ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-14",
"text": "(2) Firstly, pictures can influence the way people think [Claim] ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-15",
"text": "(3) For example, nowadays horrendous images are displayed on the cigarette boxes to illustrate the consequences of smoking [P remise] ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-16",
"text": "(4) As a result, statistics show a slight reduction in the number of smokers, indicating that they realize the effects of the negative habit [P remise] ... (Stab and Gurevych, 2014a) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-17",
"text": "Sentences are numbered and argument components are tagged."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-18",
"text": "gumentative relation mining -beyond argument component mining -is perceived as an essential step towards more fully identifying the argumentative structure of a text (Peldszus and Stede, 2013; Sergeant, 2013; Stab et al., 2014) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-19",
"text": "Consider the second paragraph shown in Figure 1 ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-20",
"text": "Only detecting the argument components (a claim in sentence 2 and two premises in sentences 3 and 4) does not give a complete picture of the argumentation."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-21",
"text": "By looking for relations between these components, one can also see that the two premises together justify the claim."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-22",
"text": "The argumentation structure of the text in Figure 1 is illustrated in Figure 2 ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-23",
"text": "Our current study proposes a novel approach for argumentative relation mining that makes use of contextual features extracted from surrounding sentences of source and target components as well as from topic information of the writings."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-24",
"text": "Prior argumentative relation mining studies have often used features extracted from argument components to model different aspects of the relations between the components, e.g., relative distance, word pairs, semantic similarity, textual entailment (Cabrio and Villata, 2012; Stab and Gurevych, 2014b; Boltu\u017ei\u0107 and\u0160najder, 2014; Peldszus and Stede, 2015b) ."
},
{
"sent_id": "27be8a173136e48a15f637278fd831-C001-25",
"text": "Features extracted from the text surrounding the components have been less explored, e.g., using words and their part-of-speech from adjacent sentences (Peldszus, 2014) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"27be8a173136e48a15f637278fd831-C001-24"
],
[
"27be8a173136e48a15f637278fd831-C001-45"
],
[
"27be8a173136e48a15f637278fd831-C001-61"
],
[
"27be8a173136e48a15f637278fd831-C001-95"
],
[
"27be8a173136e48a15f637278fd831-C001-98"
]
],
"cite_sentences": [
"27be8a173136e48a15f637278fd831-C001-24",
"27be8a173136e48a15f637278fd831-C001-45",
"27be8a173136e48a15f637278fd831-C001-61",
"27be8a173136e48a15f637278fd831-C001-95",
"27be8a173136e48a15f637278fd831-C001-98"
]
},
"@USE@": {
"gold_contexts": [
[
"27be8a173136e48a15f637278fd831-C001-56"
],
[
"27be8a173136e48a15f637278fd831-C001-77"
],
[
"27be8a173136e48a15f637278fd831-C001-92"
],
[
"27be8a173136e48a15f637278fd831-C001-95",
"27be8a173136e48a15f637278fd831-C001-96"
],
[
"27be8a173136e48a15f637278fd831-C001-154"
],
[
"27be8a173136e48a15f637278fd831-C001-169"
],
[
"27be8a173136e48a15f637278fd831-C001-178"
]
],
"cite_sentences": [
"27be8a173136e48a15f637278fd831-C001-56",
"27be8a173136e48a15f637278fd831-C001-77",
"27be8a173136e48a15f637278fd831-C001-92",
"27be8a173136e48a15f637278fd831-C001-95",
"27be8a173136e48a15f637278fd831-C001-154",
"27be8a173136e48a15f637278fd831-C001-169",
"27be8a173136e48a15f637278fd831-C001-178"
]
},
"@DIF@": {
"gold_contexts": [
[
"27be8a173136e48a15f637278fd831-C001-57"
]
],
"cite_sentences": [
"27be8a173136e48a15f637278fd831-C001-57"
]
},
"@SIM@": {
"gold_contexts": [
[
"27be8a173136e48a15f637278fd831-C001-61",
"27be8a173136e48a15f637278fd831-C001-62"
]
],
"cite_sentences": [
"27be8a173136e48a15f637278fd831-C001-61"
]
},
"@EXT@": {
"gold_contexts": [
[
"27be8a173136e48a15f637278fd831-C001-66"
],
[
"27be8a173136e48a15f637278fd831-C001-86"
],
[
"27be8a173136e48a15f637278fd831-C001-98",
"27be8a173136e48a15f637278fd831-C001-99"
]
],
"cite_sentences": [
"27be8a173136e48a15f637278fd831-C001-66",
"27be8a173136e48a15f637278fd831-C001-86",
"27be8a173136e48a15f637278fd831-C001-98"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"27be8a173136e48a15f637278fd831-C001-101"
]
],
"cite_sentences": [
"27be8a173136e48a15f637278fd831-C001-101"
]
}
}
},
"ABC_d2ce392240108203377d8e51e89d09_2": {
"x": [
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-2",
"text": "This paper presents the Virginia Tech system that participated in the CoNLL-2016 shared task on shallow discourse parsing."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-3",
"text": "We describe our end-to-end discourse parser that builds on the methods shown to be successful in previous work."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-4",
"text": "The system consists of several components, such that each module performs a specific subtask, and the components are organized in a pipeline fashion."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-5",
"text": "We also present our efforts to improve several componentsexplicit sense classification and argument boundary identification for explicit and implicit arguments -and present evaluation results."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-6",
"text": "In the closed evaluation, our system obtained an F1 score of 20.27% on the blind test."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-9",
"text": "The CoNLL-2016 shared task on shallow discourse parsing is an extension of last year's competition where participants built end-to-end discourse parsers."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-10",
"text": "In this paper, we present the Virginia Tech system that participated in the CoNLL-2016 shared task."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-11",
"text": "Our system is based on the methods and approaches introduced in earlier work that focused on developing individual components of an end-to-end shallow discourse parsing system, as well as the overall architecture ideas that were introduced and proved to be successful in the competition last year."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-12",
"text": "Our discourse parser consists of multiple components that are organized using a pipeline architecture."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-13",
"text": "We also present novel features -for the explicit sense classifier and argument extractorsthat show improvement over the respective components of state-of-the-art systems submitted last year."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-14",
"text": "In the closed evaluation track, our system achieved an F1 score of 20.27% on the official blind test set."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-15",
"text": "The remainder of the paper is organized as follows."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-16",
"text": "Section 2 describes the shared task."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-17",
"text": "In Section 3, we present our system architecture."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-18",
"text": "In Section 4, each component is described in detail."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-19",
"text": "The official evaluation results are presented in Section 5."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-20",
"text": "Section 6 concludes."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-135",
"text": "This component is trained with the Na\u00efve Bayes algorithm."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-21",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-22",
"text": "**TASK DESCRIPTION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-23",
"text": "The CoNLL-2016 shared task (Xue et al., 2016) focuses on shallow discourse parsing and is a second edition of the task."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-24",
"text": "The task is to identify discourse relations that are present in natural language text."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-25",
"text": "A discourse relation can be expressed explicitly or implicitly."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-26",
"text": "Explicit discourse relations are those that contain an overt discourse connective in text, e.g. because, but, and."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-27",
"text": "Implicit discourse relations, in contrast, are not expressed via an overt discourse connective."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-28",
"text": "Each discourse relation is also associated with two argumentsArgument 1 (Arg1) and Argument 2 (Arg2) -that can be realized as clauses, sentences, or phrases; each relation is labeled with a sense."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-29",
"text": "The overall task consists of identifying all components of a discourse relation -explicit connective (for an explicit relation), arguments with exact boundaries, as well as the sense of a relation."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-30",
"text": "In addition to explicit and implicit relations that are related by an overt or a non-overt discourse connective, two other relation types (AltLex and EntRel) are marked and need to be identified."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-31",
"text": "The arguments of these two relation types always correspond to entire sentences."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-32",
"text": "Examples below illustrate an explicit relation (1), an implicit relation (2); AltLex (3) and EntRel (4)."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-33",
"text": "The connective is underlined; Arg1 is italicized, and Arg2 is in bold in each example."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-34",
"text": "The relation sense is shown in parentheses."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-35",
"text": "2. In China, a great number of workers are engaged in pulling out the male organs of rice plants using tweezers, and one-third of rice produced in that country is grown from hybrid seeds."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-36",
"text": "Implicit=on the other hand At Plant Genetic Systems, researchers have isolated a pollen-inhibiting gene that can be inserted in a plant to confer male sterility."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-37",
"text": "3. On a commercial scale, the sterilization of the pollen-producing male part has only been achieved in corn and sorghum feed grains."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-38",
"text": "That's because the male part, the tassel, and the female, the ear, are some distance apart on the corn plant."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-39",
"text": "4. In a labor-intensive process, the seed companies cut off the tassels of each plant, making it male sterile."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-40",
"text": "They sow a row of malefertile plants nearby, which then pollinate the male-sterile plants."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-41",
"text": "EntRel The first hybrid corn seeds produced using this mechanical approach were introduced in the 1930s and they yielded as much as 20% more corn than naturally pollinated plants."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-42",
"text": "[wsj 0209] The training and the development data for the shared task was adapted from the Penn Discourse Treebank 2.0 (PDTB-2.0) (Prasad et al., 2008) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-43",
"text": "Our system was trained on the training partition and tuned using the development data."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-44",
"text": "Results in the paper are reported for the development and the test sets from PDTB, as well as for the blind test."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-46",
"text": "**SYSTEM DESCRIPTION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-47",
"text": "The system consists of multiple modules that are applied in a pipeline fashion."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-48",
"text": "This architecture is a standard approach that was originally proposed in Lin et al. (2014) and was followed with slight variations by systems in the last year competition (Xue et al., 2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-49",
"text": "Our design most closely resembles the pipeline proposed by the top system last year (Wang and Lan, 2015) , in that argument extraction for explicit relations is performed separately for Arg1 and Arg2, the non-explicit sense classifier is run twice."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-50",
"text": "The overall architecture of the system is shown in Figure 1 ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-51",
"text": "Given the input text, the connective classifier identifies explicit discourse connectives."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-52",
"text": "Next, the position classifier is invoked that determines for each explicit relation whether Arg1 is located in the same sentence as Arg2 (SS) or in a previous sentence (PS)."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-53",
"text": "The following three modules -SS Arg1/Arg2 Extractor, PS Arg1 Extractor, and PS Arg2 Extractor -extract text spans of the respective arguments."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-54",
"text": "Finally, the explicit sense classifier is applied."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-55",
"text": "Next, candidate sentence pairs for non-explicit relations are identified."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-56",
"text": "The non-explicit sense classifier is applied to these sentence pairs."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-57",
"text": "At this stage, it is run with the goal of separating EntRel relations from implicit relations, as EntRel relations have arguments corresponding to entire sentences, while the latter also require argument boundary identification."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-58",
"text": "Two argument extractors are then used to determine the argument boundaries boundaries of implicit relations."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-59",
"text": "After the argument boundaries of the implicit relations are identified, the non-explicit sense classifier is run again (the assumption is that with better boundary identification sense prediction can be improved)."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-61",
"text": "**SYSTEM COMPONENTS**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-62",
"text": "This section describes each component of the pipeline and introduces novel features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-64",
"text": "**IDENTIFYING EXPLICIT CONNECTIVES**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-65",
"text": "The purpose of the explicit connective classifier is to identify discourse connectives in text."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-66",
"text": "This is a binary classifier that, given a connective word or phrase (e.g. but or if . . . then) determines whether the connective functions as a discourse connective in the specific context."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-67",
"text": "We use the training data to generate a list of 145 connective words and phrases that may function as discourse connectives."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-68",
"text": "Only consecutive connectives that contain up to three tokens are addressed."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-69",
"text": "The features are based on previous work (Pitler et al., 2009; Lin et al., 2014; Wang and Lan, 2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-70",
"text": "Our classifier is a Maximum Entropy classifier implemented with the NLTK toolkit (Bird, 2006) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-72",
"text": "**IDENTIFYING ARG1 POSITION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-73",
"text": "For explicit relations, position of Arg2 is fixed to be the sentence where the connective itself occurs."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-74",
"text": "Arg1, on the other hand, can be located in the same sentence as the connective or in a previous sentence."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-75",
"text": "Given a connective and the sentence in which it occurs, the goal of the position classifier is to determine the location of Arg1."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-76",
"text": "This is a binary classifier with two classes: SS and PS."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-77",
"text": "We employ the features proposed in Lin et al. (2014) and additional features described in last year's top system (Wang and Lan, 2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-78",
"text": "The position classifier is trained using the Maximum Entropy algorithm and achieves an F1 score of 99.186% on the development data."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-79",
"text": "In line with prior work (Wang and Lan, 2015) , we consider PS to be the sentence that immediately precedes the connective."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-80",
"text": "About 10% of explicit discourse relations have Arg1 occurring in a sentence that does not immediately precede the connective."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-81",
"text": "These are missed at this point."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-83",
"text": "**EXPLICIT RELATIONS: ARGUMENT EXTRACTION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-84",
"text": "SS Argument Extractor: SS argument extractor identifies spans of Arg1 and Arg2 of explicit relations where Arg1 occurs in the same sentence, as the connective and Arg2."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-85",
"text": "We follow the constituent-based approach proposed in Kong et al. (2014) , without the joint inference and enhance it using features in Wang and Lan (2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-86",
"text": "This component is also trained with the Maximum Entropy algorithm."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-87",
"text": "PS Arg1 Extractor: We implement features described in Wang and Lan (2015) and add novel features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-88",
"text": "To identify candidate constituents, we follow Kong et al. (2014) , where constituents are defined loosely based on punctuation occurring in the sentence and clause boundaries as defined by SBAR tags."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-89",
"text": "We used the constituent split implemented in Wang and Lan (2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-90",
"text": "Based on earlier work (Wang and Lan, 2015; Lin et al., 2014) , we implement the following features: surface form of the verbs in the sentence (three features), last word of the current constituent (curr), last word of the previous constituent (prev), the first word of curr, and the lowercased form of the connective."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-91",
"text": "The novel features that we add are shown in Table 1."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-92",
"text": "These features use POS information of tokens in the constituents, punctuation between the constituents, and feature conjunctions."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-93",
"text": "PS Arg2 Extractor: Similar to PS Arg1 extractor, for this component we implement features described in Wang and Lan (2015) and add novel features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-94",
"text": "The novel features are the same as those introduced for PS Arg1 but also include the following additional features:"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-95",
"text": "\u2022 nextFirstW&puncBefore -the first word token of next and the punctuation before next."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-96",
"text": "\u2022 prevLastW&puncAfter -the last word token of prev and the punctuation after prev."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-97",
"text": "\u2022 POS of the connective string."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-98",
"text": "\u2022 The distance between the connective and the position of curr in the sentence."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-99",
"text": "The argument extractors are trained with the Averaged Perceptron algorithm, implemented within Learning Based Java (LBJ) (Rizzolo and Roth, 2010) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-101",
"text": "**EXPLICIT SENSE CLASSIFIER**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-102",
"text": "The goal of the explicit sense classifier is to determine what sense (e.g. Comparison.Contrast, Expansion.Conjunction, etc.) an explicit relation conveys."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-105",
"text": "A 3-level sense hierarchy has been defined in PDTB, which has four top-level senses: Comparison, Contingency, Expansion, and Temporal."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-106",
"text": "We use lexical and syntactic features based on previous work and also introduce new features:"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-107",
"text": "\u2022 C (Connective) string, C POS, prev + C, proposed in Lin et al. (2014) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-108",
"text": "\u2022 C self-category, parent-category of C, leftsibling-category of C, right-sibling-category of C, 4 C-Syn interactions, and 6 Syn-Syn interactions, introduced in Pitler et al. (2009) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-109",
"text": "\u2022 C parent-category linked context, previous connective and its POS of \"as\"(the connective and its POS of previous relation, if the connective of current relation is \"as\"), previous connective and its POS of \"when\", adopted from Wang and Lan (2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-110",
"text": "\u2022 Our new features: first token of C, second token of C (if exists), next word (next), C + next, prev + next, prev + C + next."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-111",
"text": "Table 1 : Novel features used in the PS Arg1 and PS Arg2 extractors."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-112",
"text": "Curr, prev, and next refer to the current, previous, and next constituent in the same sentence, respectively."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-113",
"text": "W denotes word token, and POS denotes the part-of-speech tag of a word."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-114",
"text": "For example, currFirstWAndCurrSecondW refers to the first two word tokens in curr, while prevLastPOS refers to the POS of the last token of prev, and nextFirstPOS refers to the POS of the first token of next."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-115",
"text": "For this task, we trained two classifiers -using Maximum Entropy and Averaged Perceptron algorithms -and chose Averaged Perceptron, as its performance was found to be superior."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-117",
"text": "**IDENTIFYING NON-EXPLICIT RELATIONS**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-118",
"text": "The first step in identifying non-explicit relations is the generation of sentence pairs that are candidate arguments for a non-explicit relation."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-119",
"text": "Following Wang and Lan (2015), we extract sentence pairs that satisfy the following three criteria:"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-120",
"text": "\u2022 Sentences are adjacent \u2022 Sentences occur within the same paragraph \u2022 Neither sentence participates in an explicit relation For all pairs of sentences that meet those criteria, we take the first sentence to be the location of Arg1, and the second sentence -the location of Arg2."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-121",
"text": "This approach is quite noisy since about 24% of all consecutive sentence pairs in the training data do not participate in a discourse relation."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-122",
"text": "We leave this for future work."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-124",
"text": "**NON-EXPLICIT SENSE CLASSIFIER**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-125",
"text": "Following previous work on non-explicit sense classification (Lin et al., 2009; Pitler et al., 2009; Rutherford and Xue, 2014) , we define four sets of binary feature groups: Brown clustering pairs, Brown clustering arguments, first-last words, and production rules."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-126",
"text": "Dependency rules and polarity features were also extracted, but did not improve the results and were removed from the final model."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-127",
"text": "A cutoff of 5 was used to prune all of the features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-128",
"text": "Additionally, Mutual Information (MI) was used to determine the most important features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-129",
"text": "The MI calculation took the 50 most important rules in each feature group, for each of the sixteen level 1 and level 2 hierarchies and EntRel."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-130",
"text": "This provided a total of 4 groups of 800 rules."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-131",
"text": "Recall that the non-explicit sense classifier has two passes."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-132",
"text": "On the first iteration, its primary goal is to separate EntRel from implicit relations."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-133",
"text": "On the second iteration, which is performed after the argument boundaries of implicit relations are identified, the sense classifier is run again on implicit relations with the predicted argument boundaries."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-134",
"text": "Note that the classifier in both cases is trained in the same way, as a multiclass classifier, even though the first time it is run with the purpose of distinguishing between AltLex relations and all other (implicit) relations."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-137",
"text": "**IMPLICIT RELATIONS: ARGUMENT EXTRACTION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-138",
"text": "The argument extractors for implicit relations are implemented in a way similar to explicit relation argument extraction."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-139",
"text": "Candidate sentences are split into constituents based on punctuation symbols and clause boundaries using the SBAR tag."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-140",
"text": "We use features in Lin et al. (2009) and Wang and Lan (2015) and augment these with novel features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-141",
"text": "Implicit Arg1 Extractor: The Implicit Arg1 extractor employs a rich set of features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-142",
"text": "Most of these are similar to those presented for PS Arg1 and PS Arg2 extractors in that we take into account POS information, punctuation symbols that occur on the boundaries of the constituents, as well as dependency relations in the constituent itself."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-143",
"text": "One key distinction of how we define the depen-dency relation features is that, in contrast to prior work that treats each dependency relation as a separate binary feature, we only consider the first two relations (r1 and r2, respectively) in curr, prev, and next, and take their conjunctions."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-144",
"text": "Our intuition is that the relations in the beginning of a constituent are most important, while the other relations are not that relevant."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-145",
"text": "This approach to feature generation also avoids sparseness, which was found to be a problem in earlier work."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-146",
"text": "Overall, we generate seven features that use dependency relations."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-147",
"text": "Implicit Arg2 Extractor: We use most of the features in Lin et al. (2014) and Wang and Lan (2015) to train the Arg2 extractor (for more details and explanation about the features, we refer the reader to the respective papers):"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-148",
"text": "\u2022 Lowercased and lemmatized verbs in curr"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-149",
"text": "\u2022 The first and last terms of curr"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-150",
"text": "\u2022 The last term of prev"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-151",
"text": "\u2022 The first term of next"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-152",
"text": "\u2022 The last term of prev + the first term of curr"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-153",
"text": "\u2022 The last term of curr + the first term of next"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-154",
"text": "\u2022 The position of curr in the sentence: start, middle, end, or whole sentence"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-155",
"text": "\u2022 Product of the curr and next production rules"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-157",
"text": "**EVALUATION AND RESULTS**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-158",
"text": "Evaluation in the shared task is conducted using a new web service called TIRA (Potthast et al., 2014) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-159",
"text": "We first evaluate the contribution of new features in individual components in 5.1."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-160",
"text": "In 5.2, we report performance of all components of the final system on the development set using gold."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-161",
"text": "Finally, in 5.3, we show official results on the development, test, and blind test sets."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-162",
"text": "Since the system is implemented as a pipeline, each component contributes errors."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-163",
"text": "We refer to the results as no error propagation (EP) when gold predictions are used, or with EP when automatic predictions generated from previous steps are employed."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-164",
"text": "The components of our final system are trained as follows: connective, position classifier, SS Arg1/Arg2 extractor and implicit Arg2 extractor (Maximum Entropy); explicit sense, PS Arg1, PS Arg2 extractors, Implicit Arg1 extractor (Averaged Perceptron); non-explicit sense (Na\u00efve Bayes)."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-165",
"text": "The choice of the learning algorithms was primarily motivated by prior work."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-166",
"text": "Additional experiments on argument extractors and explicit Wang and Lan (2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-167",
"text": "The new set of features is presented in Section 3."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-168",
"text": "Evaluation using gold connectives and argument boundaries (no EP)."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-169",
"text": "Table 3 : PS Arg1 extractor, no EP."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-170",
"text": "Baseline denotes taking the entire sentence as argument span."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-172",
"text": "**MODEL**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-173",
"text": "Base features refer to features used in Wang and Lan (2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-174",
"text": "sense classification indicated that Averaged Perceptron should be preferred for these sub-tasks."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-175",
"text": "Due to time constraints, we did not compare all three algorithms on all sub-tasks."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-177",
"text": "**IMPROVING INDIVIDUAL COMPONENTS**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-178",
"text": "We first evaluate the components for which we introduce new features."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-179",
"text": "We use gold annotations for evaluating the individual components below."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-180",
"text": "Explicit Sense Classifier: Table 2 evaluates the explicit sense classifier."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-181",
"text": "We compare our baseline model that implements the features proposed in Wang and Lan (2015) with the model that employs additional features introduced in 4.4."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-182",
"text": "Our baseline model performs slightly better than the one reported in Wang and Lan (2015) : we obtain 90.55 vs. 90.14, as reported in Wang and Lan (2015) ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-183",
"text": "Table 6 : Evaluation of each component on the development set (no EP)."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-184",
"text": "duced."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-185",
"text": "We implement the features in Wang and Lan (2015) and add our novel features shown in Table 1 ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-186",
"text": "Results for PS Arg1 extractor are shown in Table 3 ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-187",
"text": "The baseline refers to taking the entire sentence as argument span."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-188",
"text": "Overall, we obtain a 5 point improvement over the baseline method."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-189",
"text": "Similarly, Table 4 shows results for PS Arg2 extractor."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-190",
"text": "For PS Arg2 extractor, the classifiers are able to obtain a larger improvement compared to the baseline method."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-191",
"text": "Adding new features improves the results by three points."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-192",
"text": "We note that in Wang and Lan (2015) the numbers that correspond to the entire sentence baselines are not the same as those that we obtain, so we do not report a direct comparison with their models."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-193",
"text": "However, our base models implement the features they use."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-194",
"text": "Implicit Arg1 Extractor: In Table 5 , we evaluate the Implicit Arg1 extractor."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-195",
"text": "It achieves an improvement of 12 F1 points over the baseline method that considers the entire sentence to be the argument span."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-197",
"text": "**RESULTS ON THE DEVELOPMENT SET (NO EP)**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-198",
"text": "Performance of each component on the development set, as implemented in the submitted system, without EP, is shown in Table 6 ."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-199",
"text": "Table 8 : Official results on the test set."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-200",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-201",
"text": "**OFFICIAL EVALUATION RESULTS**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-202",
"text": "The overall system results on the three data sets -development, test, and blind test -are shown in Tables 7, 8 , and 9, respectively."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-203",
"text": "Table 9 : Official results on the blind test set."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-204",
"text": "----------------------------------"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-205",
"text": "**CONCLUSION**"
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-206",
"text": "This paper introduces an end-to-end discourse parser for English developed for the CoNLL-2016 shared task."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-207",
"text": "The entire system includes multiple components, which are organized in a pipeline fashion."
},
{
"sent_id": "d2ce392240108203377d8e51e89d09-C001-208",
"text": "We also present novel features and improve performance of several system components by incorporating these new features."
}
],
"y": {
"@SIM@": {
"gold_contexts": [
[
"d2ce392240108203377d8e51e89d09-C001-49"
]
],
"cite_sentences": [
"d2ce392240108203377d8e51e89d09-C001-49"
]
},
"@USE@": {
"gold_contexts": [
[
"d2ce392240108203377d8e51e89d09-C001-69"
],
[
"d2ce392240108203377d8e51e89d09-C001-77",
"d2ce392240108203377d8e51e89d09-C001-78",
"d2ce392240108203377d8e51e89d09-C001-79"
],
[
"d2ce392240108203377d8e51e89d09-C001-85"
],
[
"d2ce392240108203377d8e51e89d09-C001-87"
],
[
"d2ce392240108203377d8e51e89d09-C001-89",
"d2ce392240108203377d8e51e89d09-C001-90"
],
[
"d2ce392240108203377d8e51e89d09-C001-93"
],
[
"d2ce392240108203377d8e51e89d09-C001-109"
],
[
"d2ce392240108203377d8e51e89d09-C001-119"
],
[
"d2ce392240108203377d8e51e89d09-C001-140"
],
[
"d2ce392240108203377d8e51e89d09-C001-147"
],
[
"d2ce392240108203377d8e51e89d09-C001-173"
],
[
"d2ce392240108203377d8e51e89d09-C001-181"
],
[
"d2ce392240108203377d8e51e89d09-C001-185"
]
],
"cite_sentences": [
"d2ce392240108203377d8e51e89d09-C001-69",
"d2ce392240108203377d8e51e89d09-C001-77",
"d2ce392240108203377d8e51e89d09-C001-79",
"d2ce392240108203377d8e51e89d09-C001-85",
"d2ce392240108203377d8e51e89d09-C001-87",
"d2ce392240108203377d8e51e89d09-C001-89",
"d2ce392240108203377d8e51e89d09-C001-90",
"d2ce392240108203377d8e51e89d09-C001-93",
"d2ce392240108203377d8e51e89d09-C001-109",
"d2ce392240108203377d8e51e89d09-C001-119",
"d2ce392240108203377d8e51e89d09-C001-140",
"d2ce392240108203377d8e51e89d09-C001-147",
"d2ce392240108203377d8e51e89d09-C001-173",
"d2ce392240108203377d8e51e89d09-C001-181",
"d2ce392240108203377d8e51e89d09-C001-185"
]
},
"@DIF@": {
"gold_contexts": [
[
"d2ce392240108203377d8e51e89d09-C001-182"
],
[
"d2ce392240108203377d8e51e89d09-C001-192"
]
],
"cite_sentences": [
"d2ce392240108203377d8e51e89d09-C001-182",
"d2ce392240108203377d8e51e89d09-C001-192"
]
}
}
},
"ABC_06917a1dd02d55c827e7e07eeae2da_2": {
"x": [
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-24",
"text": "In the former approach argument span extraction is applied right after discourse connective detection, while the latter approach also requires argument position classification."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-25",
"text": "The decision on argument span can be made on different levels: from token-level to sentence-level."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-26",
"text": "In (Ghosh et al., 2011 ) the decision is made on tokenlevel, and the problem is cast as sequence labeling using conditional random fields (CRFs) (Lafferty et al., 2001) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-27",
"text": "In this paper we focus on argument span extraction, and extend the token-level sequence labeling approach of (Ghosh et al., 2011) with the separate models for arguments of intra-sentential and intersentential explicit discourse relations."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-28",
"text": "To compare to the other approaches (i.e. (Lin et al., 2012) and (Xu et al., 2012) ) we adopt the immediately previous sentence heuristic to select a candidate Arg1 sentence for the inter-sentential relations."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-29",
"text": "Additionally to the heuristic, we train and test CRF argument span extraction models to extract exact argument spans."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-30",
"text": "The paper is structured as follows."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-31",
"text": "In Section 2 we briefly present the corpus that was used in the experiments -Penn Discourse Treebank."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-32",
"text": "Section 3 describes related works."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-33",
"text": "Section 4 defines the problem and assesses its complexity."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-2",
"text": "Discourse relation parsing is an important task with the goal of understanding text beyond the sentence boundaries."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-3",
"text": "One of the subtasks of discourse parsing is the extraction of argument spans of discourse relations."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-4",
"text": "A relation can be either intra-sentential -to have both arguments in the same sentence -or inter-sentential -to have arguments span over different sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-5",
"text": "There are two approaches to the task."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-6",
"text": "In the first approach the parser decision is not conditioned on whether the relation is intra-or intersentential."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-7",
"text": "In the second approach relations are parsed separately for each class."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-8",
"text": "The paper evaluates the two approaches to argument span extraction on Penn Discourse Treebank explicit relations; and the problem is cast as token-level sequence labeling."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-9",
"text": "We show that processing intra-and inter-sentential relations separately, reduces the task complexity and significantly outperforms the single model approach."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-12",
"text": "Discourse analysis is one of the most challenging tasks in Natural Language Processing, that has applications in many language technology areas such as opinion mining, summarization, information extraction, etc."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-13",
"text": "(see (Webber et al., 2011) and (Taboada and Mann, 2006) for detailed review)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-14",
"text": "With the availability of annotated corpora, such as Penn Discourse Treebank (PDTB) (Prasad et al., 2008) , statistical discourse parsers were developed (Lin et al., 2012; Ghosh et al., 2011; Xu et al., 2012) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-15",
"text": "PDTB adopts non-hierarchical binary view on discourse relations: Argument 1 (Arg1) and Argument 2 (Arg2), which is syntactically attached to a discourse connective."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-16",
"text": "Thus, PDTB-based discourse parsing can be roughly partitioned into discourse relation detection, argument position classification, argument span extraction, and relation sense classification."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-17",
"text": "For discourse relations signaled by a connective (explicit relations), discourse relation detection is cast as classification of connectives as discourse and non-discourse."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-18",
"text": "Argument position classification involves detection of the location of Arg1 with respect to Arg2: usually either the same sentence (SS) or previous ones (PS)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-19",
"text": "1 Argument span extraction, on the other hand, is extraction (labeling) of text segments that belong to each of the arguments."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-20",
"text": "Finally, relation sense classification is the annotation of relations with the senses from PDTB."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-21",
"text": "Since arguments of explicit discourse relations can appear in the same sentence or in different ones (i.e. relations can be intra-or inter-sentential); there are two approaches to argument span extraction."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-22",
"text": "In the first approach the parser decision is not conditioned on whether the relation is intra-or inter-sentential (e.g. (Ghosh et al., 2011) )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-23",
"text": "In the second approach relations are parsed separately for each class (e.g. (Lin et al., 2012; Xu et al., 2012) )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-34",
"text": "In Section 5 we describe argument span extraction cast as the token-level sequence labeling; and in Section 6 we present the evaluation of the two approaches -either single or separate processing of intra-and inter-sentential relations."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-35",
"text": "Section 7 provides concluding remarks."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-37",
"text": "**THE PENN DISCOURSE TREEBANK**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-38",
"text": "The Penn Discourse Treebank (PDTB) (Prasad et al., 2008 ) is a corpus that contains discourse relation annotation on top of WSJ corpus; and it is aligned with Penn Treebank (PTB) syntactic tree annotation."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-39",
"text": "Discourse relations in PDTB are binary: Arg1 and Arg2, where Arg2 is an argument syntactically attached to a discourse connective."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-40",
"text": "With respect to Arg2, Arg1 can appear in the same sentence (SS case), one of the preceding (PS case) or following (FS case) sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-41",
"text": "A discourse connective is a member of a well defined list of 100 connectives and a relation expressed via such connective is an Explicit relation."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-42",
"text": "There are other types of discourse and non-discourse relations annotated in PDTB; however, they are out of the scope of this paper."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-43",
"text": "Discourse relations are annotated using 3-level hierarchy of senses."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-44",
"text": "The top level (level 1) senses are the most general: Comparison, Contingency, Expansion, and Temporal (Prasad et al., 2008) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-46",
"text": "**RELATED WORK**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-47",
"text": "Pitler and Nenkova (2009) applied machine learning methods using lexical and syntactic features and achieved high classification performance on discourse connective detection task (F 1 : 94.19%, 10 fold crossvalidation on PDTB sections 02-22)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-48",
"text": "Later, Lin et al. (2012) achieved an improvement with additional lexico-syntactic and path features (F 1 : 95.76%)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-49",
"text": "After a discourse connective is identified as such, it is classified into relation senses annotated in PDTB."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-50",
"text": "Pitler and Nenkova (2009) classify discourse connectives into 4 top level senses -Comparison, Contingency, Expansion, and Temporal -and achieve accuracy of 94.15%, which is slightly above the interannotator agreement."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-51",
"text": "In this paper we focus on the parsing steps after discourse connective detection; thus, we use gold reference connectives and their senses as features."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-52",
"text": "The approaches used for the argument position classification even though useful, are incomplete as they do not make decision on argument spans."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-53",
"text": "(Wellner and Pustejovsky, 2007) and (Elwell and Baldridge, 2008) , following them, used machine learning methods to identify head words of the arguments of explicit relations expressed by discourse connectives."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-54",
"text": "(Prasad et al., 2010) , on the other hand, addressed a more difficult task of identification of sentences that contain Arg1 for cases when arguments are located in different sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-55",
"text": "Dinesh et al. (2005) and Lin et al. (2012) approach the problem of argument span extraction on syntactic tree node-level."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-56",
"text": "In the former, it is a rule based system that covers limited set of connectives; whereas in the latter it is a machine learning approach with full PDTB coverage."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-57",
"text": "Both apply syntactic tree subtraction to get argument spans."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-58",
"text": "Xu et al. (2012) approach the problem on a constituent-level: authors first decide whether a constituent is a valid argument and then whether it is Arg1, Arg2, or neither."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-59",
"text": "Ghosh et al. (2011 ) (and further (Ghosh et al., 2012a Ghosh et al., 2012b) ), on the other hand, cast the problem as tokenlevel sequence labeling."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-60",
"text": "In this paper we follows the approach of (Ghosh et al., 2011 (Prasad et al., 2008) ); and distribution of Arg2 with respect to extent in inter-sentential explicit discourse relations."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-61",
"text": "SS = same sentence as the connective; IPS = immediately previous sentence; NAPS = non-adjacent previous sentence; FS = some sentence following the sentence containing the connective; SingFull = Single Full sentence; SingPart = Part of single sentence; MultFull = Multiple full sentences; MultPart = Parts of multiple sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-63",
"text": "**IMMEDIATELY PREVIOUS SENTENCE HEURISTIC**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-64",
"text": "According to Prasad et al. (2008) 's analysis of explicit discourse relations annotated in PDTB, out of 18,459 relations, 11,236 (60.9%) have both of the arguments in the same sentence (SS case), 7,215 (39.1%) have Arg1 in the sentences preceding the Arg2 (PS case), and only 8 instances have Arg1 in the sentences following Arg2 (FS case)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-65",
"text": "Since FS case has too few instances it is usually ignored."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-66",
"text": "For the PS case, the Arg1 is located either in Immediately Previous Sentences (IPS: 30.1%) or in some Non-Adjacent Previous Sentences (NAPS: 9.0%)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-67",
"text": "CRF-based discourse parser of Ghosh et al. (2011) , which processes SS and PS cases with the same model, uses \u00b12 sentence window as a hypothesis space (5 sentences: 1 sentence containing the connective, 2 preceding and 2 following sentences)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-68",
"text": "The window size is motivated by the observation that it entirely covers arguments of 94% of all explicit relations."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-69",
"text": "The authors also report that the performance of the parser on inter-sentential relations (i.e. mainly PS case) has F-measure of 36.0."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-70",
"text": "However, since in 44.2% of inter-sentential explicit discourse relations Arg1 fully covers the sentence immediately preceding Arg2 (see Table 1 partially copied from (Prasad et al., 2008) ), the heuristic that selects the immediately previous sentence and tags all of its tokens as Arg1 already yields F-measure of 44.2 over all PDTB (the performance on the test set may vary)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-71",
"text": "The same heuristic is mentioned in (Lin et al., 2012 ) and (Xu et al., 2012) as a majority classifier for the relations with Arg1 in previous sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-72",
"text": "Compared to the \u00b12 window, the heuristic covers Arg1 of only 88.4% explicit discourse relations (60.9% SS + 27.5% PS); since it ignores all the relations with Arg1 in Non-Adjacent Previous Sentences (NAPS) (9.0% of all explicit relations), and does not accommodate Arg1 spanning multiple immediately preceding sentences (2.6% of all explicit relations)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-73",
"text": "Nevertheless, 70.2% of all PS explicit relations have Arg1 entirely inside the immediately previous sentence."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-74",
"text": "Thus, the integration of the heuristic is expected to improve the argument span extraction performance for inter-sentential Arg1."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-75",
"text": "In 98.5% of all PS cases Arg2 is within the sentence containing the connective (remaining 1.5% are multisentence Arg2); and in 71.7% of all PS cases it fully covers the sentence containing the discourse connective (see Table 1 )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-76",
"text": "Thus, similar heuristic for Arg2 is to tag all the tokens of the sentence except the connective as Arg2."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-77",
"text": "For the heuristics to be applicable, a discourse connective has to be classified as requiring its Arg1 in the same sentence (SS) or the previous ones (PS), i.e. it requires argument position classification."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-79",
"text": "**ARGUMENT POSITION CLASSIFICATION**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-80",
"text": "Explicit discourse connectives, annotated in PDTB, belong to one of the three syntactic categories: (1) subordinating conjunctions (e.g. when), (2) coordinating conjunctions (e.g. and), and (3) discourse adverbials (e.g. for example)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-81",
"text": "With few exceptions, a discourse connective belongs to a single syntactic category (see Appendix A in (Knott, 1996) Table 2 : Distribution of discourse connectives in PDTB with respect to syntactic category (rows) and position in the sentence (columns) and the location of Arg1 as in the same sentence (SS) as the connective or the previous sentences (PS)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-82",
"text": "The case when Arg1 appears in some following sentence (FS) is ignored, since it has only 8 instances."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-83",
"text": "sition of Arg1, depending on whether the connective appears sentence-initially or sentence-medially."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-84",
"text": "Here, a connective is considered sentence-initial if it appears as the first sequence of words in a sentence."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-85",
"text": "Table 2 presents the distribution of discourse connectives in PDTB with respect to the syntactic categories, their position in the sentence, and having Arg1 in the same or previous sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-86",
"text": "The distribution of sentencemedial discourse adverbials, which is the most ambiguous class, between SS and PS cases is 17.5% to 82.5%; for all other classes it higher than 90% to 10%."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-87",
"text": "Thus, the overall accuracy of the SS vs. PS majority classification using just syntactic category and position information is already 95.0%."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-88",
"text": "When analyzed on per connective basis, the observation is that some connectives require Arg1 in the same or previous sentence irrespective of their position in the sentence."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-89",
"text": "For instance, sentence-initial subordinating conjunction so always has its Arg1 in the previous sentence; and the parallel sentence-initial subordinating conjunction if..then in the same sentence."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-90",
"text": "Others, such as sentence-medial adverbials however and meanwhile mainly require their Arg1 in the previous sentence."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-91",
"text": "Even though low, there is still an ambiguity: e.g. for sentence-medial adverbials also, therefore, still, instead, in fact, etc."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-92",
"text": "Arg1 appears in SS and PS cases evenly."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-93",
"text": "Consequently, assigning the position of the Arg1 considering the discourse connective, together with its syntactic category and its position in the sentence, for PDTB will be correct in more than 95% of instances."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-94",
"text": "In the literature, the task of argument position classification was addressed by several researchers (e.g. (Prasad et al., 2010) , (Lin et al., 2012) )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-95",
"text": "Lin et al. (2012) , for instance, report F 1 of 97.94% for a classifier trained on PDTB sections 02-21, and tested on section 23."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-96",
"text": "The task has a very high baseline and even higher performance on supervised machine learning, Table 3 : Feature sets for Arg2 and Arg1 argument span extraction in (Ghosh et al., 2011) which is an additional motivation to process intra-and inter-sentential relations separately."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-97",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-98",
"text": "**PARSING MODELS**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-99",
"text": "We replicate and evaluate the discourse parser of (Ghosh et al., 2011) , then modify it to process intraand inter-sentential explicit relations separately."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-100",
"text": "This is achieved by integrating Argument Position Classification and Immediately Previous Sentence heuristic into the parsing pipe-line."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-101",
"text": "Since the features used to train argument span extraction models for both approaches are the same, we first describe them in Subsection 5.1."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-102",
"text": "Then we proceed with the description of the single model discourse parser (our baseline) and separate models discourse parser, Subsections 5.2 and 5.3, respectively."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-104",
"text": "**FEATURES**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-105",
"text": "The features used to train the models for Arg1 and Arg2 are given in Table 3 ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-106",
"text": "Besides the token itself (TOK), the rest of the features is described below."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-107",
"text": "Lemma (LEM) and inflectional affixes (INFL) are extracted using morpha tool (Minnen et al., 2001) , that requires token and its POS-tag as input."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-108",
"text": "For instance, for the word flashed the lemma and infection features are 'flash' and '+ed', respectively."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-109",
"text": "IOB-Chain (IOB) is the path string of the syntactic tree nodes from the root node to the token, prefixed with the information whether a token is at the beginning (B-) or inside (I-) the constituent."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-110",
"text": "The feature is extracted using the chunklink tool (Buchholz, 2000) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-111",
"text": "For example, the IOB-Chain 'I-S/B-VP' indicates that a token is the first word of the verb phrase (B-VP) of the main clause (I-S)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-112",
"text": "PDTB Level 1 Connective sense (CONN) is the most general sense of a connective in PDTB sense hierarchy: one of Comparison, Contingency, Expansion, or Temporal."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-113",
"text": "For instance, a discourse connective when might have the CONN feature 'Temporal' or 'Contingency' depending on the discourse relation it appears in, or 'NULL' in case of non-discourse usage."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-114",
"text": "The value of the feature is 'NULL' for all tokens except the discourse connective."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-115",
"text": "Boolean Main Verb (BMV) is a feature that indicates whether a token is a main verb of a sentence or not (Yamada and Matsumoto, 2003) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-116",
"text": "For instance in the sentence Prices collapsed when the news flashed, the main verb is collapsed; thus, its BMV feature is '1', whereas for the rest of tokens it is '0'."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-117",
"text": "Previous Sentence Feature (PREV) signals if a sentence immediately precedes the sentence starting with a connective, and its value is the first token of the connective (Ghosh et al., 2011) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-118",
"text": "For instance, if some sentence A is followed by a sentence B starting with discourse connective On the other hand, all the tokens of the sentence A have the PREV feature value 'On'."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-119",
"text": "The feature is similar to a heuristic to select the sentence immediately preceding a sentence starting with a connective as a candidate for Arg1."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-120",
"text": "Arg2 Label (ARG2) is an output of Arg2 span extraction model, and it is used as a feature for Arg1 span extraction."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-121",
"text": "Since for sequence labeling we use IOBE (Inside, Out, Begin, End) notation, the possible values of ARG2 are IOBE-tagged labels, i.e. 'ARG2-B' -if a word is the first word of Arg2, 'ARG2-I' -if a word is inside the argument span, 'ARG2-E' -if a word is in the last word of Arg2, and 'O' otherwise."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-122",
"text": "CRF++ 2 -conditional random field implementation we use -allows definition of feature templates."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-123",
"text": "Via templates these features are enriched with ngrams: tokens with 2-grams in the window of \u00b11 to- Figure 1: Single model discourse parser architecture of (Ghosh et al., 2011) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-124",
"text": "CRF argument span extraction models are in bold."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-125",
"text": "kens, and the rest of the features with 2 & 3-grams in the window of \u00b12 tokens."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-126",
"text": "For instance, labeling a token as Arg2 is an assignment of one of the four possible labels: ARG2-B, ARG2-I, ARG2-E and O (ARG2 with IOBE notation)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-127",
"text": "The feature set (token, lemma, inflection, IOBchain and connective sense (see Table 3 )) is expanded by CRF++ via template into 55 features (5 * 5 unigrams, 2 token bigrams, 4 * 4 bigrams and 4 * 3 trigrams of other features)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-129",
"text": "**SINGLE MODEL DISCOURSE PARSER**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-130",
"text": "The discourse parser of (Ghosh et al., 2011 ) is a cascade of CRF models to sequentially label Arg2 and Arg1 spans (since Arg2 label is a feature for Arg1 model) (see Figure 1 )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-131",
"text": "There is no distinction between intra-and inter-sentential relations, rather the single model jointly decides on the position and the span of an argument (either Arg1 or Arg2, not both together) in the window of \u00b12 sentences (the parser will be further abbreviated as W5P -Window 5 Parser)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-132",
"text": "The single model parser achieves F-measure of 81.7 for Arg2 and 60.3 for Arg1 using CONNL evaluation script."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-133",
"text": "The performance is higher than (Ghosh et al., 2011 ) -Arg2: F 1 of 79.1 and Arg1: F 1 of 57.3 -due to improvements in feature and instance extraction, such as the treatment of multi-word connectives."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-134",
"text": "These models are the baseline for comparison with separate models architecture."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-135",
"text": "However, we change the evaluation method (see Section 6)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-136",
"text": "Figure 2 depicts the architecture of the discourse parser processing intra-and inter-sentential relations separately."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-137",
"text": "It is a combination of argument position classification with specific CRF models for each of the arguments of SS and PS cases, i.e. there are 4 CRF models -SS Arg1 and Arg2, and PS Arg1 and Arg2 (following sentence case (FS) is ignored)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-138",
"text": "SS models are applied in a cascade and, similar to the baseline single model parser, Arg2 label is a feature for Arg1 span extraction."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-139",
"text": "These SS models are trained using exactly the same features, with the exception of PREV feature: since we consider only the sentence containing the connective, it naturally falls out."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-141",
"text": "**SEPARATE MODELS DISCOURSE PARSER**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-142",
"text": "For the PS case, we apply a heuristic to select candidate sentences."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-143",
"text": "Based on the observation that in PDTB for the PS case Arg2 span is fully located in the sentence containing the connective in 98.5% of instances; and Arg1 span is fully located in the sentence immediately preceding Arg2 in 71.7% of instances; we select sentences in these positions to train and test respective CRF models."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-144",
"text": "The feature set for Arg2 remains the same, whereas, from Arg1 feature set we remove PREV and Arg2 label (since in PS case Arg2 is in different sentence, the feature will always have the same value of 'O')."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-145",
"text": "For Argument Position Classification we train unigram BoosTexter (Schapire and Singer, 2000) model with 100 iterations 3 on PDTB sections 02-22 and test on sections 23-24; and, similar to other researchers, achieve high results: F 1 = 98.12."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-146",
"text": "The features are connective surface string, POS-tags, and IOBchains."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-147",
"text": "The results obtained using automatic features (F 1 = 97.87) are insignificantly lower (McNemar's \u03c7 2 (1, 1595) = 0.75, p = 0.05); thus, this step will not cause deterioration in performance with automatic features."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-148",
"text": "Here we used Stanford Parser (Klein and Manning, 2003) to obtain POS-tags and automatic constituency-based parse trees."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-149",
"text": "Since both argument span extraction approaches are equally affected by the discourse connective detection step, we use gold reference connectives."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-150",
"text": "As an alternative, discourse connectives can be detected with high accuracy using addDiscourse tool (Pitler and Nenkova, 2009 )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-151",
"text": "In the separate models discourse parser, the steps of the process to extract argument spans given a discourse connective are as follows: The separate model parser with CRF models will be further abbreviated as SMP; and with the heuristics for PS case as hSMP."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-153",
"text": "**EXPERIMENTS AND RESULTS**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-154",
"text": "We first describe the evaluation methodology."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-155",
"text": "Then present evaluation of PS case CRF models against the heuristic."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-156",
"text": "In subsection 6.3 we compare the performance of the single and separate model parsers on SS and PS cases of the test set separately and together."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-178",
"text": "**ERROR PROPAGATION**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-157",
"text": "Finally, we compare the results of the separate model parser to (Lin et al., 2012) and (Xu et al., 2012) ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-159",
"text": "**EVALUATION**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-160",
"text": "There are two important aspects regarding the evaluation."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-161",
"text": "First, in this paper it is different from (Ghosh et al., 2011) ; thus, we first describe it and evaluate the difference."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-162",
"text": "Second, in order to compare the baseline single and separate model parsers, the error from argument position classification has to be propagated for the latter one; and the process is described in 6.1.2."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-163",
"text": "Since both versions of the parser are affected by automatic features, the evaluation is on gold features only."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-164",
"text": "The exception is for Arg2 label; since it is generated within the segment of the pipeline we are in-terested in."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-165",
"text": "Unless stated otherwise, all the results for Arg1 are reported for automatic Arg2 labels as a feature."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-166",
"text": "Following (Ghosh et al., 2011) PDTB is split as Sections 02-22 for training, 00-01 for development, and 23-24 for testing."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-168",
"text": "**CONLL VS. STRING-BASED EVALUATION**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-169",
"text": "Ghosh et al. (2011) report using CONLL-based evaluation script."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-170",
"text": "However, it is not well suited for the evaluation of argument spans because the unit of evaluation is a chunk -a segment delimited by any outof-chunk token or a sentence boundary."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-171",
"text": "However, in PDTB arguments can (1) span over several sentences, (2) be non-contiguous in the same sentence."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-172",
"text": "Thus, CONLL-based evaluation yields incorrect number of test instances: Ghosh et al. (2011) report 1,028 SS and 617 PS test instances for PDTB sections 23-24 (see caption of Table 7 in the original paper), which is 1,645 in total; whereas there is only 1,595 explicit relations in these sections."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-173",
"text": "In this paper, the evaluation is string-based; i.e. an argument span is correct, if it matches the whole reference string."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-174",
"text": "Following (Ghosh et al., 2011) and (Lin et al., 2012) , argument initial and final punctuation marks are removed; and precision (p), recall (r) and F 1 score are computed using the equations 1 -3."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-175",
"text": "In the equations, Exact Match is the count of correctly tagged argument spans; No Match is the count of argument spans that do not match the reference string exactly (even one token difference is counted as an error); and References in Gold is the total number of arguments in the reference."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-176",
"text": "String-based evaluation of the single model discourse parser with gold features reduces F 1 for Arg2 from 81.7 to 77.8 and for Arg1 from 60.33 to 55.33."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-179",
"text": "Since the single model parser applies argument span extraction right after discourse connective detection, Table 4 : Argument span extraction performance of the heuristics (hSMP) and the CRF models (SMP) on intersentential relations (PS case)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-180",
"text": "Results are reported as precision (P), recall (R) and F-measure (F1) whereas in the separate model parser there is an additional step of argument position classification; for the two to be comparable an error from the argument position classification is propagated."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-181",
"text": "Even though, the performance of the classifier is very high (98.12%) there are still some misclassified instances."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-182",
"text": "These instances are propagated to the counts of Exact Match and No Match of the argument span extraction."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-183",
"text": "For example, if the argument position classifier misclassified an SS connective as PS; in the SS evaluation its Arg1 and Arg2 are considered as not recalled regardless of argument span extractor's decision (i.e. neither Exact Match nor No Match); and in the PS evaluation, they are both considered as No Match."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-184",
"text": "The separate model discourse parser results are reported without error propagation for in-class comparison of the heuristic and CRF models, and with error propagation for cross-class comparison with the single model parser."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-185",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-186",
"text": "**HEURISTIC VS. CRF MODELS**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-187",
"text": "The goal of this section is to assess the benefit of training CRF models for the extraction of exact argument spans of PS Arg1 and Arg2 on top of the heuristics."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-188",
"text": "The performance of the heuristics (immediately previous sentence for Arg1 and the full sentence except the connective for Arg2) and the CRF models is reported in Table 4 ."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-189",
"text": "CRF models perform significantly better for Arg2 (McNemar's \u03c7 2 (1, 620) = 7.48, p = 0.05)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-190",
"text": "Even though, they perform 2.7% better for Arg1, the difference is insignificant (McNemar's \u03c7 2 (1, 620) = 0.66, p = 0.05)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-191",
"text": "For both arguments, the CRF model results are lower than expected."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-193",
"text": "**SINGLE VS. SEPARATE MODELS**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-194",
"text": "To compare the single and the separate model parsers, the results of the former must be split into SS and PS cases."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-195",
"text": "For the latter, on the other hand, we propagate Table 6 : Performance of the single model parser (W5P) and the separate model parser with the heuristics (hSMP) and CRF models (SMP) on argument span extraction of PS relations; reported as precision (P), recall (R) and F-measure (F1)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-196",
"text": "For the separate model parsers, results include error propagation from argument position classification."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-197",
"text": "error from the argument position classification step."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-198",
"text": "For the PS case we also report the performance of the heuristic with error propagation."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-199",
"text": "Table 5 reports the results for the SS case, and Table 6 reports the results for the PS case."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-200",
"text": "In both cases the separate model parser with error propagation from argument position classification step significantly outperforms the single model parser."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-201",
"text": "The performance of the separate model parsers (reported in Table 7 ) with heuristics and CRF models on all relations (SS + PS) both are significantly better than the performance of single \u00b12 window model parser (for SMP McNemar's \u03c7 2 (1, 1595) = 17.75 for Arg2 and \u03c7 2 (1, 1595) = 19.82 for Arg1, p = 0.05)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-202",
"text": "Table 7 : Performance of the single model parser (W5P) and the separate model parser with the heuristics (hSMP) and CRF models (SMP) on argument span extraction of all relations; reported as precision (P), recall (R) and F-measure (F1)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-203",
"text": "For the separate model parsers, results include error propagation from argument position classification."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-204",
"text": "Lin et al. (2012) (Lin et al., 2012) and (Xu et al., 2012) reported as F-measure (F1)."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-205",
"text": "Trained on PDTB sections 02-21, tested on 23."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-206",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-208",
"text": "6.4 Comparison of Separate Model Parser to (Lin et al., 2012) and (Xu et al., 2012) The separate model parser allows to compare argument span extraction cast as token-level sequence labeling to the syntactic tree-node level classification approach of (Lin et al., 2012) and constituent-level classification approach of (Xu et al., 2012) ; since now the complexity and the hypothesis spaces are equal."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-209",
"text": "For this purpose we train models on sections 02-21 and test on 23."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-210",
"text": "Unfortunately, the authors do not report the results on SS and PS cases separately, but only the combined results that include the heuristic."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-211",
"text": "Moreover, the performance of the heuristic is mentioned to be 76.9% instead of 44.2% for the exact match (see IPS x SingFull cell in Table 1 or Table 1 in (Prasad et al., 2008) )."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-212",
"text": "Thus, the comparison provided here is not definite."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-213",
"text": "Since all systems have different components up the pipe-line, the only possible comparison is without error propagation."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-214",
"text": "From the results in Table 8 , we can observe that all the systems perform well on Arg2."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-215",
"text": "As expected, for the harder case of Arg1, performances are lower."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-217",
"text": "**CONCLUSION**"
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-218",
"text": "In this paper we compare two strategies for the argument span extraction: to process intra-and intersentential explicit relations by a single model, or separate ones."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-219",
"text": "We extend the approach of (Ghosh et al., 2011) to argument span extraction cast as token-level sequence labeling using CRFs and integrate argument position classification and immediately previous sentence heuristic."
},
{
"sent_id": "06917a1dd02d55c827e7e07eeae2da-C001-220",
"text": "The evaluation of parsing strategies on the PDTB explicit discourse relations shows that the models trained specifically for intra-and intersentential relations significantly outperform the single \u00b12 window models."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"06917a1dd02d55c827e7e07eeae2da-C001-14"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-22"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-26"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-67"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-96"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-117"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-123"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-130"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-169"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-172"
]
],
"cite_sentences": [
"06917a1dd02d55c827e7e07eeae2da-C001-14",
"06917a1dd02d55c827e7e07eeae2da-C001-22",
"06917a1dd02d55c827e7e07eeae2da-C001-26",
"06917a1dd02d55c827e7e07eeae2da-C001-67",
"06917a1dd02d55c827e7e07eeae2da-C001-96",
"06917a1dd02d55c827e7e07eeae2da-C001-117",
"06917a1dd02d55c827e7e07eeae2da-C001-123",
"06917a1dd02d55c827e7e07eeae2da-C001-130",
"06917a1dd02d55c827e7e07eeae2da-C001-169",
"06917a1dd02d55c827e7e07eeae2da-C001-172"
]
},
"@EXT@": {
"gold_contexts": [
[
"06917a1dd02d55c827e7e07eeae2da-C001-26",
"06917a1dd02d55c827e7e07eeae2da-C001-27"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-99"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-219"
]
],
"cite_sentences": [
"06917a1dd02d55c827e7e07eeae2da-C001-26",
"06917a1dd02d55c827e7e07eeae2da-C001-27",
"06917a1dd02d55c827e7e07eeae2da-C001-99",
"06917a1dd02d55c827e7e07eeae2da-C001-219"
]
},
"@USE@": {
"gold_contexts": [
[
"06917a1dd02d55c827e7e07eeae2da-C001-60"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-166"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-174"
]
],
"cite_sentences": [
"06917a1dd02d55c827e7e07eeae2da-C001-60",
"06917a1dd02d55c827e7e07eeae2da-C001-166",
"06917a1dd02d55c827e7e07eeae2da-C001-174"
]
},
"@DIF@": {
"gold_contexts": [
[
"06917a1dd02d55c827e7e07eeae2da-C001-133",
"06917a1dd02d55c827e7e07eeae2da-C001-134",
"06917a1dd02d55c827e7e07eeae2da-C001-135"
],
[
"06917a1dd02d55c827e7e07eeae2da-C001-161"
]
],
"cite_sentences": [
"06917a1dd02d55c827e7e07eeae2da-C001-133",
"06917a1dd02d55c827e7e07eeae2da-C001-161"
]
}
}
},
"ABC_c897c2ea0d641f1f35072be4a5a7d3_2": {
"x": [
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-22",
"text": "The paper is organized as follows."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-3",
"text": "For many text classification tasks, there is a major problem posed by the lack of labeled data in a target domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-4",
"text": "Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-5",
"text": "Recently, string kernels have obtained state-ofthe-art results in various text classification tasks such as native language identification or automatic essay scoring."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-6",
"text": "Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-7",
"text": "In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-8",
"text": "By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-11",
"text": "Domain shift is a fundamental problem in machine learning, that has attracted a lot of attention in the natural language processing and vision communities [2, 6, 11, 13, 29, 30, 32, 37, 39, 40, 42] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-12",
"text": "To understand and address this problem, generated by the lack of labeled data in a target domain, researchers have studied the behavior of machine learning methods in cross-domain settings [12, 13, 29] and came up with various domain adaptation techniques [6, 11, 28, 39] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-13",
"text": "In crossdomain classification, a classifier is trained on data from a source domain and tested on data from a (different) target domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-14",
"text": "The accuracy of machine learning methods is usually lower in the cross-domain setting, due to the distribution gap between different domains."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-15",
"text": "However, researchers proposed several domain adaptation techniques by using the unlabeled test data to obtain better performance [5, 14, 16, 25, 37] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-16",
"text": "Interestingly, some recent works [13, 18] indicate that string kernels can yield robust results in the cross-domain setting without any domain adaptation."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-17",
"text": "In fact, methods based on string kernels have demonstrated impressive results in various text classification tasks ranging from native language identification [22] [23] [24] 36] and authorship identification [34] to dialect identification [4, 18, 21] , sentiment analysis [13, 35] and automatic essay scoring [7] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-18",
"text": "As long as a labeled training set is available, string kernels can reach state-of-the-art results in various languages including English [7, 13, 23] , Arabic [4, 17, 18, 24] , Chinese [35] and Norwegian [24] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-19",
"text": "Different from all these recent approaches, we use unlabeled data from the test set in a transductive setting in order to significantly increase the performance of string kernels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-20",
"text": "In our recent work [19] , we proposed two transductive learning approaches combined into a unified framework that improves the results of string kernels in two different tasks."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-21",
"text": "In this paper, we provide a formal and detailed description of our transductive algorithm and present results in cross-domain English polarity classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-23",
"text": "Related work on cross-domain text classification and string kernels is presented in Section 2."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-24",
"text": "Section 3 presents our approach to obtain domain adapted string kernels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-25",
"text": "The transductive transfer learning method is described in Section 4."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-26",
"text": "The polarity classification experiments are presented in Section 5."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-27",
"text": "Finally, we draw conclusions and discuss future work in Section 6."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-29",
"text": "**RELATED WORK**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-31",
"text": "**CROSS-DOMAIN CLASSIFICATION**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-32",
"text": "Transfer learning (or domain adaptation) aims at building effective classifiers for a target domain when the only available labeled training data belongs to a different (source) domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-33",
"text": "Domain adaptation techniques can be roughly divided into graph-based methods [6, [31] [32] [33] , probabilistic models [30, 42] , knowledgebased models [3, 12, 16] and joint optimization frameworks [28] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-34",
"text": "The transfer learning methods from the literature show promising results in a variety of realworld applications, such as image classification [28] , text classification [14, 25, 42] , polarity classification [11, [30] [31] [32] [33] and others [8] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-35",
"text": "General transfer learning approaches."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-36",
"text": "Long et al. [28] proposed a novel transfer learning framework to model distribution adaptation and label propagation in a unified way, based on the structural risk minimization principle and the regularization theory."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-37",
"text": "Shu et al. [39] proposed a method that bridges the distribution gap between the source domain and the target domain through affinity learning, by exploiting the existence of a subset of data points in the target domain that are distributed similarly to the data points in the source domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-38",
"text": "In [37] , deep learning is employed to jointly optimize the representation, the cross-domain transformation and the target label inference in an end-to-end fashion."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-39",
"text": "More recently, Sun et al. [40] proposed an unsupervised domain adaptation method that minimizes the domain shift by aligning the second-order statistics of source and target distributions, without requiring any target labels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-40",
"text": "Chang et al. [6] proposed a framework based on using a parallel corpus to calibrate domain-specific kernels into a unified kernel for leveraging graph-based label propagation between domains."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-41",
"text": "Cross-domain text classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-42",
"text": "Joachims [25] introduced the Transductive Support Vector Machines (TSVM) framework for text classification, which takes into account a particular test set and tries to minimize the error rate for those particular test samples."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-43",
"text": "Ifrim et al. [16] presented a transductive learning approach for text classification based on combining latent variable models for decomposing the topic-word space into topic-concept and concept-word spaces, and explicit knowledge models with named concepts for populating latent variables."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-44",
"text": "Guo et al. [14] proposed a transductive subspace representation learning method to address domain adaptation for cross-lingual text classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-45",
"text": "Zhuang et al. [42] presented a probabilistic model, by which both the shared and distinct concepts in different domains can be learned by the ExpectationMaximization process which optimizes the data likelihood."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-46",
"text": "In [1] , an algorithm to adapt a classification model by iteratively learning domain-specific features from the unlabeled test data is described."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-47",
"text": "Cross-domain polarity classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-48",
"text": "In recent years, cross-domain sentiment (polarity) classification has gained popularity due to the advances in domain adaptation on one side, and to the abundance of documents from various domains available on the Web, expressing positive or negative opinion, on the other side."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-49",
"text": "Some of the general domain adaptation frameworks have been applied to polarity classification [1, 6, 42] , but there are some approaches that have been specifically designed for the cross-domain sentiment classification task [2, 11-13, 26, 30-33] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-50",
"text": "To the best of our knowledge, Blitzer et al. [2] were the first to report results on cross-domain classification proposing the structural correspondence learning (SCL) method, and its variant based on mutual information (SCL-MI)."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-51",
"text": "Pan et al. [32] proposed a spectral feature alignment (SFA) algorithm to align domain-specific words from different domains into unified clusters, using domain-independent words as a bridge."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-52",
"text": "Bollegala et al. [3] used a cross-domain lexicon creation to generate a sentiment-sensitive thesaurus (SST) that groups different words expressing the same sentiment, using unigram and bigram features as [2, 32] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-53",
"text": "Luo et al. [30] proposed a cross-domain sentiment classification framework based on a probabilistic model of the author's emotion state when writing."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-54",
"text": "An Expectation-Maximization algorithm is then employed to solve the maximum likelihood problem and to obtain a latent emotion distribution of the author."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-55",
"text": "Franco-Salvador et al. [12] combined various recent and knowledge-based approaches using a meta-learning scheme (KE-Meta)."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-56",
"text": "They performed cross-domain polarity classification without employing any domain adaptation technique."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-57",
"text": "More recently, Fern\u00e1ndez et al. [11] introduced the Distributional Correspondence Indexing (DCI) method for domain adaptation in sentiment classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-58",
"text": "The approach builds term representations in a vector space common to both domains where each dimension reflects its distributional correspondence to a highly predictive term that behaves similarly across domains."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-59",
"text": "A graph-based approach for sentiment classification that models the relatedness of different domains based on shared users and keywords is proposed in [31] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-61",
"text": "**STRING KERNELS**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-62",
"text": "In recent years, methods based on string kernels have demonstrated remarkable performance in various text classification tasks [7, 10, 13, 18, 23, 27, 34] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-63",
"text": "String kernels represent a way of using information at the character level by measuring the similarity of strings through character n-grams."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-64",
"text": "Lodhi et al. [27] used string kernels for document categorization, obtaining very good results."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-65",
"text": "String kernels were also successfully used in authorship identification [34] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-66",
"text": "More recently, various combinations of string kernels reached state-of-the-art accuracy rates in native language identification [23] and Arabic dialect identification [18] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-67",
"text": "Interestingly, string kernels have been used in cross-domain settings without any domain adaptation, obtaining impressive results."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-68",
"text": "For instance, Ionescu et al. [23] have employed string kernels in a cross-corpus (and implicitly cross-domain) native language identification experiment, improving the state-of-the-art accuracy by a remarkable 32.3%."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-69",
"text": "Gim\u00e9nez-P\u00e9rez et al. [13] have used string kernels for single-source and multi-source polarity classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-70",
"text": "Remarkably, they obtain state-of-the-art performance without using knowledge from the target domain, which indicates that string kernels provide robust results in the cross-domain setting without any domain adaptation."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-71",
"text": "Ionescu et al. [18] obtained the best performance in the Arabic Dialect Identification Shared Task of the 2017 VarDial Evaluation Campaign [41] , with an improvement of 4.6% over the second-best method."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-72",
"text": "It is important to note that the training and the test speech samples prepared for the shared task were recorded in different setups [41] , or in other words, the training and the test sets are drawn from different distributions."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-73",
"text": "Different from all these recent approaches [13, 18, 23] , we use unlabeled data from the target domain to significantly increase the performance of string kernels in cross-domain text classification, particularly in English polarity classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-75",
"text": "**TRANSDUCTIVE STRING KERNELS**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-76",
"text": "String kernels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-77",
"text": "Kernel functions [38] capture the intuitive notion of similarity between objects in a specific domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-78",
"text": "For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-79",
"text": "Various string kernel functions have been proposed to date [23, 27, 38] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-80",
"text": "Perhaps one of the most recently introduced string kernels is the histogram intersection string kernel [23] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-81",
"text": "For two strings over an alphabet \u03a3, x, y \u2208 \u03a3 * , the intersection string kernel is formally defined as follows:"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-82",
"text": "where num v (x) is the number of occurrences of n-gram v as a substring in x, and p is the length of v. The spectrum string kernel or the presence bits string kernel can be defined in a similar fashion [23] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-83",
"text": "Transductive string kernels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-84",
"text": "We present a simple and straightforward approach to produce a transductive similarity measure suitable for strings."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-85",
"text": "We take the following steps to derive transductive string kernels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-86",
"text": "For a given kernel (similarity) function k, we first build the full kernel matrix K, by including the pairwise similarities of samples from both the train and the test sets."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-87",
"text": "For a training set X = {x 1 , x 2 , ..., x m } of m samples and a test set Y = {y 1 , y 2 , ..., y n } of n samples, such that X \u2229 Y = \u2205, each component in the full kernel matrix is defined as follows:"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-88",
"text": "where z i and z j are samples from the set Z = X\u222aY = {x 1 , x 2 , ..., x m , y 1 , y 2 , ..., y n }, for all 1 \u2264 i, j \u2264 m + n. We then normalize the kernel matrix by dividing each component by the square root of the product of the two corresponding diagonal components:"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-89",
"text": "We transform the normalized kernel matrix into a radial basis function (RBF) kernel matrix as follows:K"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-90",
"text": "Each row in the RBF kernel matrixK is now interpreted as a feature vector."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-91",
"text": "In other words, each sample z i is represented by a feature vector that contains the similarity between the respective sample z i and all the samples in Z. Since Z includes the test samples as well, the feature vector is inherently adapted to the test set."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-92",
"text": "Indeed, it is easy to see that the features will be different if we choose to apply the string kernel approach on a set of test samples"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-93",
"text": "It is important to note that through the features, the subsequent classifier will have some information about the test samples at training time."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-94",
"text": "More specifically, the feature vector conveys information about how similar is every test sample to every training sample."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-95",
"text": "We next consider the linear kernel, which is given by the scalar product between the new feature vectors."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-96",
"text": "To obtain the final linear kernel matrix, we simply need to compute the product between the RBF kernel matrix and its transpose:"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-97",
"text": "In this way, the samples from the test set, which are included in Z, are used to obtain new (transductive) string kernels that are adapted to the test set at hand."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-99",
"text": "**TRANSDUCTIVE KERNEL CLASSIFIER**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-100",
"text": "We next present a simple yet effective approach for adapting a one-versus-all kernel classifier trained on a source domain to a different target domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-101",
"text": "Our transductive kernel classifier (TKC) approach is composed of two learning iterations."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-102",
"text": "Our entire framework is formally described in Algorithm 1. Notations."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-103",
"text": "We use the following notations in the algorithm."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-104",
"text": "Sets, arrays and matrices are written in capital letters."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-105",
"text": "All collection types are considered to be indexed starting from position 1."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-106",
"text": "The elements of a set S are denoted by s i , the elements of an array A are alternatively denoted by A(i) or A i , and the elements of a matrix M are denoted by M (i, j) or M ij when convenient."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-107",
"text": "The sequence 1, 2, ..., n is denoted by 1 : n. We use sequences to index arrays or matrices as well."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-108",
"text": "For example, for an array A and two integers i and j, A(i : j) denotes the sub-array (A i , A i+1 , ..., A j )."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-109",
"text": "In a similar manner, M (i : j, k : l) denotes a sub-matrix of the matrix M , while M (i, :) returns the i-th row of M and M (:, j) returns the j-th column of M. The zero matrix of m \u00d7 n components is denoted Algorithm 1: Transductive Kernel Algorithm 1 Input: 2 X = (X, T ) = {(xi, ti) | xi \u2208 R q , ti \u2208 {1, 2, ..., c}, i \u2208 {1, 2, ..., m}} -the training set of m training samples and associated class labels; 3 Y = {yi | yi \u2208 R q , i \u2208 {1, 2, ..., n}} -the set of n test samples;"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-110",
"text": "4 k -a kernel function; 5 r -the number of test samples to be added in the second round of training; 6 C -a binary kernel classifier."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-111",
"text": "7 Domain-Adapted Kernel Matrix Computation Steps: 8 Z \u2190 {x1, x2, ..., xm, y1, y2, ..., yn}; 9 K \u2190 0m+n;K \u2190 0m+n;K \u2190 0m+n;K \u2190 0m+n; 10 for zi \u2208 Z do 11 for zj \u2208 Z do 12 Kij \u2190 k(zi, zj);"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-112",
"text": "18 Transductive Kernel Classifier Steps: isort \u2190 sort S in descending order and return the sorted indexes;"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-113",
"text": "40 Output: 41 P = {pi | pi \u2208 {1, 2, ..., c}, i \u2208 {1, 2, ..., n}} -the set of predicted labels for the test samples in Y ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-114",
"text": "by 0 m,n , and the square zero matrix is denoted by 0 n ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-115",
"text": "The identity matrix is denoted by 1 n ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-116",
"text": "Algorithm description."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-117",
"text": "In steps 8-17, we compute the domain-adapted string kernel matrix, as described in the previous section."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-118",
"text": "In the first learning iteration (when s = 1), we train several classifiers to distinguish each individual class from the rest, according to the one-versus-all (OVA) scheme."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-119",
"text": "In step 27, the kernel classifier C is trained to distinguish a class from the others, assigning a dual weight to each training sample from the source domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-120",
"text": "The returned column vector of dual weights is denoted by \u03b1 and the bias value is denoted by b. The vector of weights \u03b1 contains m values, such that the weight \u03b1 i corresponds to the training sample x i ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-121",
"text": "When the test kernel matrixK test of n \u00d7 m components is multiplied with the vector \u03b1 in step 28, the result is a column vector of n positive or negative scores."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-122",
"text": "Afterwards (step 34), the test samples are sorted in order to maximize the probability of correctly predicted labels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-123",
"text": "For each test sample y i , we consider the score S i (step 32) produced by the classifier for the chosen class P i (step 31), which is selected according to the OVA scheme."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-124",
"text": "The sorting is based on the hypothesis that if the classifier associates a higher score to a test sample, it means that the classifier is more confident about the predicted label for the respective test sample."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-125",
"text": "Before the second learning iteration, a number of r test samples from the top of the sorted list are added to the training set (steps 35-39) for another round of training."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-126",
"text": "As the classifier is more confident about the predicted labels P keep of the added test samples, the chance of including noisy examples (with wrong labels) is minimized."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-127",
"text": "On the other hand, the classifier has the opportunity to learn some useful domain-specific patterns of the test domain."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-128",
"text": "We believe that, at least in the cross-domain setting, the added test samples bring more useful information than noise."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-129",
"text": "We would like to stress out that the ground-truth test labels are never used in our transductive algorithm."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-130",
"text": "Although the test samples are required beforehand, their labels are not necessary."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-131",
"text": "Hence, our approach is suitable in situations where unlabeled data from the target domain can be collected cheaply, and such situations appear very often in practice, considering the great amount of data available on the Web."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-133",
"text": "**POLARITY CLASSIFICATION**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-134",
"text": "Data set."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-135",
"text": "For the cross-domain polarity classification experiments, we use the second version of Multi-Domain Sentiment Dataset [2] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-136",
"text": "The data set contains Amazon product reviews of four different domains: Books (B), DVDs (D), Electronics (E) and Kitchen appliances (K)."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-137",
"text": "Reviews contain star ratings (from 1 to 5) which are converted into binary labels as follows: reviews rated with more than 3 stars are labeled as positive, and those with less than 3 stars as negative."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-138",
"text": "In each domain, there are 1000 positive and 1000 negative reviews."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-139",
"text": "Baselines."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-140",
"text": "We compare our approach with several methods [3, 12, 13, 15, 32, 40] in two cross-domain settings."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-141",
"text": "Using string kernels, Gim\u00e9nez-P\u00e9rez et al. [13] reported better performance than SST [3] and KE-Meta [12] in the multi-source domain setting."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-142",
"text": "In addition, we compare our approach with SFA [32] , CORAL [40] and TR-TrAdaBoost [15] in the single-source setting."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-143",
"text": "Method DEK\u2192B BEK\u2192D BDK\u2192E BDE\u2192K SST [3] 76.3 78.3 83.9 85.2 KE-Meta [12] 77.9 80.4 78.9 82.5 K 0/1 [13] 82 Table 1 ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-144",
"text": "Multi-source cross-domain polarity classification accuracy rates (in %) of our transductive approaches versus a state-of-the-art baseline based on string kernels [13] , as well as SST [3] and KE-Meta [12] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-145",
"text": "The best accuracy rates are highlighted in bold."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-146",
"text": "The marker * indicates that the performance is significantly better than the best baseline string kernel according to a paired McNemar's test performed at a significance level of 0.01."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-147",
"text": "Evaluation procedure and parameters."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-148",
"text": "We follow the same evaluation methodology of Gim\u00e9nez-P\u00e9rez et al. [13] , to ensure a fair comparison."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-149",
"text": "Furthermore, we use the same kernels, namely the presence bits string kernel (K 0/1 ) and the intersection string kernel (K \u2229 ), and the same range of character n-grams (5) (6) (7) (8) . To compute the string kernels, we used the open-source code provided by Ionescu et al. [20, 23] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-150",
"text": "For the transductive kernel classifier, we select r = 1000 unlabeled test samples to be included in the training set for the second round of training."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-151",
"text": "We choose Kernel Ridge Regression [38] as classifier and set its regularization parameter to 10 \u22125 in all our experiments."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-152",
"text": "Although Gim\u00e9nez-P\u00e9rez et al. [13] used a different classifier, namely Kernel Discriminant Analysis, we observed that Kernel Ridge Regression produces similar results (\u00b10.1%) when we employ the same string kernels."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-153",
"text": "As Gim\u00e9nez-P\u00e9rez et al. [13] , we evaluate our approach in two cross-domain settings."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-154",
"text": "In the multi-source setting, we train the models on all domains, except the one used for testing."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-155",
"text": "In the single-source setting, we train the models on one of the four domains and we independently test the models on the remaining three domains."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-156",
"text": "Results in multi-source setting."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-157",
"text": "The results for the multi-source crossdomain polarity classification setting are presented in Table 1 ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-158",
"text": "Both the transductive presence bits string kernel (K 0/1 ) and the transductive intersection kernel (K \u2229 ) obtain better results than their original counterparts."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-159",
"text": "Moreover, according to the McNemar's test [9] , the results on the DVDs, the Electronics and the Kitchen target domains are significantly better than the best baseline string kernel, with a confidence level of 0.01."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-160",
"text": "When we employ the transductive kernel classifier (TKC), we obtain even better results."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-161",
"text": "On all domains, the accuracy rates yielded by the transductive classifier are more than 1.5% better than the best baseline."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-162",
"text": "For example, on the Books domain the accuracy of the transductive classifier based on the presence bits kernel (84.1%) is 2.1% above the best baseline (82.0%) represented by the intersection string kernel."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-163",
"text": "Remark- Table 2 ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-164",
"text": "Single-source cross-domain polarity classification accuracy rates (in %) of our transductive approaches versus a state-of-the-art baseline based on string kernels [13] , as well as SFA [32] , CORAL [40] and TR-TrAdaBoost [15] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-165",
"text": "The best accuracy rates are highlighted in bold."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-166",
"text": "The marker * indicates that the performance is significantly better than the best baseline string kernel according to a paired McNemar's test performed at a significance level of 0.01."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-167",
"text": "ably, the improvements brought by our transductive string kernel approach are statistically significant in all domains."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-168",
"text": "Results in single-source setting."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-169",
"text": "The results for the single-source crossdomain polarity classification setting are presented in Table 2 ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-170",
"text": "We considered all possible combinations of source and target domains in this experiment, and we improve the results in each and every case."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-171",
"text": "Without exception, the accuracy rates reached by the transductive string kernels are significantly better than the best baseline string kernel [13] , according to the McNemar's test performed at a confidence level of 0.01."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-172",
"text": "The highest improvements (above 2.7%) are obtained when the source domain contains Books reviews and the target domain contains Kitchen reviews."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-173",
"text": "As in the multi-source setting, we obtain much better results when the transductive classifier is employed for the learning task."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-174",
"text": "In all cases, the accuracy rates of the transductive classifier are more than 2% better than the best baseline string kernel."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-175",
"text": "Remarkably, in four cases (E\u2192B, E\u2192D, B\u2192K and D\u2192K) our improvements are greater than 4%."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-176",
"text": "The improvements brought by our transductive classifier based on string kernels are statistically significant in each and every case."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-177",
"text": "In comparison with SFA [32] , we obtain better results in all but one case (K\u2192D)."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-178",
"text": "Remarkably, we surpass the other state-of-the-art approaches [15, 40] in all cases."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-179",
"text": "----------------------------------"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-180",
"text": "**CONCLUSION**"
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-181",
"text": "In this paper, we presented two domain adaptation approaches that can be used together to improve the results of string kernels in cross-domain settings."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-182",
"text": "We provided empirical evidence indicating that our framework can be successfully applied in cross-domain text classification, particularly in cross-domain English polarity classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-183",
"text": "Indeed, the polarity classification experiments demonstrate that our framework achieves better accuracy rates than other state-ofthe-art methods [3, 12, 13, 15, 32, 40] ."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-184",
"text": "By using the same parameters across all the experiments, we showed that our transductive transfer learning framework can bring significant improvements without having to fine-tune the parameters for each individual setting."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-185",
"text": "Although the framework described in this paper can be generally applied to any kernel method, we focused our work only on string kernel approaches used in text classification."
},
{
"sent_id": "c897c2ea0d641f1f35072be4a5a7d3-C001-186",
"text": "In future work, we aim to combine the proposed transductive transfer learning framework with different kinds of kernels and classifiers, and employ it for other cross-domain tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-11"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-16",
"c897c2ea0d641f1f35072be4a5a7d3-C001-17"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-18"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-62"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-69"
]
],
"cite_sentences": [
"c897c2ea0d641f1f35072be4a5a7d3-C001-11",
"c897c2ea0d641f1f35072be4a5a7d3-C001-16",
"c897c2ea0d641f1f35072be4a5a7d3-C001-17",
"c897c2ea0d641f1f35072be4a5a7d3-C001-18",
"c897c2ea0d641f1f35072be4a5a7d3-C001-62",
"c897c2ea0d641f1f35072be4a5a7d3-C001-69"
]
},
"@DIF@": {
"gold_contexts": [
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-18",
"c897c2ea0d641f1f35072be4a5a7d3-C001-19"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-73"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-141"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-152"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-171"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-183"
]
],
"cite_sentences": [
"c897c2ea0d641f1f35072be4a5a7d3-C001-18",
"c897c2ea0d641f1f35072be4a5a7d3-C001-73",
"c897c2ea0d641f1f35072be4a5a7d3-C001-141",
"c897c2ea0d641f1f35072be4a5a7d3-C001-152",
"c897c2ea0d641f1f35072be4a5a7d3-C001-171",
"c897c2ea0d641f1f35072be4a5a7d3-C001-183"
]
},
"@USE@": {
"gold_contexts": [
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-140"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-144"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-148"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-153"
],
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-164"
]
],
"cite_sentences": [
"c897c2ea0d641f1f35072be4a5a7d3-C001-140",
"c897c2ea0d641f1f35072be4a5a7d3-C001-144",
"c897c2ea0d641f1f35072be4a5a7d3-C001-148",
"c897c2ea0d641f1f35072be4a5a7d3-C001-153",
"c897c2ea0d641f1f35072be4a5a7d3-C001-164"
]
},
"@SIM@": {
"gold_contexts": [
[
"c897c2ea0d641f1f35072be4a5a7d3-C001-152"
]
],
"cite_sentences": [
"c897c2ea0d641f1f35072be4a5a7d3-C001-152"
]
}
}
},
"ABC_6b1432f4aac35e6acd8ca8770fe484_2": {
"x": [
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-73",
"text": "**EXPERIMENTS ON LANGUAGE MODELING**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-2",
"text": "This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-3",
"text": "We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-4",
"text": "Our proposed method constructs word embeddings from character n-gram embeddings and combines them with ordinary word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-5",
"text": "We demonstrate that the proposed method achieves the best perplexities on the language modeling datasets: Penn Treebank, WikiText-2, and WikiText-103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-6",
"text": "Moreover, we conduct experiments on application tasks: machine translation and headline generation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-7",
"text": "The experimental results indicate that our proposed method also positively affects these tasks."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-10",
"text": "Neural language models have played a crucial role in recent advances of neural network based methods in natural language processing (NLP)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-11",
"text": "For example, neural encoder-decoder models, which are becoming the de facto standard for various natural language generation tasks including machine translation (Sutskever, Vinyals, and Le 2014), summarization (Rush, Chopra, and Weston 2015), dialogue (Wen et al. 2015), and caption generation (Vinyals et al. 2015), can be interpreted as conditional neural language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-12",
"text": "Moreover, neural language models can be used for rescoring outputs from traditional methods, and they significantly improve the performance of automatic speech recognition (Du et al. 2016) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-13",
"text": "This implies that better neural language models improve the performance of application tasks."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-14",
"text": "In general, neural language models require word embeddings as an input (Zaremba, Sutskever, and Vinyals 2014)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-15",
"text": "However, as described by (Verwimp et al. 2017) , this approach cannot make use of the internal structure of words although the internal structure is often an effective clue for considering the meaning of a word."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-16",
"text": "For example, we can comprehend that the word 'causal' is related to 'cause' immediately because both words include the same character sequence 'caus'."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-17",
"text": "Thus, if we incorporate a method that handles the internal structure such as character information, we can improve the quality of neural language models and probably make them robust to infrequent words."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-18",
"text": "To incorporate the internal structure, (Verwimp et al. 2017) concatenated character embeddings with an input word embedding."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-19",
"text": "They demonstrated that incorporating character embeddings improved the performance of RNN language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-20",
"text": "Moreover, (Kim et al. 2016 ) and (Jozefowicz et al. 2016) applied Convolutional Neural Networks (CNN) to construct word embeddings from character embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-21",
"text": "On the other hand, in the field of word embedding construction, some previous researchers found that character n-grams are more useful than single characters (Wieting et al. 2016; Bojanowski et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-22",
"text": "In particular, (Wieting et al. 2016) demonstrated that constructing word embeddings from character n-gram embeddings outperformed the methods that construct word embeddings from character embeddings by using CNN or a Long Short-Term Memory (LSTM)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-23",
"text": "Based on their reports, in this paper, we propose a neural language model that utilizes character n-gram embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-24",
"text": "Our proposed method encodes character n-gram embeddings into a word embedding with simplified Multi-dimensional Self-attention (MS) (Shen et al. 2018)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-25",
"text": "We refer to this constructed embedding as charn-MS-vec."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-26",
"text": "The proposed method regards charn-MS-vec as an input in addition to a word embedding."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-27",
"text": "We conduct experiments on the well-known benchmark datasets: Penn Treebank, WikiText-2, and WikiText-103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-28",
"text": "Our experiments indicate that the proposed method outperforms neural language models trained with well-tuned hyperparameters and achieves state-of-the-art scores on each dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-29",
"text": "In addition, we incorporate our proposed method into a standard neural encoder-decoder model and investigate its effect on machine translation and headline generation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-30",
"text": "We indicate that the proposed method also has a positive effect on such tasks."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-32",
"text": "**RNN LANGUAGE MODEL**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-33",
"text": "In this study, we focus on RNN language models, which are widely used in the literature."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-34",
"text": "This section briefly overviews the basic RNN language model."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-35",
"text": "In language modeling, we compute joint probability by using the product of conditional probabilities."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-36",
"text": "Let w_{1:T} be a word sequence of length T, namely, w_1, ..., w_T."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-37",
"text": "We formally obtain the joint probability of word sequence w 1:T as follows:"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-38",
"text": "(1) p(w_1) is generally assumed to be 1 in this literature, i.e., p(w_1) = 1, and thus we can ignore its calculation."
},
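The chain-rule factorization above can be sketched in a few lines. Here `cond_probs` is a hypothetical list of conditional probabilities p(w_{t+1} | w_{1:t}) that a trained model would produce, one per step; it is an illustration, not the authors' code.

```python
import math

def sequence_log_prob(cond_probs):
    # Chain rule: log p(w_1..w_T) = sum_t log p(w_{t+1} | w_{1:t}).
    # p(w_1) is assumed to be 1, so it contributes log 1 = 0.
    return sum(math.log(p) for p in cond_probs)

# Toy conditional probabilities for a 4-word sequence.
lp = sequence_log_prob([0.5, 0.25, 0.1])
joint = math.exp(lp)  # 0.5 * 0.25 * 0.1 = 0.0125
```

Working in log space, as here, avoids numerical underflow for long sequences.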
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-39",
"text": "To estimate the conditional probability p(w_{t+1} | w_{1:t}), RNN language models encode the sequence w_{1:t} into a fixed-length vector and compute the probability distribution of each word from this fixed-length vector."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-40",
"text": "Let V be the vocabulary size and let P_t \u2208 R^V be the probability distribution over the vocabulary at timestep t. Moreover, let D_h be the dimension of the hidden state of an RNN and let D_e be the dimension of the embedding vectors."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-41",
"text": "Then, RNN language models predict the probability distribution P_{t+1} by the following equation:"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-42",
"text": "where W \u2208 R^{V \u00d7 D_h} is a weight matrix, b \u2208 R^V is a bias term, and E \u2208 R^{D_e \u00d7 V} is a word embedding matrix."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-43",
"text": "x_t \u2208 {0, 1}^V and h_t \u2208 R^{D_h} are a one-hot vector of an input word w_t and the hidden state of the RNN at timestep t, respectively."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-44",
"text": "We define h_t at timestep t = 0 as a zero vector, that is, h_0 = 0."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-45",
"text": "Let f(\u00b7) represent an abstract function of an RNN, which might be the LSTM, the Quasi-Recurrent Neural Network (QRNN), or any other RNN variant."
},
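As a concrete illustration of the equations above, the following toy forward step uses a plain tanh RNN as a stand-in for the abstract f(·); all sizes and the random parameters are hypothetical, and a real model would learn them (and likely use an LSTM or QRNN).

```python
import numpy as np

rng = np.random.default_rng(0)
V, D_e, D_h = 10, 4, 5                  # toy vocabulary / embedding / hidden sizes

E = rng.normal(size=(D_e, V))           # word embedding matrix
W = rng.normal(size=(V, D_h))           # output weight matrix
b = np.zeros(V)                         # bias term
Wx = rng.normal(size=(D_h, D_e))        # parameters of a plain tanh RNN,
Wh = rng.normal(size=(D_h, D_h))        # used here as a stand-in for f(.)

def softmax(z):
    z = z - z.max()                     # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def lm_step(word_id, h_prev):
    x = np.zeros(V); x[word_id] = 1.0   # one-hot x_t
    e = E @ x                           # embedding lookup (Equation 4)
    h = np.tanh(Wx @ e + Wh @ h_prev)   # h_t = f(e_t, h_{t-1})
    P = softmax(W @ h + b)              # distribution over the vocabulary
    return P, h

h = np.zeros(D_h)                       # h_0 = 0
P, h = lm_step(3, h)                    # predict P_{t+1} from one input word
```

In practice the lookup `E @ x` is implemented as row/column indexing rather than a matrix product, but the one-hot form matches the notation above.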
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-47",
"text": "**INCORPORATING CHARACTER N-GRAM EMBEDDINGS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-71",
"text": "Concretely, we use E + C instead of W in Equation 2, where C \u2208 R^{D_e \u00d7 V} contains charn-MS-vec for all words in the vocabulary."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-48",
"text": "We incorporate charn-MS-vec, which is an embedding constructed from character n-gram embeddings, into RNN language models since, as discussed earlier, previous studies revealed that we can construct better word embeddings by using character n-gram embeddings (Wieting et al. 2016; Bojanowski et al. 2017 )."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-49",
"text": "In particular, we expect charn-MS-vec to help represent infrequent words by taking advantage of the internal structure."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-50",
"text": "Figure 1 shows the overview of the proposed method using character 3-gram embeddings (char3-MS-vec)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-51",
"text": "As illustrated in this figure, our proposed method regards the sum of char3-MS-vec and the standard word embedding as an input of an RNN."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-52",
"text": "In other words, let c_t be charn-MS-vec; we replace Equation 4 with the following:"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-53",
"text": "3.1 Multi-dimensional Self-attention"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-54",
"text": "To compute c_t, we apply an encoder to character n-gram embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-55",
"text": "Previous studies demonstrated that additive composition, which computes the (weighted) sum of embeddings, is a suitable method for embedding construction (Wieting et al. 2016). Let s_i be the embedding of the i-th character n-gram, let I be the number of character n-grams extracted from the word, and let S be the matrix whose i-th column corresponds to s_i, that is, S = [s_1, ..., s_I]."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-56",
"text": "The multi-dimensional self-attention constructs the word embedding c_t by the following equations:"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-57",
"text": "where \u2299 denotes the element-wise product of vectors, W_c \u2208 R^{D_e \u00d7 D_e} is a weight matrix, [\u00b7]_j is the j-th column of a given matrix, and {\u00b7}_j is the j-th element of a given vector."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-58",
"text": "In short, Equation 7 applies the softmax function to each row of [W_c S] and extracts the i-th column as g_i."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-59",
"text": "Let us consider the case where the input word is 'the' and we use character 3-grams, as in Figure 1."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-60",
"text": "We prepare special characters '\u02c6' and '$' to represent the beginning and end of the word, respectively."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-61",
"text": "Then, 'the' is composed of three character 3-grams: '\u02c6th', 'the', and 'he$'."
},
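The boundary-marked n-gram extraction described here can be sketched with a minimal helper (an illustration, not the authors' code):

```python
def char_ngrams(word, n=3):
    # Character n-grams with '^' and '$' marking the beginning
    # and end of the word, respectively.
    marked = "^" + word + "$"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

# 'the' yields exactly the three 3-grams described above.
grams = char_ngrams("the")  # ['^th', 'the', 'he$']
```

The boundary markers let the model distinguish, e.g., a prefix 'th' from the same characters inside a word.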
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-62",
"text": "We multiply the embeddings of these 3-grams by the transformation matrix W_c and apply the softmax function to each row, as in Equation 7."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-63",
"text": "As a result of the softmax, we obtain a matrix that contains weights for each embedding."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-64",
"text": "The size of the computed matrix is identical to that of the input embedding matrix: D_e \u00d7 I. We then compute Equation 6, i.e., the weighted sum of the embeddings, and add the resulting vector to the word embedding of 'the'."
},
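A minimal NumPy sketch of Equations 6 and 7 on toy dimensions follows; the random S and W_c stand in for learned parameters, so this only illustrates the shapes and the row-wise softmax, not trained behavior.

```python
import numpy as np

rng = np.random.default_rng(1)
D_e, I = 6, 3                       # toy sizes: embedding dim, n-gram count

S = rng.normal(size=(D_e, I))       # columns s_1..s_I: n-gram embeddings
W_c = rng.normal(size=(D_e, D_e))   # weight matrix of Equation 7

# Equation 7: softmax over each ROW of [W_c S], so every embedding
# dimension j gets its own weight distribution across the I n-grams.
A = W_c @ S                          # shape D_e x I
A = A - A.max(axis=1, keepdims=True)  # numerical stability
G = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)

# Equation 6: c_t is the element-wise weighted sum of the n-gram
# embeddings, i.e. sum_i g_i (element-wise *) s_i.
c_t = (G * S).sum(axis=1)
```

Because the softmax is per dimension rather than per n-gram, each coordinate of c_t can attend to a different n-gram, which is what distinguishes multi-dimensional self-attention from standard (scalar-weight) self-attention.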
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-65",
"text": "Finally, we input the vector into an RNN to predict the next word."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-67",
"text": "**WORD TYING**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-68",
"text": "(Inan, Khosravi, and Socher 2017) and (Press and Wolf 2017) proposed a word tying method (WT) that shares the word embedding matrix (E in Equation 4) with the weight matrix to compute probability distributions (W in Equation 2)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-69",
"text": "They demonstrated that WT significantly improves the performance of RNN language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-70",
"text": "In this study, we adopt charn-MS-vec as the weight matrix in language modeling."
},
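Word tying with the proposed E + C output matrix can be sketched as follows. For simplicity this toy version stores words as rows and assumes D_e = D_h (which tying requires in the real model too); all sizes and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
V, D = 8, 4                          # toy vocabulary and embedding sizes

E = rng.normal(size=(V, D))          # word embeddings (rows are words)
C = rng.normal(size=(V, D))          # charn-MS-vec for every vocabulary word

def softmax(z):
    z = z - z.max()
    ez = np.exp(z)
    return ez / ez.sum()

def output_distribution(h, tied):
    # Word tying: the output projection reuses the (shared) embedding
    # matrix instead of a separately trained W in Equation 2.
    return softmax(tied @ h)

h = rng.normal(size=D)                   # hidden state of the RNN
P_wt = output_distribution(h, E)         # standard WT: share E
P_ours = output_distribution(h, E + C)   # proposed: E + C as the weight matrix
```

Sharing the matrix both cuts the parameter count and, as the results below suggest, acts as a regularizer.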
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-74",
"text": "We investigate the effect of charn-MS-vec on the word-level language modeling task."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-75",
"text": "In detail, we examine the following four research questions:"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-76",
"text": "1. Can character n-gram embeddings improve the performance of state-of-the-art RNN language models?"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-77",
"text": "2. Do character n-gram embeddings have a positive effect on infrequent words?"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-78",
"text": "3. Is multi-dimensional self-attention effective for word embedding construction as compared with several other similar conventional methods?"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-79",
"text": "4. What value of n should we use?"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-81",
"text": "**DATASETS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-82",
"text": "We used the standard benchmark datasets for word-level language modeling: Penn Treebank (PTB) (Marcus, Marcinkiewicz, and Santorini 1993), WikiText-2 (WT2), and WikiText-103 (WT103)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-83",
"text": "(Mikolov et al. 2010) and later work published pre-processed versions of PTB, WT2, and WT103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-84",
"text": "Following the previous studies, we used these pre-processed datasets for our experiments."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-85",
"text": "Table 1 describes the statistics of the datasets."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-86",
"text": "Table 1 demonstrates that the vocabulary size of WT103 is too large, and thus it is impractical to compute charn-MS-vec for all words in the vocabulary."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-88",
"text": "**BASELINE RNN LANGUAGE MODEL**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-89",
"text": "For the base RNN language models, we adopted the state-of-the-art LSTM language model (Merity, Keskar, and Socher 2018b) for PTB and WT2, and the QRNN for WT103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-90",
"text": "(Melis, Dyer, and Blunsom 2018) demonstrated that the standard LSTM trained with appropriate hyperparameters outperformed various architectures such as Recurrent Highway Networks (RHN) (Zilly et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-91",
"text": "In addition to several regularizations, (Merity, Keskar, and Socher 2018b) introduced Averaged Stochastic Gradient Descent (ASGD) (Polyak and Juditsky 1992) to train the 3-layered LSTM language model."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-92",
"text": "As a result, their ASGD Weight-Dropped LSTM (AWD-LSTM) achieved state-of-the-art results on PTB and WT2."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-93",
"text": "For WT103, (Merity, Keskar, and Socher 2018a) achieved the top score with the 4-layered QRNN."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-94",
"text": "Thus, we used AWD-LSTM for PTB and WT2, and QRNN for WT103 as the base language models, respectively."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-95",
"text": "We used their implementations for our experiments."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-96",
"text": "Table 2 shows perplexities of the baselines and the proposed method."
},
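For reference, the perplexity reported throughout these tables is the exponential of the average negative log-likelihood per word; a model that spreads probability uniformly over k choices has perplexity exactly k:

```python
import math

def perplexity(log_probs):
    # Perplexity = exp of the average negative log-likelihood per word.
    return math.exp(-sum(log_probs) / len(log_probs))

# A model assigning probability 1/4 to every word of a 4-word
# sequence has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)  # -> 4.0
```

Lower perplexity therefore means the model concentrates more probability mass on the words that actually occur.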
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-97",
"text": "We varied n for charn-MS-vec from 2 to 4."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-98",
"text": "For the baseline, we also applied two word embeddings to investigate the performance in the case where we use more kinds of word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-99",
"text": "In detail, we prepared E_1, E_2 \u2208 R^{D_e \u00d7 V} and used E_1 + E_2 instead of E in Equation 4."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-100",
"text": "Table 2 also shows the number of character n-grams in each dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-101",
"text": "This table indicates that charn-MS-vec improved the performance of state-of-the-art models except for char4-MS-vec on WT103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-102",
"text": "These results indicate that charn-MS-vec can raise the quality of word-level language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-103",
"text": "In particular, Table 2 shows that char3-MS-vec achieved the best scores consistently."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-104",
"text": "In contrast, an additional word embedding did not improve the performance."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-105",
"text": "This fact implies that the improvement of charn-MS-vec is caused by using character n-grams."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-106",
"text": "Thus, we answer yes to the first research question."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-107",
"text": "Table 3 shows the training time spent on each epoch."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-108",
"text": "We calculated it on the NVIDIA Tesla P100."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-109",
"text": "Table 3 indicates that, unfortunately, the proposed method requires more computational time than the baseline."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-110",
"text": "We leave exploring a faster structure for our future work."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-111",
"text": "Table 4 shows perplexities on the PTB dataset where the frequency of an input word is lower than 2,000 in the training data."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-112",
"text": "This table indicates that the proposed method can improve the performance even if an input word is infrequent."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-113",
"text": "In other words, charn-MS-vec helps represent the meanings of infrequent words."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-114",
"text": "Therefore, we answer yes to the second research question in the case of our experimental settings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-116",
"text": "**RESULTS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-117",
"text": "We explored the effectiveness of multi-dimensional self-attention for word embedding construction."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-118",
"text": "Table 5 shows perplexities of using several encoders on the PTB dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-119",
"text": "As in (Kim et al. 2016), we applied CNN to construct word embeddings (charCNN in Table 5). In addition, we applied"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-120",
"text": "summation and standard self-attention, which computes the scalar value as a weight for a character n-gram embedding, to construct word embeddings (charn-Sum-vec and charn-SS-vec, respectively)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-121",
"text": "For CNN, we used hyperparameters identical to (Kim et al. 2016) (\"Original Settings\" in Table 5), but this setting differs from the other architectures in two ways:"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-122",
"text": "1. The dimension of the computed vectors is much larger than that of the baseline word embeddings, and"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-123",
"text": "2. The dimension of the input character embeddings is much smaller than that of the baseline word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-124",
"text": "Therefore, we added two configurations: setting the dimension of the computed vectors, and of the input character embeddings, to the same value as the baseline word embeddings (\"Small CNN result dims\" and \"Large embedding dims\" in Table 5, respectively)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-125",
"text": "Table 5 shows that the proposed charn-MS-vec outperformed charCNN even though the original setting of charCNN had many more parameters."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-126",
"text": "Moreover, we trained charCNN with the two additional settings, but CNN still did not improve on the baseline performance."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-127",
"text": "This result implies that charn-MS-vec provides better embeddings than those constructed by applying CNN to character embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-128",
"text": "Table 5: Perplexities of each structure on the PTB dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-129",
"text": "This suggests that the use of multi-dimensional self-attention is more appropriate for constructing word embeddings from character n-gram embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-130",
"text": "Table 5 also shows that excluding C from word tying (\"Exclude C from word tying\") achieved almost the same score as the baseline."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-131",
"text": "Moreover, this table indicates that performance degrades as the number of parameters increases."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-132",
"text": "Thus, we need to assign C to word tying to prevent over-fitting for the PTB dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-133",
"text": "In addition, this result implies that the performance of WT103 in Table 2 might be raised if we can apply word tying to WT103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-134",
"text": "Moreover, to investigate the effect of only charn-MS-vec, we ignore Ex_t in Equation 5."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-135",
"text": "We refer to this setting as \"Remove word embeddings E\" in Table 5 ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-136",
"text": "Table 5 shows that char3-MS-vec and char4-MS-vec are superior to char2-MS-vec."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-137",
"text": "In terms of perplexity, char3-MS-vec and char4-MS-vec achieved scores comparable to each other."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-138",
"text": "On the other hand, char3-MS-vec has far fewer parameters."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-139",
"text": "Furthermore, we decreased the embedding size D e to adjust the number of parameters to the same size as the baseline (\"Same #Params as baseline\" in Table 5 )."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-140",
"text": "In this setting, char3-MS-vec achieved the best perplexity."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-141",
"text": "Therefore, we consider that char3-MS-vec is more useful than char4-MS-vec, which is the answer to the fourth research question."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-142",
"text": "We use the combination of char3-MS-vec c_t and the word embedding Ex_t in the following experiments."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-143",
"text": "Finally, we compare the proposed method with the published scores reported in previous studies."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-144",
"text": "Tables 6, 7, and 8, respectively, show perplexities of the proposed method and previous studies on PTB, WT2, and WT103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-145",
"text": "Since AWD-LSTM-MoS (Yang et al. 2018) and AWD-LSTM-DOC (Takase, Suzuki, and Nagata 2018) achieved the state-of-the-art scores on PTB and WT2, we combined char3-MS-vec with them."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-146",
"text": "These tables show that the proposed method improved the performance of the base model and outperformed the state-of-the-art scores on all datasets."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-147",
"text": "In particular, char3-MS-vec improved perplexity by at least 1 point over the current best score on the WT103 dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-149",
"text": "**EXPERIMENTS ON APPLICATIONS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-150",
"text": "As described in Section 1, neural encoder-decoder models can be interpreted as conditional neural language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-151",
"text": "Therefore, to investigate if the proposed method contributes to encoder-decoder models, we conduct experiments on machine translation and headline generation tasks."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-153",
"text": "**DATASETS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-154",
"text": "For machine translation, we used two kinds of language pairs: English-French and English-German sentences in the IWSLT 2016 dataset."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-155",
"text": "The dataset contains about 208K English-French pairs and 189K English-German pairs."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-156",
"text": "We conducted four translation tasks: from English to each language (En-Fr and En-De), and their reverses (Fr-En and De-En)."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-157",
"text": "For headline generation, we used sentence-headline pairs extracted from the annotated English Gigaword corpus (Napoles, Gormley, and Van Durme 2012) in the same manner as (Rush, Chopra, and Weston 2015) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-158",
"text": "The training set contains about 3.8M sentence-headline pairs."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-159",
"text": "For evaluation, we exclude the test set constructed by (Rush, Chopra, and Weston 2015) because it contains some invalid instances, as reported in (Zhou et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-160",
"text": "We instead used the test sets constructed by (Zhou et al. 2017) and (Kiyono et al. 2017 )."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-162",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-163",
"text": "We employed the neural encoder-decoder with attention mechanism described in (Kiyono et al. 2017 ) as the base model."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-164",
"text": "Its encoder consists of a 2-layer bidirectional LSTM and its decoder consists of a 2-layer LSTM with attention mechanism proposed by (Luong, Pham, and Manning 2015) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-165",
"text": "We refer to this neural encoder-decoder as EncDec."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-166",
"text": "To investigate the effect of the proposed method, we introduced char3-MS-vec into EncDec."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-167",
"text": "Here, we applied char3-MS-vec to both the encoder and decoder."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-168",
"text": "Moreover, we did not apply the word tying technique to EncDec because this is the default setting in a widely used encoder-decoder implementation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-169",
"text": "We set the embedding size and dimension of the LSTM hidden state to 500 for machine translation and 400 for headline generation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-170",
"text": "The mini-batch size is 64 for machine translation and 256 for headline generation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-171",
"text": "For other hyperparameters, we followed the configurations described in (Kiyono et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-172",
"text": "We constructed the vocabulary set by using BytePair-Encoding 9 (BPE) (Sennrich, Haddow, and Birch 2016) because BPE is a currently widely-used technique for vocabulary construction."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-173",
"text": "We set the number of BPE merge operations to 16K for machine translation and 5K for headline generation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-174",
"text": "Tables 9 and 10 show the results of machine translation and headline generation, respectively."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-175",
"text": "These tables show that EncDec+char3-MS-vec outperformed EncDec in all test data."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-176",
"text": "In other words, these results indicate that our proposed method also has a positive effect on the neural encoderdecoder model."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-177",
"text": "Moreover, it is noteworthy that char3-MS-vec improved the performance of EncDec even though the vocabulary set constructed by BPE contains subwords."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-178",
"text": "This implies that character n-gram embeddings improve the quality of not only word embeddings but also subword embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-179",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-180",
"text": "**RESULTS**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-181",
"text": "In addition to the results of our implementations, the lower portion of Table 10 contains results reported in previous studies."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-182",
"text": "Table 10 shows that EncDec+char3-MS-vec also outperformed the methods proposed in previous studies."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-183",
"text": "Therefore, EncDec+char3-MS-vec achieved the top scores in the test sets constructed by (Zhou et al. 2017) and (Kiyono et al. 2017) even though it does not have a task-specific architecture such as the selective gate proposed by (Zhou et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-184",
"text": "In these experiments, we only applied char3-MS-vec to EncDec but (Morishita, Suzuki, and Nagata 2018) indicated that combining multiple kinds of subword units can improve the performance."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-185",
"text": "We will investigate the effect of combining several character n-gram embeddings in future work."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-186",
"text": "6 Related Work RNN Language Model (Mikolov et al. 2010 ) introduced RNN into language modeling to handle arbitrary-length sequences in computing conditional probability p(w t+1 |w 1:t )."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-187",
"text": "They demonstrated that the RNN language model outperformed the Kneser-Ney smoothed 5-gram language model (Chen and Goodman 1996) , which is a sophisticated n-gram language model."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-188",
"text": "(Zaremba, Sutskever, and Vinyals 2014) drastically improved the performance of language modeling by applying LSTM and the dropout technique (Srivastava et al. 2014) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-189",
"text": "(Zaremba, Sutskever, and Vinyals 2014) applied dropout to all the connections except for recurrent connections but (Gal and Ghahramani 2016) proposed variational inference based dropout to regularize recurrent connections."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-190",
"text": "(Melis, Dyer, and Blunsom 2018) demonstrated that the standard LSTM can achieve superior performance by selecting appropriate hyperparameters."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-191",
"text": "Finally, (Merity, Keskar, and Socher 2018b) introduced DropConnect (Wan et al. 2013 ) and averaged SGD (Polyak and Juditsky 1992) into the LSTM language model and achieved state-of-the-art perplexities on PTB and WT2."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-192",
"text": "For WT103, (Merity, Keskar, and Socher 2018a) Table 10 : ROUGE F1 scores on the headline generation test sets provided by (Zhou et al. 2017) and (Kiyono et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-193",
"text": "The upper part is the results of our implementation and the lower part shows the scores reported in previous studies."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-194",
"text": "In the upper part, we report the average score of 3 runs."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-195",
"text": "that QRNN , which is a faster architecture than LSTM, achieved the best perplexity."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-196",
"text": "Our experimental results show that the proposed charn-MS-vec improved the performance of these state-of-the-art language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-197",
"text": "(Yang et al. 2018) explained that the training of RNN language models can be interpreted as matrix factorization."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-198",
"text": "In addition, to raise an expressive power, they proposed Mixture of Softmaxes (MoS) that computes multiple probability distributions from a final RNN layer and combines them with a weighted average."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-199",
"text": "(Takase, Suzuki, and Nagata 2018) proposed Direct Output Connection (DOC) that is a generalization of MoS. They used middle layers in addition to the final layer to compute probability distributions."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-200",
"text": "These methods (AWD-LSTM-MoS and AWD-LSTM-DOC) achieved the current state-of-the-art perplexities on PTB and WT2."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-201",
"text": "Our proposed method can also be combined with MoS and DOC."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-202",
"text": "In fact, Tables 6 and 7 indicate that the proposed method further improved the performance of them."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-203",
"text": "( Kim et al. 2016 ) introduced character information into RNN language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-204",
"text": "They applied CNN to character embeddings for word embedding construction."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-205",
"text": "Their proposed method achieved perplexity competitive with the basic LSTM language model (Zaremba, Sutskever, and Vinyals 2014) even though its parameter size is small."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-206",
"text": "(Jozefowicz et al. 2016 ) also applied CNN to construct word embeddings from character embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-207",
"text": "They indicated that CNN also positively affected the LSTM language model in a huge corpus."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-208",
"text": "(Verwimp et al. 2017 ) proposed a method concatenating character embeddings with a word embedding to use character information."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-209",
"text": "In contrast to these methods, we used character n-gram embeddings to construct word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-210",
"text": "To compare the proposed method to these methods, we combined the CNN proposed by (Kim et al. 2016 ) with the state-of-the-art LSTM language model (AWD-LSTM) (Merity, Keskar, and Socher 2018b) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-211",
"text": "Our experimental results indicate that the proposed method outperformed the method using character embeddings (charCNN in Table 5 )."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-212",
"text": "Some previous studies focused on boosting the performance of language models during testing (Grave, Joulin, and Usunier 2017; Krause et al. 2017) ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-213",
"text": "For example, (Krause et al. 2017) proposed dynamic evaluation that updates model parameters based on the given correct sequence during evaluation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-214",
"text": "Although these methods might further improve our proposed language model, we omitted these methods since it is unreasonable to obtain correct outputs in applications such as machine translation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-215",
"text": "Embedding Construction Previous studies proposed various methods to construct word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-216",
"text": "(Luong, Socher, and Manning 2013) applied Recursive Neural Networks to construct word embeddings from morphemic embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-217",
"text": "(Ling et al. 2015) applied bidirectional LSTMs to character embeddings for word embedding construction."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-218",
"text": "On the other hand, (Bojanowski et al. 2017 ) and (Wieting et al. 2016) focused on character n-gram."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-219",
"text": "They demonstrated that the sum of character n-gram embeddings outperformed ordinary word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-220",
"text": "In addition, (Wieting et al. 2016) found that the sum of character n-gram embeddings also outperformed word embeddings constructed from character embeddings with CNN and LSTM."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-221",
"text": "As an encoder, previous studies argued that additive composition, which computes the (weighted) sum of embeddings, is a suitable method theoretically (Tian, Okazaki, and Inui 2016) and empirically (Muraoka et al. 2014; ."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-222",
"text": "In this paper, we used multidimensional self-attention to construct word embeddings because it can be interpreted as an element-wise weighted sum."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-223",
"text": "Through experiments, we indicated that multi-dimensional self-attention is superior to the summation and standard selfattention as an encoder."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-224",
"text": "----------------------------------"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-225",
"text": "**CONCLUSION**"
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-226",
"text": "In this paper, we incorporated character information with RNN language models."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-227",
"text": "Based on the research in the field of word embedding construction (Wieting et al. 2016) , we focused on character n-gram embeddings to construct word embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-228",
"text": "We used multi-dimensional self-attention (Shen et al. 2018 ) to encode character n-gram embeddings."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-229",
"text": "Our proposed charn-MS-vec improved the performance of stateof-the-art RNN language models and achieved the best perplexities on Penn Treebank, WikiText-2, and WikiText-103."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-230",
"text": "Moreover, we investigated the effect of charn-MS-vec on application tasks, specifically, machine translation and headline generation."
},
{
"sent_id": "6b1432f4aac35e6acd8ca8770fe484-C001-231",
"text": "Our experiments show that charn-MS-vec also improved the performance of a neural encoder-decoder on both tasks."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"6b1432f4aac35e6acd8ca8770fe484-C001-3"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-21",
"6b1432f4aac35e6acd8ca8770fe484-C001-22",
"6b1432f4aac35e6acd8ca8770fe484-C001-23"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-48"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-145"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-191"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-227"
]
],
"cite_sentences": [
"6b1432f4aac35e6acd8ca8770fe484-C001-3",
"6b1432f4aac35e6acd8ca8770fe484-C001-21",
"6b1432f4aac35e6acd8ca8770fe484-C001-22",
"6b1432f4aac35e6acd8ca8770fe484-C001-48",
"6b1432f4aac35e6acd8ca8770fe484-C001-145",
"6b1432f4aac35e6acd8ca8770fe484-C001-191",
"6b1432f4aac35e6acd8ca8770fe484-C001-227"
]
},
"@BACK@": {
"gold_contexts": [
[
"6b1432f4aac35e6acd8ca8770fe484-C001-11"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-14",
"6b1432f4aac35e6acd8ca8770fe484-C001-15"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-18"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-21",
"6b1432f4aac35e6acd8ca8770fe484-C001-22"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-55"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-205"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-218"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-220"
]
],
"cite_sentences": [
"6b1432f4aac35e6acd8ca8770fe484-C001-11",
"6b1432f4aac35e6acd8ca8770fe484-C001-14",
"6b1432f4aac35e6acd8ca8770fe484-C001-15",
"6b1432f4aac35e6acd8ca8770fe484-C001-18",
"6b1432f4aac35e6acd8ca8770fe484-C001-21",
"6b1432f4aac35e6acd8ca8770fe484-C001-22",
"6b1432f4aac35e6acd8ca8770fe484-C001-55",
"6b1432f4aac35e6acd8ca8770fe484-C001-205",
"6b1432f4aac35e6acd8ca8770fe484-C001-218",
"6b1432f4aac35e6acd8ca8770fe484-C001-220"
]
},
"@MOT@": {
"gold_contexts": [
[
"6b1432f4aac35e6acd8ca8770fe484-C001-14",
"6b1432f4aac35e6acd8ca8770fe484-C001-15",
"6b1432f4aac35e6acd8ca8770fe484-C001-16",
"6b1432f4aac35e6acd8ca8770fe484-C001-17"
],
[
"6b1432f4aac35e6acd8ca8770fe484-C001-220",
"6b1432f4aac35e6acd8ca8770fe484-C001-221",
"6b1432f4aac35e6acd8ca8770fe484-C001-222"
]
],
"cite_sentences": [
"6b1432f4aac35e6acd8ca8770fe484-C001-14",
"6b1432f4aac35e6acd8ca8770fe484-C001-15",
"6b1432f4aac35e6acd8ca8770fe484-C001-220"
]
}
}
},
"ABC_97f8d0af85eda3e453fc4fb00819f0_2": {
"x": [
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-180",
"text": "al., 2016; Melamud et al., 2016) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-9",
"text": "Finally, the limited sense coverage in the annotated datasets is a major limitation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-2",
"text": "LSTM-based language models have been shown effective in Word Sense Disambiguation (WSD)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-3",
"text": "In particular, the technique proposed by Yuan et al. (2016) returned state-of-the-art performance in several benchmarks, but neither the training data nor the source code was released."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-4",
"text": "This paper presents the results of a reproduction study and analysis of this technique using only openly available datasets (GigaWord, SemCor, OMSTI) and software (TensorFlow)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-5",
"text": "Our study showed that similar results can be obtained with much less data than hinted at by Yuan et al. (2016) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-6",
"text": "Detailed analyses shed light on the strengths and weaknesses of this method."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-7",
"text": "First, adding more unannotated training data is useful, but is subject to diminishing returns."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-10",
"text": "All code and trained models are made freely available."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-8",
"text": "Second, the model can correctly identify both popular and unpopular meanings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-13",
"text": "Word Sense Disambiguation (WSD) is a long-established task in the NLP community (see Navigli (2009) for a survey) which goal is to annotate lemmas in text with the most appropriate meaning from a lexical database like WordNet (Fellbaum, 1998) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-14",
"text": "Many approaches have been proposed -the more popular ones include the usage of Support Vector Machine (SVM) (Zhong and Ng, 2010) , SVM combined with unsupervised trained embeddings (Iacobacci et al., 2016; Rothe and Sch\u00fctze, 2017) , and graph-based approaches (Agirre et al., 2014; Weissenborn et al., 2015) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-15",
"text": "In recent years, there has been a surge in interest in using Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) to perform WSD (Raganato et al., 2017b; Melamud et al., 2016) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-16",
"text": "These approaches are characterized by their high performance, simplicity and their ability to extract a lot of information from raw text."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-17",
"text": "Among the best-performing ones is the approach by Yuan et al. (2016) , in which an LSTM language model trained on a corpus with 100 billion tokens was coupled with small sense-annotated datasets to achieve state-of-the-art performance in all-words WSD."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-18",
"text": "Even though the results obtained by Yuan et al. (2016) outperform the previous state-of-the-art, neither the used datasets nor the constructed models are available to the community."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-19",
"text": "This is unfortunate because this makes the re-application of this technique a non-trivial process, and it hinders further studies for understanding which limitations prevent even higher accuracies."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-20",
"text": "These could be, for instance, of algorithmic nature or relate to the input (either size or quality), and a deeper understanding is crucial for enabling further improvements."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-21",
"text": "In addition, some details are not reported, and this could prevent other attempts from replicating the results."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-22",
"text": "To address these issues, we reimplemented Yuan et al. (2016) 's method with the goal of: 1) reproducing and making available the code, trained models, and results and 2) understanding which are the main factors that constitute the strengths and weaknesses of this method."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-23",
"text": "While a full replication is not possible due to the unavailability of the original data, we nevertheless managed to reproduce their approach with other public text corpora, and this allowed us to perform a deeper investigation on the performance of this technique."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-24",
"text": "This investigation aimed at understanding how sensitive the WSD approach is w.r.t."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-25",
"text": "the amount of unannotated data (i.e., raw text) used for training, model complexity, how biased the method is towards the choice of the most frequent senses (MFS), and identifying limitations that cannot be overcome with bigger unannotated datasets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-26",
"text": "The contribution of this paper is thus two-fold: On the one hand, we present a reproduction study whose results are publicly available and hence can be freely used by the community."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-27",
"text": "Notice that the lack of available models has been explicitly mentioned, in a recent work, as the cause for the missing comparison of this technique with other competitors (Raganato et al., 2017b, footnote 10) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-28",
"text": "On the other hand, we present other experiments to shed more light on the value of this and similar methods."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-29",
"text": "We anticipate some conclusions."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-30",
"text": "First, a positive result is that we were able to reproduce the method from Yuan et al. (2016) and obtain similar results to the ones originally published."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-31",
"text": "However, to our surprise, these results were obtained using a much smaller corpus of 1.8 billion tokens (Gigaword), which is less than 2% of the data used in the original study."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-79",
"text": "The main intuition is that words used with the same sense are mentioned in contexts which are very similar to each other as well."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-32",
"text": "In addition, we observe that the amount of unannotated data is important, but that the relationship between its size and the improvement is not linear, meaning that exponentially more unannotated data is needed in order to improve the performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-33",
"text": "Moreover, we show that the percentage of correct sense assignments is more balanced w.r.t sense popularity, meaning that the system has a less-strong bias towards the most-frequent sense (MFS) and is better at recognizing both popular and unpopular meanings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-34",
"text": "Finally, we show that the limited sense coverage in the annotated datasets is a major limitation, as shown by the fact that resulting model does not have a representation for more than 30% of the meanings which should have been considered for disambiguating the test sets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-36",
"text": "**BACKGROUND**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-37",
"text": "Current WSD systems can be categorized according to two dimensions: whether they use raw text without any preassigned meaning (unannotated data henceforth), and whether they exploit the relations between synsets in WordNet (synset relations henceforth)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-38",
"text": "One prominent state-of-the-art system that does not rely on unannotated data nor exploits synset relations is It Makes Sense (IMS) (Zhong and Ng, 2010; Taghipour and Ng, 2015) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-39",
"text": "This system uses an SVM to train classifiers for each lemma using only annotated data as training evidence."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-40",
"text": "In contrast, graph-based WSD systems do not use (un)annotated data but rely on the synset relations."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-41",
"text": "The system UKB (Agirre et al., 2014) represents WordNet as a graph where the synsets are the nodes and the relations are the edges."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-42",
"text": "After the node weights have been initialized using the Personalized Page Rank algorithm, they are updated depending on context information."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-43",
"text": "Then, the synset with the highest weight is chosen."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-44",
"text": "Babelfy (Moro et al., 2014) and the system by Weissenborn et al. (2015) both represent the whole input document as a graph with synset relations as edges and jointly disambiguate nouns and verbs."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-45",
"text": "In the case of Babelfy, a densest-subgraph heuristic is used to compute the high-coherence semantic interpretations of the text."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-46",
"text": "Instead, in Weissenborn et al. (2015) a set of complementary objectives, which include sense probabilities and type classification, are combined together to perform WSD."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-47",
"text": "A number of systems make use of both unannotated data and synset relations."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-48",
"text": "Both Tripodi and Pelillo (2017) and Camacho-Collados et al. (2016) make use of statistical information from unannotated data to weigh the relevance of nodes in a graph, which is then used to perform WSD."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-49",
"text": "Rothe and Sch\u00fctze (2017) use word embeddings as a starting point and then rely on the formal constraints in a lexical resource to create synset embeddings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-50",
"text": "Recently, there has been a surge in WSD approaches that use unannotated data but do not consider synset relations."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-51",
"text": "One example is provided by Iacobacci et al. (2016) , who investigated the role of word embeddings as features in a WSD system."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-52",
"text": "Four methods (concatenation, average, fractional decay, and exponential decay) are used to extract features from the sentential context using word embeddings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-53",
"text": "The features are then added to the default feature set of IMS (Zhong and Ng, 2010) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-54",
"text": "Moreover, Raganato et al. (2017b) present a number of end-to-end neural WSD architectures."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-55",
"text": "The best performing one is based on a bidirectional Long Short-Term Memory (BLSTM) with attention and two auxiliary loss functions (part-of-speech and the WordNet coarse-grained semantic labels)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-56",
"text": "Melamud et al. (2016) also make use of unannotated data to train a BLSTM."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-57",
"text": "The work by Yuan et al. (2016) , which we consider in this paper, belongs to this last category."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-58",
"text": "Different from Melamud et al. (2016) , it uses significantly more unannotated data, the model contains more hidden units (2048 vs. 600), and the sense assignment is more elaborated."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-59",
"text": "We describe this approach in more detail in the following section."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-61",
"text": "**WSD WITH LANGUAGE MODELS**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-62",
"text": "The method proposed by Yuan et al. (2016) performs WSD by annotating each lemma in a text with one WordNet synset that is associated with its meaning."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-63",
"text": "Broadly speaking, the disambiguation is done by: 1) constructing a language model from a large unannotated dataset; 2) extracting sense embeddings from this model using a much smaller annotated dataset; 3) relying on the sense embeddings to make predictions on the lemmas in unseen sentences."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-64",
"text": "Each operation is described below."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-65",
"text": "Constructing Language Models."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-66",
"text": "Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997 ) is a celebrated recurrent neural network architecture that has proven to be effective in many natural language processing tasks (Sutskever et al., 2014; Dyer et al., 2015; He et al., 2017, among others) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-67",
"text": "Different from previous architectures, LSTM is equipped with trainable gates that control the flow of information, allowing the neural networks to learn both short-and long-range dependencies."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-68",
"text": "In Yuan et al. (2016) , the first operation consists of constructing an LSTM language model to capture the meaning of words in context."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-69",
"text": "They use an LSTM network with a single hidden layer of h nodes."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-70",
"text": "Given a sentence s = (w 1 , w 2 , . . . , w n ), they replace word w k (1 \u2264 k \u2264 n) by a special token $."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-71",
"text": "The model takes this new sentence as input and produces a context vector c of dimensionality p (see Figure 1 )."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-72",
"text": "1 Each word w in the vocabulary V is associated with an embedding \u03c6 o (w) of the same dimensionality."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-73",
"text": "The model is trained to predict the omitted word, minimizing the softmax loss over a big collection D of sentences."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-74",
"text": "After the model is trained, we can use it to extract context embeddings, i.e., latent numerical representations of the sentence surrounding a given word."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-75",
"text": "Calculating Sense Embeddings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-76",
"text": "The model produced by the LSTM network is meant to capture the \"meaning\" of words in the context they are mentioned."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-77",
"text": "In order to perform the sense disambiguation, we need to extract from it a suitable representation for word senses."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-78",
"text": "To this purpose, the method relies on another corpus where each word is annotated with the corresponding sense."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-80",
"text": "This suggests a simple way to calculate sense embeddings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-81",
"text": "First, the LSTM model is invoked to compute the context vector for each occurrence of one sense in the annotated dataset."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-82",
"text": "Once all context vectors are computed, the sense embedding is defined as the average of all vectors."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-83",
"text": "Let us assume, for instance, that the sense horse 2 n (that is, the second sense of horse as a noun) appears in the two sentences:"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-84",
"text": "(1)"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-85",
"text": "The move of the horse 2 n to the corner forced the checkmate."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-86",
"text": "(2) Karjakin makes up for his lost bishop a few moves later, trading rooks and winning black's horse 2 n ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-87",
"text": "In this case, the method will replace the sense by $ in the sentences and feed them to the trained LSTM model to calculate two context vectors c 1 and c 2 ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-88",
"text": "The sense embedding s horse 2 n is then computed as:"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-89",
"text": "This procedure is computed for every sense that appears in the annotated corpus."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-90",
"text": "Averaging technique to predict senses."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-91",
"text": "After all sense embeddings are computed, the method is ready to disambiguate target words."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-92",
"text": "This procedure proceeds as follows:"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-93",
"text": "1. Given an input sentence and a target word, it replaces the occurrence of the target word by $ and uses the LSTM model to predict a context vector c_t."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-94",
"text": "2. The lemma of the target word is used to retrieve from WordNet the candidate synsets s_1, ..., s_n, where n is the number of synsets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-97",
"text": "Then, the procedure looks up the corresponding sense embeddings s_1, ..., s_n computed in the previous step."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-99",
"text": "3. The procedure invokes a subroutine to choose one of the n senses for the context vector c_t."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-100",
"text": "It selects the sense whose embedding is closest to c_t, using cosine similarity."
},
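Steps 1-3 of this nearest-sense selection can be sketched as follows; `nearest_sense` and the toy embeddings are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_sense(context_vec, candidate_embs):
    # candidate_embs: sense id -> sense embedding for the target lemma.
    # Return the candidate whose embedding is closest to the context vector.
    return max(candidate_embs, key=lambda s: cosine(context_vec, candidate_embs[s]))

# Hypothetical context vector and two candidate sense embeddings:
c_t = np.array([0.9, 0.1])
pred = nearest_sense(c_t, {"horse^1_n": np.array([0.0, 1.0]),
                           "horse^2_n": np.array([1.0, 0.2])})
```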
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-101",
"text": "Label Propagation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-102",
"text": "Yuan et al. (2016) argue that the averaging procedure is suboptimal for two reasons."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-103",
"text": "First, the distribution of occurrences of senses is unknown whereas averaging is only suitable for spherical clusters."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-104",
"text": "Second, averaging reduces the representation of the occurrences of each sense to a single vector and therefore ignores the sense priors."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-105",
"text": "For this reason, they propose to use label propagation for inference as an alternative to averaging."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-106",
"text": "Label propagation (Zhu and Ghahramani, 2002 ) is a classic semi-supervised algorithm that has been employed in WSD (Niu et al., 2005) and other NLP tasks (Chen et al., 2006; Zhou, 2011) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-107",
"text": "The procedure involves predicting senses for not only the target cases but also for unannotated words queried from a corpus."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-108",
"text": "It represents both the labeled target cases and the unlabeled words as points in a vector space and iteratively propagates sense labels from the labeled points to the unlabeled ones."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-109",
"text": "In this way, it can be used to construct non-spherical clusters and to give more influence to frequent senses."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-110",
"text": "Overall algorithm."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-111",
"text": "The overall disambiguation procedure that we implemented proceeds as follows:"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-112",
"text": "1. Monosemous: First, the WSD algorithm checks whether the target lemma is monosemous (i.e., there is only one synset)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-113",
"text": "In this case, the disambiguation is trivial."
},
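The overall procedure, as far as it is described here (together with the averaging and MFS-fallback steps discussed later in the paper), might be sketched as follows; the function and its arguments are hypothetical names, not the authors' implementation.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def disambiguate(lemma, context_vec, candidates, sense_embs, mfs):
    # 1. Monosemous: a lemma with a single candidate synset is trivial.
    if len(candidates) == 1:
        return candidates[0]
    # 2. Otherwise pick the closest sense among the candidates that have an
    #    embedding (averaging; label propagation could be substituted here).
    seen = [s for s in candidates if s in sense_embs]
    if seen:
        return max(seen, key=lambda s: cosine(context_vec, sense_embs[s]))
    # 3. No annotated evidence at all: fall back to the most frequent sense.
    return mfs[lemma]

embs = {"bank^1_n": np.array([1.0, 0.0]), "bank^2_n": np.array([0.0, 1.0])}
pred = disambiguate("bank", np.array([0.9, 0.1]),
                    ["bank^1_n", "bank^2_n"], embs, {"bank": "bank^1_n"})
```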
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-115",
"text": "**REPRODUCTION STUDY: METHODOLOGY**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-116",
"text": "Before we report the results of our experiments, we describe the datasets used and give some details regarding our implementation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-117",
"text": "Training data."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-118",
"text": "The 100-billion-token corpus used in the original publication is not publicly available."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-119",
"text": "Therefore, for the training of the LSTM models, we used the English Gigaword Fifth Edition (Linguistic Data Consortium (LDC) catalog number LDC2011T07)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-120",
"text": "The corpus consists of 1.8 billion tokens in 4.1 million documents, originating from four major news agencies."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-121",
"text": "We leave the study of bigger corpora for future work."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-122",
"text": "For the training of the sense embeddings, we use the same two corpora as Yuan et al. (2016): 1. SemCor (Miller et al., 1993) is a corpus containing approximately 240,000 sense-annotated words."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-123",
"text": "The tagged documents originate from the Brown corpus (Francis and Kucera, 1979) and cover various genres."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-124",
"text": "2. OMSTI (Taghipour and Ng, 2015) contains one million sense annotations automatically tagged by exploiting the English-Chinese part of the parallel MultiUN corpus (Eisele and Chen, 2010) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-125",
"text": "A list of English translations was manually created for each WordNet sense."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-126",
"text": "If the Chinese translation of an English word matches one of the manually curated translations for a WordNet sense, that sense is selected."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-127",
"text": "Implementation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-128",
"text": "We used the BeautifulSoup HTML parser to extract plain text from the Gigaword corpus."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-129",
"text": "Then, we used the English models of spaCy 1.8.2 for sentence boundary detection and tokenization."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-130",
"text": "The LSTM model is implemented using TensorFlow 1.2.1 (Abadi et al., 2015) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-131",
"text": "We chose TensorFlow because of its industrial-grade quality and because it can train large-scale models."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-132",
"text": "The main computational bottleneck of the entire process is the training of the LSTM model."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-133",
"text": "Although we do not use a 100-billion-token corpus, training the model on Gigaword can already take years if not optimized properly."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-134",
"text": "First, to reduce training time, we assumed that all (padded) sentences in a batch have the same length."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-135",
"text": "This optimization increases the speed by 17% as measured on a smaller model (h = 100, p = 10)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-136",
"text": "Second, following Yuan et al., we use the sampled softmax loss function (Jean et al., 2015) ."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-137",
"text": "Third, we grouped sentences of similar length together while varying the number of sentences in a batch to fully utilize GPU RAM."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-138",
"text": "Together, these heuristics increased training speed by 42 times."
},
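The length-bucketing heuristic described above can be sketched as follows; `length_batches` and its `max_tokens` parameter are illustrative assumptions, not the actual implementation.

```python
from collections import defaultdict

def length_batches(sentences, max_tokens=16):
    """Group sentences of equal length, then vary the batch size so each
    batch holds roughly max_tokens tokens (padding-free batches)."""
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    batches = []
    for length in sorted(buckets):
        # Longer sentences -> fewer sentences per batch, to keep GPU RAM full.
        per_batch = max(1, max_tokens // length)
        group = buckets[length]
        batches.extend(group[i:i + per_batch]
                       for i in range(0, len(group), per_batch))
    return batches

sents = [["a"], ["b", "c"], ["d", "e"], ["f", "g"], ["h", "i", "j"]]
batches = length_batches(sents, max_tokens=4)
```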
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-139",
"text": "Although Yuan et al. proposed to use a distributed implementation of label propagation (Ravi and Diao, 2015) , we found that scikit-learn (Pedregosa et al., 2011) was fast enough for our experiments."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-140",
"text": "For hyperparameter tuning, we use the annotations in OMSTI (which are not used at test time)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-141",
"text": "After measuring the performance of several variants of label propagation (scikit-learn implementation: LabelPropagation or LabelSpreading; similarity measure: inner product or radial basis function with different values of \u03b3), we found that the combination of LabelSpreading and inner-product similarity leads to the best result on the development set, also better than averaging."
},
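A minimal sketch of this configuration with scikit-learn is shown below; the toy 2-D vectors are illustrative, and the callable inner-product kernel is one way to express the similarity measure described above.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Labeled points stand in for annotated occurrences (senses 0 and 1);
# points labeled -1 stand in for unannotated occurrences from a corpus.
X = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]])
y = np.array([0, 1, -1, -1])

# LabelSpreading with an inner-product kernel, the combination that
# worked best in our hyperparameter tuning.
model = LabelSpreading(kernel=lambda a, b: np.dot(a, b.T))
model.fit(X, y)
labels = model.transduction_  # labels propagated to the unlabeled points
```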
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-142",
"text": "Evaluation framework."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-143",
"text": "For evaluating the WSD predictions, we selected two test sets: one from the Senseval-2 (Palmer et al., 2001) competition, which tests the disambiguation of nouns, verbs, adjectives and adverbs, and one from the 2013 edition (Navigli et al., 2013), which focuses only on nouns."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-144",
"text": "The test set from Senseval-2 is the English All-Words Task; senseval2 henceforth."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-145",
"text": "This dataset contains 2,282 annotations from three articles from the Wall Street Journal."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-146",
"text": "Most of the annotations are nominal, but the competition also contains annotations for verbs, adjectives, and adverbs."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-147",
"text": "In this test set, 66.8% of all target words are annotated with the most-frequent sense (MFS) of the lemma."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-148",
"text": "This means that the simple strategy of always selecting the MFS would score 66.8% F1 on this dataset."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-149",
"text": "The test set from SemEval-2013 is the one taken from task 12: Multilingual Word Sense Disambiguation; semeval2013 henceforth."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-150",
"text": "This task consists of two disambiguation tasks: Entity Linking and Word Sense Disambiguation for English, German, French, Italian, and Spanish."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-151",
"text": "This test set contains 13 articles from previous editions of the workshop on Statistical Machine Translation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-152",
"text": "Table 1: Performance of our implementation compared to already published results."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-153",
"text": "We report the model/method used to perform WSD, the annotated dataset and scorer used, and the F1 for each test set."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-154",
"text": "In the naming of our models, LSTM indicates that the averaging technique was used for the sense assignment, while LSTMLP refers to the results obtained using label propagation (see Section 3)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-155",
"text": "The datasets following T: indicate the annotated corpus used to represent the senses, while U:OMSTI stands for using OMSTI as unlabeled sentences when label propagation is used."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-156",
"text": "P: SemCor indicates that sense distributions from SemCor are used in the system architecture."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-157",
"text": "Three scorers are used: \"framework\" refers to the WSD evaluation framework from Raganato et al. (2017a) ; \"mapping to WN3.0\" refers to the evaluation used by Yuan et al. (2016) while \"competition\" refers to the scorer provided by the competition itself (e.g., semeval2013)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-158",
"text": "The articles contain 1,644 test instances in total, all of which are nouns."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-159",
"text": "The application of the MFS baseline on this dataset yields an F1 score of 63.0%."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-161",
"text": "**RESULTS**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-162",
"text": "In this section, we report our reproduction of the results of Yuan et al. (2016) and additional experiments to gain a deeper insight into the strengths and weaknesses of the approach."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-163",
"text": "These experiments focus on the performance on most- and less-frequent senses, the coverage of the annotated dataset and its impact on the overall predictions, the granularity of the sense representation, and the impact of unannotated data and model complexity on WSD accuracy."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-164",
"text": "Reproduction results."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-165",
"text": "We trained the LSTM model with the best settings reported in Yuan et al. (2016) (hidden layer size h = 2048, embedding dimensionality p = 512) using a machine equipped with an Intel Xeon E5-2650, 256GB of RAM, 8TB of disk space, and two NVIDIA GeForce GTX 1080 Ti GPUs."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-166",
"text": "During our training, one epoch took about one day to finish with TensorFlow fully utilizing one GPU."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-167",
"text": "The whole training process took four months."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-168",
"text": "We tested the performance of the downstream WSD task three times during the training and observed that the best performance is obtained at the 65th epoch, despite a later model producing a lower negative log-likelihood."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-169",
"text": "Thus, we used the model produced at the 65th epoch for our experiments below."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-170",
"text": "Table 1 presents the results on the senseval2 and semeval2013 test sets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-171",
"text": "The top part of the table presents our reproduction results, the middle part reports the results from Yuan et al. (2016) , while the bottom part reports a representative sample of the other state-of-the-art approaches."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-172",
"text": "It should be noted that with the test set semeval2013, all scorers use WordNet 3.0, therefore the performance of the various methods can be directly compared."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-173",
"text": "However, not all answers in senseval2 can be mapped to WN3.0 and we do not know how Yuan et al. (2016) handled these cases."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-174",
"text": "In the WSD evaluation framework (Moro et al., 2014 ) that we selected for evaluation, these cases were either re-annotated or removed."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-175",
"text": "Thus, our F1 on senseval2 cannot be directly compared with the F1 in the original paper."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-176",
"text": "From a first glance at Table 1 , we observe that if we use SemCor to train the synset embeddings, then our results come close to the state-of-the-art on senseval2 (0.720 vs. 0.733)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-177",
"text": "On semeval2013, we achieve results comparable to other embeddings-based approaches (Raganato et al., 2017b; Iacobacci et al.). Table 2: Performance of our implementation with respect to MFS and LFS recall."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-178",
"text": "R_mfs and R_lfs are the recall on the most-frequent-sense and least-frequent-sense instances, respectively."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-179",
"text": "n represents the number of considered instances."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-181",
"text": "However, the gap with the graph-based approach of Weissenborn et al. (2015) is still significant."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-182",
"text": "When we use both SemCor and OMSTI as the annotated data, our results drop by 0.02 points on senseval2, whereas they increase by almost 0.01 on semeval2013."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-183",
"text": "Unlike Yuan et al. (2016), we did not observe an improvement from label propagation (comparing T:SemCor, U:OMSTI against T:SemCor without propagation)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-184",
"text": "However, the performance of the label propagation strategy is still competitive on both test sets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-185",
"text": "Most- vs. less-frequent-sense instances."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-187",
"text": "The original paper only analyses the performance on the whole test sets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-188",
"text": "We extend this analysis by looking at the performance for disambiguating the most-frequent-sense (MFS) and less-frequent-sense (LFS) instances."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-189",
"text": "The first type of instances are the ones for which the correct link is the most-frequent sense, whereas the second subset consists of the remaining ones."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-190",
"text": "This analysis is important because it is well-known that the simple strategy of always choosing the MFS is a strong baseline in WSD, thus there is a tendency for WSD systems to overfit towards the MFS (Postma et al., 2016) ."
},
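The MFS/LFS split of recall used in this analysis can be computed as follows; `mfs_lfs_recall` and the toy per-instance lists are illustrative assumptions, not the paper's evaluation code.

```python
def mfs_lfs_recall(preds, gold, mfs_of):
    """R_mfs: recall on instances whose gold sense is the lemma's most
    frequent sense; R_lfs: recall on the remaining (LFS) instances.
    preds, gold, mfs_of are aligned per-instance lists."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for p, g, m in zip(preds, gold, mfs_of):
        is_mfs = (g == m)          # is the gold answer the MFS of the lemma?
        totals[is_mfs] += 1
        hits[is_mfs] += (p == g)   # count a hit when the prediction is correct
    return hits[True] / totals[True], hits[False] / totals[False]

# Four hypothetical instances: two MFS instances (both correct),
# two LFS instances (one correct, one wrongly predicted as the MFS).
preds = ["s1", "s1", "s2", "s1"]
gold  = ["s1", "s1", "s2", "s2"]
mfs   = ["s1", "s1", "s1", "s1"]
r_mfs, r_lfs = mfs_lfs_recall(preds, gold, mfs)
```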
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-191",
"text": "Table 2 shows that the method by Yuan et al. (2016) does not overfit towards the MFS to the same extent as other supervised systems, since the recall on LFS instances is still quite high at 0.41 (a lower recall on LFS instances than on MFS ones is expected due to the reduced training data for them)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-192",
"text": "On semeval2013, the recall on LFS instances is already relatively high using only SemCor (0.33), and reaches 0.38 when using both SemCor and OMSTI."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-193",
"text": "For comparison, the default IMS system (Zhong and Ng, 2010) trained on SemCor obtains an R_lfs of only 0.15 on semeval2013 (Postma et al., 2016) and reaches 0.33 only with a large amount of annotated data."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-194",
"text": "Finally, our implementation of the label propagation does seem to slightly overfit towards the MFS."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-195",
"text": "When we compare the results of the averaging technique using SemCor and OMSTI versus when we use label propagation, we notice an increase in the MFS recall (from 0.85 to 0.91), whereas the LFS recall drops from 0.40 to 0.32."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-196",
"text": "Meaning coverage in annotated datasets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-197",
"text": "The WSD procedure depends on an annotated corpus to compose its sense representations, making missing annotations an insurmountable obstacle."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-198",
"text": "In fact, annotated datasets only contain annotations for a proper subset of the possible candidate synsets listed in WordNet."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-199",
"text": "We analyze this phenomenon using four statistics:"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-200",
"text": "1. Candidate Coverage: For each test set, we performed a lookup in WordNet to determine the unique candidate synsets of all target lemmas."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-201",
"text": "We then determined what percentage of these candidate synsets have at least one annotation in the annotated dataset."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-202",
"text": "2. Lemma Coverage: Given a target lemma in a test set, we performed a lookup in WordNet to determine the unique candidate synsets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-203",
"text": "If all candidate synsets of that target lemma have at least one annotation in the annotated dataset, we claim that the lemma is covered."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-204",
"text": "The lemma coverage is then the percentage of all covered target lemmas."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-205",
"text": "A high lemma coverage indicates that the annotated dataset covers most of the meanings in the test set."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-206",
"text": "3. Gold Coverage: We calculate the percentage of the correct answers in the test set that have at least one annotation in the annotated dataset."
},
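The three coverage statistics above can be sketched as one small function; the names and toy synset sets are illustrative assumptions, not the paper's code.

```python
def coverage_stats(candidates_by_lemma, annotated, gold_answers):
    """candidates_by_lemma: lemma -> set of WordNet candidate synsets
    annotated: set of synsets with at least one annotation in the corpus
    gold_answers: the correct synset of each test instance."""
    # 1. Candidate coverage: share of all candidate synsets seen in the corpus.
    all_cands = set().union(*candidates_by_lemma.values())
    candidate_cov = len(all_cands & annotated) / len(all_cands)
    # 2. Lemma coverage: share of lemmas whose candidates are ALL annotated.
    lemma_cov = (sum(cands <= annotated
                     for cands in candidates_by_lemma.values())
                 / len(candidates_by_lemma))
    # 3. Gold coverage: share of correct answers seen at least once.
    gold_cov = sum(g in annotated for g in gold_answers) / len(gold_answers)
    return candidate_cov, lemma_cov, gold_cov

stats = coverage_stats(
    {"horse": {"h1", "h2"}, "oak": {"o1"}},
    annotated={"h1", "o1"},
    gold_answers=["h2", "o1"])
```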
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-207",
"text": "The column \"Candidate Coverage\" of Table 3 shows that SemCor contains less than 70% of all candidate synsets for senseval2 and semeval2013, meaning that a model will never have a representation for more than 30% of the candidate synsets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-208",
"text": "Even with the addition of OMSTI, the coverage does not exceed 70%, meaning that we lack evidence for a significant number of potential annotations."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-209",
"text": "Moreover, the column \"Lemma Coverage\" illustrates that we have evidence for all potential solutions for only 30% of the lemmas in both WSD competitions, meaning that in the large majority of the cases some solutions are never seen."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-210",
"text": "The column \"Gold coverage\" measures whether the right answers are at least seen in the annotated dataset."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-211",
"text": "The numbers illustrate that 20% of the solutions in the test sets do not have any annotations."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-212",
"text": "With our approach, these answers can only be returned if the lemma is monosemous or by random guess otherwise."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-213",
"text": "To further investigate these issues, Table 4 reports the recall of the various disambiguation strategies that can be invoked depending on the coverage of the lemma (monosemous, averaging, label propagation, or MFS; see the overall procedure in Section 3)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-214",
"text": "We observe that the MFS fallback plays a significant role in obtaining the overall high accuracy since it is invoked many times, especially with OMSTI due to the low coverage of the dataset (in this case it is invoked in 775 cases vs. 1072 of averaging)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-215",
"text": "For example, if we had not applied the MFS fallback strategy for senseval2 using SemCor as the annotated corpus, our performance would have dropped from 0.72 to 0.66, below the MFS baseline of 0.67 for this task."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-216",
"text": "Label propagation was indeed applied in half of the cases, but leads to lower results."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-217",
"text": "From these results, we learn that the effectiveness of this method strongly depends on the coverage of the annotated datasets: If it is not high, as it is with OMSTI, then the performance of this method reduces to the one of choosing the MFS."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-218",
"text": "Granularity of sense representation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-219",
"text": "Rothe and Sch\u00fctze (2017) provided evidence for the claim that the granularity of the sense representations has an influence on WSD performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-220",
"text": "In particular, their WSD system performed better when trained on sensekeys (called lexemes in their paper) than on synsets."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-221",
"text": "Although sensekey-based disambiguation results in less annotated data per target lemma, the sensekey representation is more precise than the synset-level one, since a sensekey ties a lemma to a particular meaning."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-222",
"text": "The reimplementation discussed in this paper allows us to answer the question: \"How will LSTM models work if we lower the disambiguation level from synset to sensekey?\" Table 5 presents the results of this experiment."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-223",
"text": "As the table shows, our method also performs better on both test sets when trained on sensekeys."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-224",
"text": "This behavior is interesting and one possible explanation is that sensekeys are more discriminative than synsets and this favors the disambiguation."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-225",
"text": "Figure 2: The effect of (a) the size of the unannotated corpus and (b) the number of parameters on WSD performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-226",
"text": "Number of parameters includes the weights of the hidden layer, the weights of the projection layer, and the input and output embeddings."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-227",
"text": "Notice that the horizontal axis is in log scale."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-228",
"text": "Impact of unannotated data and model size."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-229",
"text": "Since unannotated data is abundant, it is tempting to use more and more data to train language models, hoping that better word embeddings would translate into improved WSD performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-230",
"text": "The fact that Yuan et al. (2016) used a 100-billion-token corpus only reinforces this intuition."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-231",
"text": "We empirically evaluate the effectiveness of unlabeled data by varying the size of the corpus used to train the LSTM models and measuring the corresponding WSD performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-232",
"text": "In particular, the size of the training data was set to 1%, 10%, 25%, and 100% of the Gigaword corpus (1.8 × 10^7, 1.8 × 10^8, 4.5 × 10^8, and 1.8 × 10^9 words, respectively)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-233",
"text": "Figure 2a shows the effect of unannotated data volume on WSD performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-234",
"text": "The data points at 100 billion (10^11) tokens correspond to the results reported by Yuan et al. (2016)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-235",
"text": "As might be expected, a bigger corpus leads to more meaningful context vectors and therefore higher performance on WSD."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-236",
"text": "However, the amount of data needed for a 1% improvement in F1 grows exponentially (notice that the horizontal axis is in log scale)."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-237",
"text": "Extrapolating from this graph, reaching 0.8 F1 by adding more unannotated data alone would require a corpus of 10^12 tokens."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-238",
"text": "This observation also applies to the balance of the sense assignment."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-239",
"text": "Using only 25% of the unannotated data already yields a recall of 35% on the less-frequent senses."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-240",
"text": "In addition, one might expect to push the performance further by increasing the capacity of the LSTM models."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-241",
"text": "To evaluate this possibility, we performed an experiment in which we varied the size of the LSTM models trained on 100% of the Gigaword corpus and evaluated them on senseval2 and semeval2013."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-242",
"text": "Figure 2b suggests that it is possible, but one would need exponentially bigger models."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-243",
"text": "Finally, Reimers and Gurevych (2017) have shown that it is crucial to report the distribution of test scores instead of a single score, as reporting only one score might lead to wrong conclusions."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-244",
"text": "As pointed out at the beginning of Section 5, our biggest models take months to train, making training multiple versions of them impractical."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-245",
"text": "However, we trained our smallest model (h = 100, p = 10) ten times and our second smallest model (h = 256, p = 64) five times, and observed that as the number of parameters increased, the standard deviation of F1 decreased from 0.008 to 0.003."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-246",
"text": "We therefore believe that random fluctuation does not affect the interpretation of the results."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-247",
"text": "----------------------------------"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-248",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-249",
"text": "This paper reports the results of a reproduction study of the model proposed by Yuan et al. (2016) and an additional analysis to gain a deeper understanding of the impact of various factors on its performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-250",
"text": "A number of interesting conclusions can be drawn from our results."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-251",
"text": "First, we observed that a very large unannotated dataset is not needed to achieve state-of-the-art all-words WSD performance: we used the Gigaword corpus, which is two orders of magnitude smaller than the proprietary corpus of Yuan et al. (2016), and obtained similar performance on senseval2 and semeval2013."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-252",
"text": "A more detailed analysis hints that adding more unannotated data and increasing model capacity are subject to diminishing returns."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-253",
"text": "Moreover, we observed that this approach has a more balanced sense assignment than other techniques, as shown by the relatively good performance on less-frequent-sense instances."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-254",
"text": "In addition, we identified that the limited sense coverage in the annotated datasets places a potential upper bound on the overall performance."
},
{
"sent_id": "97f8d0af85eda3e453fc4fb00819f0-C001-255",
"text": "The code with detailed replication instructions is available at https://github.com/cltl/wsd-dynamic-sense-vector and the trained models at https://figshare.com/articles/A_Deep_Dive_into_Word_Sense_Disambiguation_with_LSTM/6352964."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-3"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-17"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-62"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-68"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-102"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-3",
"97f8d0af85eda3e453fc4fb00819f0-C001-17",
"97f8d0af85eda3e453fc4fb00819f0-C001-62",
"97f8d0af85eda3e453fc4fb00819f0-C001-68",
"97f8d0af85eda3e453fc4fb00819f0-C001-102"
]
},
"@USE@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-3",
"97f8d0af85eda3e453fc4fb00819f0-C001-4"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-30"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-57"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-122"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-157"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-162"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-165"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-171"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-191"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-234"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-249"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-3",
"97f8d0af85eda3e453fc4fb00819f0-C001-30",
"97f8d0af85eda3e453fc4fb00819f0-C001-57",
"97f8d0af85eda3e453fc4fb00819f0-C001-122",
"97f8d0af85eda3e453fc4fb00819f0-C001-157",
"97f8d0af85eda3e453fc4fb00819f0-C001-162",
"97f8d0af85eda3e453fc4fb00819f0-C001-165",
"97f8d0af85eda3e453fc4fb00819f0-C001-171",
"97f8d0af85eda3e453fc4fb00819f0-C001-191",
"97f8d0af85eda3e453fc4fb00819f0-C001-234",
"97f8d0af85eda3e453fc4fb00819f0-C001-249"
]
},
"@DIF@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-5"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-30",
"97f8d0af85eda3e453fc4fb00819f0-C001-31"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-183"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-251"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-5",
"97f8d0af85eda3e453fc4fb00819f0-C001-30",
"97f8d0af85eda3e453fc4fb00819f0-C001-183",
"97f8d0af85eda3e453fc4fb00819f0-C001-251"
]
},
"@MOT@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-18",
"97f8d0af85eda3e453fc4fb00819f0-C001-19",
"97f8d0af85eda3e453fc4fb00819f0-C001-20",
"97f8d0af85eda3e453fc4fb00819f0-C001-21",
"97f8d0af85eda3e453fc4fb00819f0-C001-22"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-18",
"97f8d0af85eda3e453fc4fb00819f0-C001-22"
]
},
"@EXT@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-22"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-22"
]
},
"@SIM@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-171",
"97f8d0af85eda3e453fc4fb00819f0-C001-172"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-183",
"97f8d0af85eda3e453fc4fb00819f0-C001-184"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-234"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-171",
"97f8d0af85eda3e453fc4fb00819f0-C001-183",
"97f8d0af85eda3e453fc4fb00819f0-C001-234"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"97f8d0af85eda3e453fc4fb00819f0-C001-173"
],
[
"97f8d0af85eda3e453fc4fb00819f0-C001-230"
]
],
"cite_sentences": [
"97f8d0af85eda3e453fc4fb00819f0-C001-173",
"97f8d0af85eda3e453fc4fb00819f0-C001-230"
]
}
}
},
"ABC_4f646eceef2e5fc447a367488b6aaf_2": {
"x": [
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-262",
"text": "The random Table 2 ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-2",
"text": "We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-3",
"text": "By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-4",
"text": "Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-5",
"text": "To contextualize this new task, we develop models that mirror many of the advances made in prior settings as well as single-modality baselines."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-6",
"text": "While some of these techniques transfer, we find significantly lower absolute performance in the continuous setting -suggesting that performance in prior 'navigation-graph' settings may be inflated by the strong implicit assumptions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-9",
"text": "Springing forth from the pages of science fiction and capturing the daydreams of weary chore-doers everywhere, the promise and potential of general-purpose robotic assistants that follow natural language instructions has been long understood."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-10",
"text": "Taking a small step towards this goal, recent work has begun developing artificial agents that follow natural language navigation instructions in perceptually-rich, simulated environments [4, 6] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-140",
"text": "These trajectories consist of a sequence of nodes \u03c4 = [v_1, ..., v_T], with length T averaging between 4 and 6 nodes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-11",
"text": "An example instruction might be \"Go down the hall and turn left at the wooden desk."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-12",
"text": "Continue until you reach the kitchen and then stop by the kettle.\" and agents are evaluated by their ability to follow the described path in (potentially novel) simulated environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-13",
"text": "Many of these tasks have been developed from datasets of panoramic images captured in real scenes -e.g."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-14",
"text": "Google StreetView images in Touchdown [6] or Matterport3D panoramas captured in homes in Vision-and-Language Navigation (VLN) [4] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-15",
"text": "This paradigm enables efficient data collection and high visual fidelity compared to 3D scanning or creating synthetic environments; however, scenes are only observed from a sparse set of points relative to the full 3D environment (\u223c117 viewpoints per environment in VLN)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-16",
"text": "As a consequence, environments in these tasks are defined in terms of a navigation graph (or nav-graph for short), illustrated in Fig. 1: (a) Vision-and-Language Navigation (VLN); (b) VLN in Continuous Environments (VLN-CE)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-17",
"text": "The VLN setting (a) operates on a fixed topology of panoramic images (shown in blue) -assuming perfect navigation between nodes (often meters apart) and precise localization."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-18",
"text": "Our VLN-CE setting (b) lifts these assumptions by instantiating the task in continuous environments with low-level actions -providing a more realistic testbed for robot instruction following."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-19",
"text": "The nav-graph is a static topological representation of 3D space."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-20",
"text": "As shown in Fig. 1(a) , nodes in the nav-graph correspond to 360\u00b0 panoramic images taken at fixed locations and edges between nodes indicate navigability."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-21",
"text": "This nav-graph based formulation introduces a number of assumptions that make it a poor proxy for what a robotic agent would encounter while navigating the real world."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-22",
"text": "Focusing our discussion on Vision-and-Language Navigation (VLN), the existence and common usage of the nav-graph imply the following assumptions: -Known topology."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-23",
"text": "Rather than continuous environments in which agents can move freely, agents operate on a fixed topology of traversable nodes (shown in blue in Fig. 1(a) )."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-24",
"text": "Aside from being a poor match to robot control, this also provides prior information about environment layout to agents -even in \"unseen\" test settings."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-25",
"text": "For example, it is common practice to define agent actions by selecting directions in the current panorama and 'snapping' to the nearest adjacent nav-graph node in that direction."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-26",
"text": "How an actual agent might acquire and update such a topology in new environments is an open question."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-27",
"text": "-Oracle navigation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-28",
"text": "Movement between adjacent nodes in the nav-graph is deterministic, implying the existence of an oracle navigator capable of accurately traversing multiple meters in the presence of obstacles -abstracting away the problem of visual navigation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-29",
"text": "Further, this movement between nodes is perceptually akin to teleportation -the current panorama is simply replaced by the panorama at the new location meters away."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-30",
"text": "This is in contrast to the continuous stream of observations a real agent would encounter while moving."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-31",
"text": "-Perfect localization."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-32",
"text": "Agents are given their precise location and heading at all times."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-33",
"text": "Most works use this data to encode precise geometry between nodes in the nav-graph as part of the decision making process, e.g. moving 30\u00b0W and 1.12m forward from the previous node."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-34",
"text": "Others use precise agent localization to construct spatial maps of the environment on which to reason about paths [3] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-35",
"text": "However, precise localization indoors is still a challenging problem."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-36",
"text": "Taken together, these assumptions make current settings poor reflections of the real world both in terms of control (ignoring actuation, navigation, and localization error) and visual stimuli (lacking the poor framing and long observation sequences agents will encounter)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-37",
"text": "In essence, the problem is reduced to that of visually-guided graph search."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-38",
"text": "As such, closing the loop by transferring these trained agents to physical robotic platforms has not been examined."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-39",
"text": "These assumptions are often justified by invoking existing technologies as potential oracles."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-40",
"text": "For example, simultaneous localization and mapping (SLAM) or odometry systems can offer strong localization in appropriate conditions [16, 21] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-41",
"text": "Likewise, algorithms for path planning and control can navigate short distances in the presence of obstacles [11, 25, 31] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-42",
"text": "Further, it is reasonable to suggest that issuing commands at the level of relative waypoints (in analogy to nav-graph nodes) is the proper interface between language-guided AI navigators and lower-level agent control."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-43",
"text": "However, these techniques are each independently far from perfect and such an agent would need to learn the limitations of these lower-level control systems -facing consequences when proposed waypoints cannot be reached effectively."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-44",
"text": "Integrative studies such as these that combine and evaluate techniques for control and mapping with learned AI agents are not possible in current nav-graph based problem settings."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-45",
"text": "In this work, we develop a continuous setting that enables these types of studies and take a first step towards integrating VLN agents with control via low-level actions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-46",
"text": "Vision-and-Language Navigation in Continuous Environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-47",
"text": "In this work, we focus in on the Vision-and-Language Navigation (VLN) [4] task and lift these implicit assumptions by instantiating it in continuous 3D environments rendered in a high-throughput simulator [19] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-48",
"text": "Consequently, we call this task Vision-and-Language Navigation in Continuous Environments (VLN-CE)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-49",
"text": "Agents in our task are free to navigate to any unobstructed point through a set of low-level actions (e.g. move forward 0.25m, turn-left 15 degrees) rather than teleporting between fixed nodes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-50",
"text": "This setting introduces many challenges ignored in prior work."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-51",
"text": "Agents in VLN-CE face significantly longer time horizons; the average number of actions along a path in VLN-CE is \u223c55 compared to the 4-6 node hops in VLN (as illustrated in Fig. 1 )."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-52",
"text": "Moreover, the views the agent receives along the way are not well-posed by careful human operators as in the panoramas, but rather a consequence of the agent's actions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-53",
"text": "Agents must also learn to avoid getting stuck on obstacles, something that is structurally impossible in VLN's navigability defined nav-graph."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-54",
"text": "Further, agents are not provided their location or heading while navigating."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-55",
"text": "We develop agent architectures for this task and explore how popular mechanisms for VLN transfer to the VLN-CE setting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-56",
"text": "Specifically, we develop a simple sequence-to-sequence baseline architecture as well as a cross-modal attentionbased model."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-57",
"text": "We perform a number of input-modality ablations to assess the biases and baselines in this new setting (including models without perception or instructions as suggested in [27] )."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-58",
"text": "Unlike in VLN where depth is rarely used, our analysis reveals depth to be an integral signal for learning embodied navigation -echoing similar findings in point-goal navigation tasks [19, 31] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-59",
"text": "We also apply existing training augmentations [17, 24, 26] popular in VLN to our setting, finding mixed results."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-60",
"text": "Overall, our best performing agent successfully navigates to the goal in approximately a third of episodes in unseen environments -taking an average of 88 actions in this long-horizon task."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-61",
"text": "Table 1 ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-62",
"text": "Comparison of language-guided visual navigation tasks."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-63",
"text": "Ours is the only to provide unconstrained navigation in real environments for crowdsourced instructions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-65",
"text": "**TASK**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-66",
"text": "Task | Instructions | Environment | Navigation: LANI [20] | Crowdsourced | Synthetic | Unconstrained; StreetNav [13] | Templated | Real | Nav-Graph Based; Touchdown [6] | Crowdsourced | Real | Nav-Graph Based; VLN [4] | Crowdsourced | Real | Nav-Graph Based"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-68",
"text": "VLN-CE (ours) | Crowdsourced | Real | Unconstrained"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-69",
"text": "To further examine the relationship between the nav-graph-based VLN task and VLN-CE, we also transfer paths from agents trained in continuous environments back to the nav-graph to provide a direct comparison."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-70",
"text": "We find significant gaps in performance between these settings indicative of the strong prior provided by the nav-graph."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-71",
"text": "This suggests prior results in VLN may be overly optimistic in terms of progress towards instruction-following robots functioning in the wild."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-72",
"text": "Contributions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-73",
"text": "To summarize our contributions, we: -Lift the VLN task to continuous 3D environments -removing many unrealistic assumptions imposed by the nav-graph-based representation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-74",
"text": "We will publicly release the VLN-CE codebase and our baseline models."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-75",
"text": "-Develop model architectures for the VLN-CE task and evaluate a suite of single-input ablations to assess the biases and baselines of the setting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-141",
"text": "Converting Room-to-Room Trajectories to Habitat."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-76",
"text": "-Investigate how a number of popular techniques in VLN transfer to this more challenging long-horizon setting -identifying significant gaps in performance."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-78",
"text": "**RELATED WORK**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-79",
"text": "Language-guided Visual Navigation Tasks."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-80",
"text": "Language-guided visual navigation tasks require agents to follow navigation directions in simulated environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-81",
"text": "There have been a number of recent tasks proposed in this space [4, 6, 13, 20] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-82",
"text": "Chen et al. [6] introduce the Touchdown task which studies outdoor language-guided navigation in Google Street View panoramas."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-83",
"text": "Hermann et al. [13] investigate the same setting; however, the instructions are automatically generated from Google Map directions rather than being crowdsourced from human annotators."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-84",
"text": "Both adopt a nav-graph setting due to the source data being panoramic images -constraining agent navigation to fixed points."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-85",
"text": "Misra et al. [20] introduce a simulated environment with unconstrained navigation and a dataset of crowdsourced instructions; however, the environments are unrealistic, synthetic scenes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-86",
"text": "Most related to our work is the Vision-and-Language Navigation (VLN) task of Anderson et al. [4] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-87",
"text": "VLN provides nav-graph trajectories and crowdsourced instructions in Matterport3D [5] environments as the Room-to-Room (R2R) dataset."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-88",
"text": "We build VLN-CE directly on these annotations -converting R2R panorama-based trajectories to fine-grained paths in continuous Matterport3D environments ( Fig. 1 (a) to Fig. 1(b) )."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-89",
"text": "As outlined in the introduction, this shift to continuous environments with unconstrained agent navigation lifts a number of unrealistic assumptions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-90",
"text": "The variation in these tasks is primarily in the source of navigation instructions (crowdsourced from human annotators vs. generated via template), environment realism (hand-designed synthetic worlds vs. captures from real locations), and constraints on agent navigation (nav-graph based navigation vs. unconstrained agent motion)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-91",
"text": "Tab."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-92",
"text": "Tab. 1 provides a comparison between tasks along these axes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-93",
"text": "Our proposed VLN-CE task provides the first setting with crowdsourced instructions in realistic environments with unconstrained agent navigation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-94",
"text": "Approaches to Vision-and-Language Navigation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-95",
"text": "VLN has seen considerable progress from a wide variety of techniques."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-96",
"text": "Multimodal attention mechanisms have become popular to provide better grounding between instructions and the visual scene observation [29] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-97",
"text": "Orthogonal to new modeling architectures, improvements have also come from new training approaches and data augmentation methods."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-98",
"text": "One prevalent technique is to utilize inverse \"speaker\" models to rerank candidate trajectories or augment the available training data by generating instructions for novel trajectories [9] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-99",
"text": "Tan et al. [26] further improve upon this idea by masking a subset of visual features during the speaker's instruction generation process, thereby improving the diversity of the generated instructions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-100",
"text": "Ma et al. [17] show that an additional training signal can be gained by explicitly estimating progress toward the goal (referred to as self-monitoring)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-101",
"text": "We adapt these methods to VLN-CE and examine their impact -finding mixed results."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-102",
"text": "Multimodal attention remains a useful structure; however, speaker-based data augmentation and self-monitoring losses provide mixed results."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-103",
"text": "Other Language-based Embodied AI."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-104",
"text": "A number of other embodied tasks have considered language-conditioned navigation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-105",
"text": "For instance, some tasks refer to specific rooms or objects that agents must then navigate to [7, 10, 30] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-106",
"text": "However, these settings use language to specify end-goals or query agent knowledge rather than to provide navigational directions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-107",
"text": "For example, specifying \"lamp\" or \"What color is the lamp in the living room? \" rather than \"Go down the hall and into the bedroom on the right."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-108",
"text": "Stop by the lamp to the left of the bed.\" This loose coupling of intermediate agent action with the language instruction differentiates these tasks from language-guided navigation settings."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-110",
"text": "**VLN IN CONTINUOUS ENVIRONMENTS (VLN-CE)**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-111",
"text": "We consider a continuous setting for the vision-and-language navigation task which we refer to as Vision-and-Language Navigation in Continuous Environments (VLN-CE)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-112",
"text": "Given a natural language navigation instruction, an agent must navigate from a start position to the described goal in a continuous 3D environment by executing a sequence of low-level actions based on egocentric perception alone."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-113",
"text": "In overview, we develop this setting by transferring nav-graph-based Room-to-Room (R2R) [4] trajectories to reconstructed continuous Matterport3D environments in the Habitat simulator [19] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-114",
"text": "We discuss the task specification and the details of this transfer process in this section."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-115",
"text": "Continuous Matterport3D Environments in Habitat."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-116",
"text": "We set our problem in the Matterport3D (MP3D) [5] dataset, a collection of 90 environments captured through over 10,800 high-definition RGB-D panoramas."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-117",
"text": "In addition to the panoramic images, MP3D also provides corresponding mesh-based 3D environment reconstructions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-118",
"text": "To enable agent interaction with these meshes, we develop the VLN-CE task on top of the Habitat Simulator [19] , a high-throughput simulator that supports basic movement and collision checking for 3D environments including MP3D."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-164",
"text": "Among the 23% of trajectories that were not navigable, we observed two primary failure modes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-119",
"text": "In contrast to the simulator used in VLN [4] , Habitat allows agents to navigate freely in the continuous environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-120",
"text": "Observations and Actions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-121",
"text": "We select observation and action spaces to emulate a ground-based, zero-turning radius robot with a single, forward-mounted RGBD camera, similar to a LoCoBot [1] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-122",
"text": "Agents perceive the world through egocentric RGBD images from the simulator with a resolution of 256\u00d7256 and a horizontal field-of-view of 90 degrees."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-123",
"text": "Note that this is similar to the egocentric RGB perception in the original VLN task [4] but differs from the panoramic observation space adopted by nearly all follow-up work [9, 17, 26, 29] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-124",
"text": "While the simulator is quite flexible in terms of agent actions, we consider four simple, low-level actions for agents in VLN-CE -move forward 0.25m, turn-left or turn-right 15 degrees, or stop to declare that the goal position has been reached."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-125",
"text": "These actions can easily be implemented on robotic agents with standard motion controllers."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-126",
"text": "In contrast, actions to move between panoramas in [4] traverse 2.25m on average and can include avoiding obstacles."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-128",
"text": "**TRANSFERRING NAV-GRAPH TRAJECTORIES**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-129",
"text": "Rather than collecting a new dataset of trajectories and instructions, we instead transfer those from the nav-graph-based Room-to-Room dataset to our continuous setting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-130",
"text": "Doing so enables us to compare existing nav-graph-based techniques with our methods that operate in continuous environments on the same instructions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-131",
"text": "Matterport3D Simulator and the Room-to-Room Dataset."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-132",
"text": "The original VLN task is based on panoramas from Matterport3D (MP3D) [5] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-133",
"text": "To enable agent interaction with these panoramas, Anderson et al. [4] developed the Matterport3D Simulator."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-134",
"text": "Environments in this simulator are defined as nav-graphs E = {V, E}. Each node v \u2208 V corresponds to a panoramic image I captured by a Matterport camera at location x, y, z -i.e."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-135",
"text": "v = {I, x, y, z}. Edges in the graph correspond to navigability between nodes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-136",
"text": "Navigability was defined by ray-tracing between node locations at varying heights to check for obstacles in the reconstructed MP3D scene and then manually inspected."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-137",
"text": "Edges were manually added or removed based on judgement of whether an agent could navigate between nodes -including by avoiding minor obstacles."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-138",
"text": "Agents act by teleporting between adjacent nodes in this graph."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-139",
"text": "Based on this simulator, Anderson et al. [4] collect the Room-to-Room (R2R) dataset containing 7189 trajectories, each with three human-generated instructions on average."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-212",
"text": "where Attn is a scaled dot-product attention [28] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-142",
"text": "Given a mapping between the coordinate frames of Matterport3D Simulator and MP3D in Habitat, it is seemingly simple to transfer the Room-to-Room trajectories -after all, each node has a corresponding xyz location."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-143",
"text": "However, node locations often do not correspond to reachable locations for a ground-based agent -existing at variable height depending on tripod configuration or placed on top of flat furniture like tables."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-144",
"text": "Further, the reconstructions and panoramas may differ if objects or doors are moved between camera captures."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-145",
"text": "Fig. 2 shows an overview of this process and common errors when directly transferring node locations."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-146",
"text": "For each node, v = {I, x, y, z}, we would like to identify the nearest navigable point on the reconstructed mesh, i.e. the closest point that can be occupied by a ground-based agent represented by a 1.5m tall cylinder with a diameter of 0.2m."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-148",
"text": "Directly projecting to the nearest mesh location fails for 73% of nodes, where failure is defined as projecting to distant (>0.5m) or non-navigable points."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-149",
"text": "Many of these points project to ceilings or the tops of nearby objects rather than the floor due to the height of the camera."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-150",
"text": "Instead, we cast a ray up to 2m directly downward from the node location."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-151",
"text": "At small, fixed intervals along this ray, we project to the nearest mesh point."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-152",
"text": "If multiple navigable points are identified, we take the one with minimal horizontal displacement from the original location."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-153",
"text": "If no navigable point is found with less than a 0.5m displacement, we consider this MP3D node unmappable to the 3D mesh and thus invalid."
},
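The downward-ray projection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `snap_to_navigable` stands in for a navmesh query (a hypothetical callable returning the nearest navigable mesh point, or None), and the sampling step size is an assumption.

```python
# Sketch of the waypoint-projection heuristic: cast a ray up to 2m straight
# down from the panorama location, snap samples along it to the navmesh, and
# keep the candidate with minimal horizontal displacement. Nodes whose best
# candidate is displaced more than 0.5m are considered invalid.

def project_node(node_xyz, snap_to_navigable, ray_len=2.0, step=0.1, max_disp=0.5):
    x, y, z = node_xyz
    best = None
    n_steps = int(ray_len / step) + 1
    for i in range(n_steps):
        sample = (x, y, z - i * step)            # move downward along the ray
        cand = snap_to_navigable(sample)         # hypothetical navmesh query
        if cand is None:
            continue
        horiz = ((cand[0] - x) ** 2 + (cand[1] - y) ** 2) ** 0.5
        if horiz <= max_disp and (best is None or horiz < best[0]):
            best = (horiz, cand)
    return best[1] if best else None             # None => node is invalid
```

Invalid nodes (those returning None) would then be reviewed manually, as the text describes.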
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-154",
"text": "We reviewed all invalid nodes manually and made corrections if possible, e.g. shifting nodes to the side of furniture."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-155",
"text": "After these steps, 98.3% of nodes are successfully transferred."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-156",
"text": "We refer to these transferred nodes as waypoint locations in the MP3D environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-157",
"text": "As shown in Fig. 3(a) , points requiring adjustment (3% of points) are transferred with small horizontal displacement, averaging 0.19m from the panorama location."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-158",
"text": "Given a trajectory of converted waypoints \u03c4 = [w_1, . . . , w_T], we would like to verify that an agent can actually navigate between each location."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-159",
"text": "We employ an A*-based heuristic search algorithm to compute an approximate shortest path to a goal location."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-160",
"text": "We run this shortest path algorithm from each waypoint in a trajectory to the next (e.g. w_i to w_{i+1})."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-161",
"text": "A trajectory is considered navigable if, for each pairwise navigation, an agent following the computed shortest path can navigate to within 0.5m of the next waypoint (w_{i+1})."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-162",
"text": "In total, we find 77% of the R2R trajectories navigable in the continuous environment."
},
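The navigability check above can be sketched as a simple loop over consecutive waypoint pairs. Here `shortest_path` is a hypothetical stand-in for the paper's A*-based planner, assumed to return the final position reached when moving from one point toward another:

```python
# Sketch of the trajectory-navigability verification: a trajectory is
# navigable iff, for each consecutive pair (w_i, w_{i+1}), following the
# planned path ends within `goal_radius` (0.5m) of w_{i+1}.

def trajectory_navigable(waypoints, shortest_path, goal_radius=0.5):
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    for w_cur, w_next in zip(waypoints, waypoints[1:]):
        reached = shortest_path(w_cur, w_next)   # hypothetical planner call
        if reached is None or dist(reached, w_next) > goal_radius:
            return False
    return True
```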
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-163",
"text": "Non-Navigable Trajectories."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-165",
"text": "First and most simply, 22% of these included one of the 1.7% of invalid nodes that could not be projected to MP3D 3D meshes and were rejected by default."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-166",
"text": "The remaining trajectories were not navigable because they spanned disjoint regions of the reconstruction, meaning that there was no valid path from some waypoint w_i to w_{i+1}."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-167",
"text": "As shown in Fig. 3(b), these may be caused by holes or other mesh errors dividing the space."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-168",
"text": "Alternatively, objects like chairs may be moved in between panorama captures -possibly resulting in a reconstruction that places the object mesh on top of individual panorama locations."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-169",
"text": "As noted above, nodes in the R2R nav-graph were manually connected if there appeared to be a path between them, even if most other panoramas (and thus the reconstruction) showed objects (e.g. a closed door) blocking their path."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-171",
"text": "**VLN-CE DATASET**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-172",
"text": "In total, the VLN-CE dataset consists of 4475 trajectories converted from R2R train and validation splits."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-173",
"text": "For each trajectory, we provide the multiple natural language instructions from R2R and a pre-computed shortest path following the waypoints via low-level actions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-174",
"text": "As shown in Fig. 3(c), the low-level action space of VLN-CE makes our trajectories significantly longer-horizon tasks, with an average of 55.88 steps compared to the 4-6 in R2R."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-176",
"text": "**INSTRUCTION-GUIDED NAVIGATION MODELS IN VLN-CE**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-177",
"text": "We develop two models for VLN-CE."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-178",
"text": "The first is a simple sequence-to-sequence baseline; the second is a more powerful cross-modal attention model."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-179",
"text": "While there are many differences in the details, these models are conceptually similar to early [4] and more recent [29] work in the nav-graph based VLN task."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-180",
"text": "Exploring these gives insight into the difficulty of this setting, both in isolation and relative to VLN."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-181",
"text": "Further, these models allow us to test whether improvements from early to later architectures carry over to a more realistic setting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-182",
"text": "Both of our models make use of the same observation and instruction encodings described below."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-183",
"text": "Instruction Representation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-184",
"text": "We convert tokenized instructions to corresponding GloVe [23] embeddings, which are processed by recurrent encoders for each model."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-185",
"text": "We denote these encoded tokens as w_1, . . . , w_T for a length-T instruction."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-186",
"text": "Fig. 4. We develop a simple baseline agent (a) as well as an attentional agent (b) comparable to that in [29]."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-188",
"text": "Both receive RGB and depth frames represented by pretrained networks for image classification [8] and point-goal navigation [31] , respectively."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-189",
"text": "Observation Encoding."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-190",
"text": "We separately encode the RGB and depth observations."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-191",
"text": "For RGB, we apply a ResNet50 [12] pretrained on ImageNet [8] to collect semantic visual features."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-192",
"text": "We denote the final spatial features of this model as V = {v_i}, where i indexes over spatial locations."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-193",
"text": "Likewise for depth, we use a modified ResNet50 that was trained to perform point-goal navigation (i.e. to navigate to a location given in relative coordinates) [31] and denote these features as D = {d_i}."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-194",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-195",
"text": "**SEQUENCE-TO-SEQUENCE BASELINE**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-196",
"text": "We consider a simple sequence-to-sequence baseline model shown in Fig. 4(a) ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-197",
"text": "This model consists of a recurrent policy that takes a representation of the visual observation (depth and RGB) and instructions at each time step, then predicts an action a. Concretely, we can write the agent for time step t as"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-198",
"text": "where [\u00b7] denotes concatenation and s is the final hidden state of an LSTM instruction encoder."
},
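One recurrent step of this baseline can be sketched in plain Python. This is a toy illustration of the described structure (mean-pooled visual features concatenated with the instruction encoding, one recurrent update, greedy action scoring); the tanh cell, weight shapes, and tiny dimensions are assumptions, not the paper's exact architecture.

```python
import math

# Toy sketch of the Seq2Seq policy step: x_t = [mean(V), mean(D), s],
# h_t = tanh(W_x x_t + W_h h_{t-1}), then greedy action selection.
ACTIONS = ["forward", "turn-left", "turn-right", "stop"]

def mean_pool(features):
    """Average a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def seq2seq_step(V, D, s, h_prev, W_x, W_h, W_a):
    # concatenate mean-pooled RGB and depth features with instruction encoding s
    x = mean_pool(V) + mean_pool(D) + s
    # toy single-layer recurrent update
    h = [math.tanh(sum(w * xi for w, xi in zip(W_x[j], x)) +
                   sum(w * hi for w, hi in zip(W_h[j], h_prev)))
         for j in range(len(W_x))]
    # score the 4 low-level actions and pick the argmax
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W_a]
    return h, ACTIONS[max(range(len(logits)), key=logits.__getitem__)]
```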
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-199",
"text": "This simple model enables straightforward input-modality ablations and establishes a baseline for the VLN-CE setting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-200",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-201",
"text": "**CROSS-MODAL ATTENTION MODEL**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-202",
"text": "While the previous baseline is a sensible start, it lacks powerful modeling techniques common to vision-and-language tasks, including cross-modal attention and spatial visual reasoning, which are intuitively quite important for language-guided visual navigation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-203",
"text": "Many instructions include relative references (e.g. \"to the left of the table\") that would be difficult to ground from mean-pooled features."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-204",
"text": "Moreover, already-completed parts of the instruction are likely irrelevant to the next decision -pointing towards the potential of attention over instructions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-205",
"text": "We consider a more expressive model shown in Fig. 4(b) that incorporates these mechanisms."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-206",
"text": "This model consists of two recurrent networks -one tracking visual observations as before and the other making decisions based on attended instruction and visual features."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-207",
"text": "We can write this first recurrent network as:"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-208",
"text": "where a_{t\u22121} \u2208 R^{1\u00d732} is a learned linear embedding of the previous action."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-209",
"text": "We encode instructions with a bi-directional LSTM and retain all intermediate hidden states:"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-210",
"text": "We then compute an attended instruction feature \u015d_t over these representations, which is then used to attend to visual (v_t) and depth (d_t) features."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-211",
"text": "Concretely,"
},
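The attention operation used here is standard scaled dot-product attention [28]. As a sketch (shapes are illustrative): for a query q of dimension d and keys K, Attn(q, K) = softmax(qK^T / sqrt(d)) K.

```python
import math

# Scaled dot-product attention over a set of feature vectors K given a
# query q: scores are dot products scaled by sqrt(d), softmax-normalized,
# and used to take a weighted sum of the features.

def attn(q, K):
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]      # numerically stable softmax
    total = sum(exp)
    w = [e / total for e in exp]
    # weighted sum of the feature vectors
    return [sum(wi * k[j] for wi, k in zip(w, K)) for j in range(len(K[0]))]
```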
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-214",
"text": "The second recurrent network then takes a concatenation of these features as input (including an action encoding and the first recurrent network's hidden state) and predicts an action."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-215",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-216",
"text": "**AUXILIARY LOSSES AND TRAINING REGIMES**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-217",
"text": "Aside from modeling details, much of the remaining progress in VLN has come from adjusting the training regime -adding auxiliary losses / rewards [17, 29] , mitigating exposure bias during training [4, 29] , or reducing data sparsity by incorporating synthetically generated data augmentation [9, 26] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-218",
"text": "We explore some of these directions for VLN-CE, but note that this is not an exhaustive accounting of impactful techniques."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-219",
"text": "In particular, we suspect that methods addressing exposure bias and data sparsity in VLN will help in the VLN-CE setting, where these problems may be amplified by lengthy action sequences."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-220",
"text": "We report ablations with and without these techniques in Sec. 5."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-221",
"text": "Imitation Learning."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-222",
"text": "A natural starting point for training is simply to maximize the likelihood of the ground truth trajectories."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-223",
"text": "To do so, we perform teacher-forcing training with inflection weighting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-224",
"text": "As described in [30], inflection weighting places emphasis on time steps where actions change (i.e. a_{t\u22121} \u2260 a_t), adjusting loss weight proportionally to the rarity of such events."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-225",
"text": "This was found to be helpful for problems like navigation with long sequences of repeated actions (e.g. going forward down a hall)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-226",
"text": "We observe a similar effect in early experiments and apply inflection weighting in all our experiments."
},
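The weighting scheme above can be sketched as computing a per-step loss weight. This is a minimal illustration under the description given here (the coefficient 3.2 comes from the implementation details later in the text; treating the first step as an inflection is an assumption):

```python
# Sketch of inflection weighting [30]: cross-entropy terms at steps where the
# ground-truth action changes (a_{t-1} != a_t) are up-weighted; other steps
# keep weight 1.0.

def inflection_weights(actions, coef=3.2):
    weights = []
    prev = None
    for a in actions:
        weights.append(coef if a != prev else 1.0)
        prev = a
    return weights
```

Long runs of repeated actions (e.g. going forward down a hall) thus contribute less to the loss than the rare turning and stopping decisions.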
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-227",
"text": "Coping with Exposure Bias."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-228",
"text": "Imitation learning in auto-regressive settings suffers from a disconnect between training and test -agents are not exposed to the consequences of their actions during training."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-229",
"text": "Prior work has shown significant gains by addressing this issue for VLN through scheduled sampling [4] or reinforcement learning fine-tuning [26, 29] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-230",
"text": "In this work, we apply Dataset Aggregation (DAgger) [24] towards the same end."
},
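The DAgger-style collection can be sketched as follows. This is an illustrative stand-in, not the authors' code: `oracle_action` and `policy_action` are hypothetical callables, and the β = 0.75^n schedule comes from the implementation details later in the text.

```python
import random

# Sketch of DAgger [24] data collection: in round n, take the oracle action
# with probability beta = 0.75**n and the current policy's action otherwise;
# trajectories from every round are aggregated for training.

def collect_round(n, oracle_action, policy_action, horizon, rng=random):
    beta = 0.75 ** n
    return [oracle_action(t) if rng.random() < beta else policy_action(t)
            for t in range(horizon)]

def aggregate(rounds):
    """DAgger trains on the union of trajectories from all rounds."""
    return [step for traj in rounds for step in traj]
```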
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-231",
"text": "While DAgger and scheduled sampling share many similarities, DAgger trains on the aggregated set of trajectories from all iterations 1 to n. Thus, the resulting policy after iteration n is optimized over all past experiences and not just those collected from iteration n."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-231b",
"text": "Synthetic Data Augmentation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-232",
"text": "Another popular strategy is to learn an inverse 'speaker' model that produces instructions given a trajectory."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-233",
"text": "These models can be used to re-rank paths or to generate new trajectory-instruction pairs from any trajectory."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-234",
"text": "Both [26] and [9] take this data augmentation approach and many follow-up works have used these trajectories for gains in performance."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-235",
"text": "We take the \u223c150k synthetic trajectories generated this way from [26] -converting them to our continuous environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-237",
"text": "Progress Monitor. An important aspect of successful navigation is accurately identifying where to stop."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-238",
"text": "Prior work [17] has found improvements from explicitly supervising the agent with a progress-toward-goal signal."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-239",
"text": "Specifically, agents are trained to predict, at each time step, the fraction of the trajectory they have completed."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-240",
"text": "We apply this progress estimation during training with a mean squared error loss term akin to [17] ."
},
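The auxiliary loss described here reduces to a mean-squared error between predicted progress and the ground-truth fraction t/T. A minimal sketch, assuming progress is supervised at every step of a length-T trajectory:

```python
# Sketch of the progress-monitor loss [17]: MSE between the agent's predicted
# progress p_t and the ground-truth fraction t/T at each time step.

def progress_loss(predicted, T):
    targets = [t / T for t in range(1, T + 1)]
    assert len(predicted) == T
    return sum((p - g) ** 2 for p, g in zip(predicted, targets)) / T
```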
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-241",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-242",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-243",
"text": "Setting and Metrics."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-244",
"text": "We train and evaluate our models in VLN-CE."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-245",
"text": "As is common practice, we perform early stopping based on val-unseen performance."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-246",
"text": "We report standard metrics for visual navigation tasks defined in [2, 4, 18] -trajectory length in meters (TL), navigation error in meters from goal at termination (NE), oracle success rate (OS), success rate (SR), success weighted by inverse path length (SPL), and normalized dynamic-time warping (nDTW)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-247",
"text": "For our discussion, we will examine success rate and SPL as the primary metrics for performance and use nDTW to describe how paths differ in shape from ground truth trajectories."
},
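The two primary metrics can be sketched directly from their definitions in [2, 4]. The 3.0m success threshold is the standard R2R value, stated here as an assumption; SPL weights success by the ratio of shortest-path length to the greater of the taken and shortest path lengths:

```python
# Sketch of success rate (SR) and success weighted by inverse path length
# (SPL) for a single episode.

def success(nav_error, threshold=3.0):
    """1.0 if the agent stopped within `threshold` meters of the goal."""
    return 1.0 if nav_error <= threshold else 0.0

def spl(nav_error, shortest_len, taken_len, threshold=3.0):
    """Success weighted by shortest_len / max(shortest_len, taken_len)."""
    s = success(nav_error, threshold)
    return s * shortest_len / max(shortest_len, taken_len)
```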
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-248",
"text": "For full details on these metrics, see [2, 4, 18] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-249",
"text": "Implementation Details."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-250",
"text": "We utilize the Adam optimizer [15] with a learning rate of 2.5 \u00d7 10^{\u22124} and a batch size of 5 full trajectories."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-251",
"text": "We set the inflection weighting coefficient [30] to 3.2 (the inverse frequency of inflections in our ground-truth paths)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-252",
"text": "We train on all ground-truth paths until convergence on val-unseen (at most 30 epochs)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-253",
"text": "For DAgger [24], we collect the nth set by taking the oracle action with probability \u03b2 = 0.75^n and the current policy action otherwise."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-254",
"text": "We collect 5,000 trajectories at each stage and then perform 4 epochs of imitation learning (with inflection weighting) over all collected trajectories."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-255",
"text": "Once again, we train to convergence on val-unseen (6 to 10 dataset collections, depending on the model)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-256",
"text": "We implement our agents in PyTorch [22] and on top of Habitat [19] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-257",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-258",
"text": "**ESTABLISHING BASELINE PERFORMANCE FOR VLN-CE**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-259",
"text": "No-Learning Baselines."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-260",
"text": "To establish context for our results, we consider random and hand-crafted agents shown in Tab. 2 (top two rows)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-286",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-287",
"text": "**MODEL PERFORMANCE IN VLN-CE**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-288",
"text": "Tab. 3 shows a comparison of our models (Seq2Seq and Cross-Modal) under three training augmentations (Progress Monitor, DAgger, Data Augmentation)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-290",
"text": "Table 3. Performance in VLN-CE."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-292",
"text": "We find that popular techniques in VLN have mixed benefit in VLN-CE; however, our best performing model, combining all examined techniques, succeeds nearly one-third of the time in new environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-293",
"text": "* denotes fine-tuning."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-294",
"text": "(Table columns: PM [17], DA [24], Aug. [26]; Val-Seen and Val-Unseen splits.) Training Augmentation."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-295",
"text": "We find DAgger-based training impactful for both the Seq2Seq (row 1 vs. 3) and Cross-Modal (row 6 vs. 8) models -improving by 0.03-0.05 SPL in val-unseen."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-296",
"text": "Contrary to findings in prior work, we observe negative effects from the progress monitor auxiliary loss and from data augmentation for both models (rows 2/4 and 7/9), dropping 0.01-0.03 SPL from standard training (rows 1/6)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-297",
"text": "Despite this, we find combining all three techniques to lead to significant performance gains for the cross-modal attention model (row 10)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-298",
"text": "Specifically, we pretrain with imitation learning, data augmentation, and the progress monitoring loss, then fine-tune using DAgger (with \u03b2 = 0.5^{n+1}) on the original data."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-299",
"text": "This Cross-Modal Attention PM+DA*+Aug model achieves an SPL of 0.35 on val-seen and 0.30 on val-unseen, succeeding on 32% of episodes in new environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-300",
"text": "We explore this trend further for the Cross-Modal model."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-301",
"text": "We examine the validation performance of PM+Aug (row 11) and find it to outperform Aug or PM alone (by 0.02-0.03 SPL)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-302",
"text": "Next, we examine progress monitor loss on val-unseen for both PM and PM+Aug."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-303",
"text": "We find that without data augmentation, the progress monitor over-fits considerably more (validation loss of 0.67 vs. 0.47) -indicating that the progress monitor can be effective in our continuous setting but tends to over-fit on the non-augmented training data, negatively affecting generalization."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-304",
"text": "Finally, we examine the performance of DA*+Aug (row 12) and find that this outperforms DA (by 0.01-0.02 SPL), but is unable to match pre-training with the progress monitor and augmented data (row 10)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-305",
"text": "Qualitative Examples."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-306",
"text": "We examine two qualitative examples of our Cross-Modal Attention (PM+DA*+Aug) model in unseen environments (Fig. 5)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-307",
"text": "The top example shows the agent successfully following the instruction and demonstrates the increased difficulty of VLN-CE (62 actions vs. 3 hops in VLN)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-308",
"text": "Phrases like \"turn left, and enter the hallway\" present an additional challenge in VLN-CE, as the agent must turn left an unknown number of times until it sees the hallway."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-309",
"text": "The second example shows a failure of the agent -it navigates towards the wrong windows and fails to first \"pass the kitchen\" -stopping instead at the nearest couch."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-310",
"text": "We also observe failures when the agent never sees the object(s) referred to by the instruction in the scene -with a limited egocentric field-of-view, the agent must actively choose to observe the surrounding scene."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-311",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-312",
"text": "**EXAMINING THE IMPACT OF THE NAV-GRAPH IN VLN**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-313",
"text": "To draw a direct comparison between the VLN and VLN-CE settings, we convert trajectories taken by our Cross-Modal Attention (PM+DA*+Aug) model in continuous environments to nav-graph trajectories (details in the supplement) and then evaluate these paths on the VLN leaderboard."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-314",
"text": "We emphasize that the point of this comparison is not to outperform existing approaches for VLN, but rather to highlight how important the nav-graph is to the performance of existing VLN systems by contrasting them with our model."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-315",
"text": "Unlike the approaches shown, our model does not benefit from the nav-graph during training or inference."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-316",
"text": "As shown in Tab. 4, we find significant gaps between our model and prior work in the VLN setting."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-318",
"text": "Despite having similar cross-modal attention architectures, RCM [29] achieves an SPL of 0.38 in test environments while our model yields a substantially lower score."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-319",
"text": "Table 4. Comparison on the VLN validation and test sets with existing models."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-320",
"text": "Note there is a significant gap between techniques that leverage the oracle nav-graph at training and inference (top set) and our best method in continuous environments."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-321",
"text": "However, it is unclear if these gains could be realized on a real system given the strong assumptions set by the nav-graph."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-322",
"text": "In contrast, our approach does not rely on external information and recent work has shown promising sim2real transferability for navigation agents trained in continuous simulations [14] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-323",
"text": "Caveats."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-324",
"text": "Making direct comparisons between drastically different settings is challenging, so we note some caveats."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-325",
"text": "Approximately 20% of VLN trajectories are non-navigable in the continuous environments (and thus excluded in VLN-CE)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-326",
"text": "By default, our models cannot succeed on these."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-327",
"text": "Further, continuous VLN-CE paths can translate poorly to nav-graph trajectories when traversing areas of the environment not well-covered by the sparse panoramas."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-328",
"text": "Comparing VLN-CE val results in Tab. 3 with the same in Tab. 4 shows these effects account for a drop of approximately 10 SPL."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-331",
"text": "Even compensating for this possible underestimation, nav-graph-based approaches still outperform our continuous models significantly."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-332",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-333",
"text": "**DISCUSSION**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-334",
"text": "In this work, we explore the problem of following navigation instructions in continuous environments with low-level actions -lifting many of the unrealistic assumptions in prior nav-graph-based settings."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-335",
"text": "In models presented here, we took an approach where observations were mapped directly to low-level control in an end-to-end manner; however, exploring modular approaches is exciting future work."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-336",
"text": "For instance, the learned agent could pass directives to a motion controller."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-337",
"text": "Crucially, setting our VLN-CE task in continuous environments (rather than a nav-graph) provides the community with a testbed where this sort of integrative experiment studying the interface of high- and low-level control is possible."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-263",
"text": "No-learning baselines and input-modality ablations for our baseline sequence-to-sequence model."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-264",
"text": "Given the long trajectories involved, we find both random agents and single-modality ablations to perform quite poorly in VLN-CE."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-265",
"text": "----------------------------------"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-266",
"text": "**VAL-SEEN VAL-UNSEEN**"
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-267",
"text": "(Table columns: Model, Vision, Instr., History.) The random agent selects actions according to the train-set action distribution (68% forward, 15% turn-left, 15% turn-right, and 2% stop)."
},
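The random baseline reduces to sampling from the quoted train-set action distribution; a minimal sketch (the thresholds simply accumulate the 68/15/15/2 percentages):

```python
import random

# Sketch of the random baseline agent: sample each low-level action from the
# train-set action distribution (68% forward, 15% turn-left, 15% turn-right,
# 2% stop).

def random_agent_action(rng=random):
    r = rng.random()
    if r < 0.68:
        return "forward"
    if r < 0.83:
        return "turn-left"
    if r < 0.98:
        return "turn-right"
    return "stop"
```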
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-269",
"text": "The hand-crafted agent picks a random heading and takes 37 forward actions (average trajectory length) before calling stop."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-270",
"text": "Despite having no learned components and processing no input, both of these agents achieve approximately 3% success rates in val-unseen."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-271",
"text": "In contrast, a similar hand-crafted random-heading-and-forward model in VLN yields a 16.3% success rate [4] ."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-272",
"text": "Though not directly comparable, this gap illustrates the strong structural prior provided by the nav-graph in VLN."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-273",
"text": "Seq2Seq and Single-Modality Ablations."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-274",
"text": "Tab. 2 also shows performance for the baseline Seq2Seq model along with input ablations."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-275",
"text": "All models are trained with imitation learning without data augmentation or any auxiliary losses."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-276",
"text": "Our baseline Seq2Seq model significantly outperforms the random and hand-crafted baselines, successfully reaching the goal in 20% of val-unseen episodes."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-277",
"text": "As illustrated in [27] , models examining only single modalities can be very strong baselines in embodied tasks."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-278",
"text": "We train models without access to the instruction (No Instruction) and with ablated visual input (No Vision/Depth/Image)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-279",
"text": "All of these ablations under-perform the Seq2Seq baseline."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-280",
"text": "We find that depth is a very strong signal for learning, with models lacking it (No Depth and No Vision) failing to outperform chance (\u22641% success rates)."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-281",
"text": "We believe that depth enable agents to quickly begin traversing environments effectively (e.g. without collisions) and without this it is very difficult to bootstrap to instruction following."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-282",
"text": "With a success rate of 17%, the No Instruction model performs similarly to a hand-crafted agent in VLN, suggesting shared trajectory regularities between VLN and VLN-CE."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-283",
"text": "While these regularities can be manually exploited in VLN via the nav-graph, they are implicit in VLN-CE as evidenced by the significantly lower performance of our random and hand crafted agents which collide with and get stuck on obstacles."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-284",
"text": "The No Image model also achieves 17% success, similarly failing to reason about instructions."
},
{
"sent_id": "4f646eceef2e5fc447a367488b6aaf-C001-285",
"text": "This hints at the importance of grounding visual referents (through RGB) for navigation."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"4f646eceef2e5fc447a367488b6aaf-C001-10"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-14"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-81"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-86"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-133"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-139"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-217"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-229"
]
],
"cite_sentences": [
"4f646eceef2e5fc447a367488b6aaf-C001-10",
"4f646eceef2e5fc447a367488b6aaf-C001-14",
"4f646eceef2e5fc447a367488b6aaf-C001-81",
"4f646eceef2e5fc447a367488b6aaf-C001-86",
"4f646eceef2e5fc447a367488b6aaf-C001-133",
"4f646eceef2e5fc447a367488b6aaf-C001-139",
"4f646eceef2e5fc447a367488b6aaf-C001-217",
"4f646eceef2e5fc447a367488b6aaf-C001-229"
]
},
"@USE@": {
"gold_contexts": [
[
"4f646eceef2e5fc447a367488b6aaf-C001-47"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-113"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-246"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-248"
]
],
"cite_sentences": [
"4f646eceef2e5fc447a367488b6aaf-C001-47",
"4f646eceef2e5fc447a367488b6aaf-C001-113",
"4f646eceef2e5fc447a367488b6aaf-C001-246",
"4f646eceef2e5fc447a367488b6aaf-C001-248"
]
},
"@DIF@": {
"gold_contexts": [
[
"4f646eceef2e5fc447a367488b6aaf-C001-119"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-126"
]
],
"cite_sentences": [
"4f646eceef2e5fc447a367488b6aaf-C001-119",
"4f646eceef2e5fc447a367488b6aaf-C001-126"
]
},
"@SIM@": {
"gold_contexts": [
[
"4f646eceef2e5fc447a367488b6aaf-C001-123"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-177",
"4f646eceef2e5fc447a367488b6aaf-C001-178",
"4f646eceef2e5fc447a367488b6aaf-C001-179"
],
[
"4f646eceef2e5fc447a367488b6aaf-C001-229",
"4f646eceef2e5fc447a367488b6aaf-C001-230"
]
],
"cite_sentences": [
"4f646eceef2e5fc447a367488b6aaf-C001-123",
"4f646eceef2e5fc447a367488b6aaf-C001-179",
"4f646eceef2e5fc447a367488b6aaf-C001-229"
]
},
"@MOT@": {
"gold_contexts": [
[
"4f646eceef2e5fc447a367488b6aaf-C001-217",
"4f646eceef2e5fc447a367488b6aaf-C001-218"
]
],
"cite_sentences": [
"4f646eceef2e5fc447a367488b6aaf-C001-217"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"4f646eceef2e5fc447a367488b6aaf-C001-271"
]
],
"cite_sentences": [
"4f646eceef2e5fc447a367488b6aaf-C001-271"
]
}
}
},
"ABC_d1dce63d89e8cfc73962413734bf7b_2": {
"x": [
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-94",
"text": "**SEMANTIC SPECIALIZATION**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-184",
"text": "**MODEL**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-185",
"text": "Results and Analysis."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-186",
"text": "The results on the German and Italian DST task are summarized in Table 2 ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-2",
"text": "Semantic specialization integrates structured linguistic knowledge from external resources (such as lexical relations in WordNet) into pretrained distributional vectors in the form of constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-3",
"text": "However, this technique cannot be leveraged in many languages, because their structured external resources are typically incomplete or non-existent."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-4",
"text": "To bridge this gap, we propose a novel method that transfers specialization from a resource-rich source language (English) to virtually any target language."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-5",
"text": "Our specialization transfer comprises two crucial steps: 1) Inducing noisy constraints in the target language through automatic word translation; and 2) Filtering the noisy constraints via a state-of-the-art relation prediction model trained on the source language constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-6",
"text": "This allows us to specialize any set of distributional vectors in the target language with the refined constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-7",
"text": "We prove the effectiveness of our method through intrinsic word similarity evaluation in 8 languages, and with 3 downstream tasks in 5 languages: lexical simplification, dialog state tracking, and semantic textual similarity."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-8",
"text": "The gains over the previous state-of-art specialization methods are substantial and consistent across languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-9",
"text": "Our results also suggest that the transfer method is effective even for lexically distant source-target language pairs."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-10",
"text": "Finally, as a by-product, our method produces lists of WordNet-style lexical relations in resource-poor languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-13",
"text": "Due to their dependence on the distributional hypothesis (Harris, 1954) , that is, word co-occurrence information in large corpora, distributional word embeddings (Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Melamud et al., 2016; Bojanowski et al., 2017; Peters et al., 2018 , inter alia) conflate paradigmatic relations (e.g., synonymy, antonymy, lexical entailment, cohyponymy, meronymy) and the broader topical (i.e., syntagmatic) relatedness (Schwartz et al., 2015; ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-14",
"text": "This property can propagate undesired effects to language understanding applications such as statistical dialog modeling or text simplification (Faruqui, 2016; Chiu et al., 2016; : for instance, the inability to distinguish between synonymy and antonymy (e.g., between cheap pubs and expensive restaurants) can break task-oriented dialog or a recommendation system (Mrk\u0161i\u0107 et al., 2016; Kim et al., 2016b) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-15",
"text": "Semantic specialization techniques are therefore leveraged to stress a relation of interest such as semantic similarity (Wieting et al., 2015; Ponti et al., 2018) or lexical entailment (Nguyen et al., 2017; over other types of semantic association in the word vector space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-16",
"text": "The best-performing specialization models (cf."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-17",
"text": "Ponti et al. 2018) are executed as vector space post-processors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-18",
"text": "In short, these techniques force the distributional vectors to conform to external linguistic constraints (e.g., synonymy, meronymy, lexical entailment) extracted from structured external resources (e.g., WordNet, BabelNet) to emphasize the particular relation."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-19",
"text": "As post-processors they are applicable to any input distributional space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-20",
"text": "A critical requirement for all specialization techniques is the set of linguistic constraints drawn from the curated external semantic resource."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-21",
"text": "Such resources contain incomplete information even in resource-rich languages (e.g., English WordNet), while the resources are scarcer or even non-existent for many other languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-22",
"text": "A solution was proposed recently to deal with incomplete information in a resource-rich language: the specialization function learned on the subset of words observed in the external resource gets propagated to the entire vocabulary in a step called post-specialization )."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-23",
"text": "Yet, another fundamental question concerning specialization techniques is still unresolved: how to enable specialization in virtually any language, even when the language completely lacks external lexical resources?"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-24",
"text": "In this work, we therefore propose a novel approach for cross-lingual specialization transfer based on Lexical Relation Induction (CLSRI)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-25",
"text": "CLSRI leverages lexical information from a resource-rich language to enable specialization in any target language, without observing a single lexical constraint in the target language."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-26",
"text": "The transfer method consists of two main steps: 1) We induce a noisy set of constraints in the target language through automatic word translation via a shared cross-lingual word vector space Joulin et al., 2018) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-27",
"text": "2) To mitigate the noise from the translation process, the initial set of noisy constraints is then refined in a relation prediction phase: we adjust a state-of-the-art neural method for lexical relation classification (Glava\u0161 and Vuli\u0107, 2018a) and use it to predict the validity of each noisy constraint obtained in the first step."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-28",
"text": "Finally, a standard specialization technique (including the post-specialization step) can then be used monolingually in the target language, starting from the set of refined target language constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-29",
"text": "We verify the usefulness of our specialization transfer method in the intrinsic word similarity task for 8 target languages, followed by 3 downstream tasks in 5 languages: lexical simplification, dialog state tracking, and semantic textual similarity."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-30",
"text": "We observe large improvements over purely distributional word vectors for all target languages and in all tasks."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-31",
"text": "Moreover, we show that the proposed specialization transfer method consistently outperforms the direct specialization transfer based on the composition of the crosslingual projection and the post-specialization function (Ponti et al., 2018) , with substantial gains across all experimental setups."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-32",
"text": "In order to boost the integration of external lexical knowledge into distributional models beyond English, we will release our code and lists of WordNet-style lexical relations generated by our transfer method for all target languages at: https://github.com/ cambridgeltl/xling-postspec."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-34",
"text": "**RELATED WORK**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-35",
"text": "Conflating distinct (both paradigmatic and syntagmatic) lexico-semantic relations is a well-known property of distributional word vectors; semantic specialization of such spaces for a particular lexicosemantic relation (e.g., semantic similarity or lexical entailment) benefits a number of tasks, e.g., dialog state tracking Ponti et al., 2018) , spoken language understanding (Kim et al., 2016b,a) , text simplification (Glava\u0161 and Vuli\u0107, 2018b; Ponti et al., 2018) , and cross-lingual transfer of resources (Vuli\u0107 et al., 2017a) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-36",
"text": "Specialization methods inject external lexical knowledge into a distributional space, tailoring vectors for a particular relation of interest."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-37",
"text": "Joint specialization models (Yu and Dredze, 2014; Xu et al., 2014; Kiela et al., 2015; Liu et al., 2015; Ono et al., 2015; Osborne et al., 2016; Nguyen et al., 2017, inter alia) use external constraints to modify the training objective of word embedding models (Mikolov et al., 2013; Dhillon et al., 2015; Liu et al., 2018b,a) and train specialized vectors from scratch."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-38",
"text": "In contrast, retrofitting (also known as postprocessing) methods tune the pre-trained distributional vectors post-hoc based on the provided external constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-39",
"text": "Despite the fact that joint models specialize the entire space, whereas the first generation of retrofitting models specializes only the vectors of words seen in lexical constraints, the latter yield better downstream performance (Mrk\u0161i\u0107 et al., 2016) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-40",
"text": "Moreover, while the joint models are tightly coupled to a concrete word embedding objective, retrofitting models can be applied on top of any distributional vector space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-41",
"text": "Post-specialization Ponti et al., 2018; Kamath et al., 2019) is a generalization of retrofitting that specializes the entire distributional space: 1) it learns a global specialization function using before-and after-retrofitting vectors of words from lexical constraints as training examples and 2) it applies the global specialization functions to vectors of words unseen in lexical constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-42",
"text": "Similar to retrofitting, post-specialization can be applied to any vector space, but also (like joint specialization models) specializes the full distributional space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-43",
"text": "Since it learns a global and explicit specialization function, post-specialization can be used for cross-lingual specialization transfer."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-44",
"text": "Assuming a shared cross-lingual embedding space , a post-specialization function induced on the source language subspace can be directly applied to the target language sub- Figure 1 : High-level illustration of our CLSRI framework for semantic specialization."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-45",
"text": "Step 1: a network of lexical relations in a source language (red dots, left) is translated into a target language (blue dots, right) through a shared vector space (center)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-46",
"text": "Step 2: a lexical relation classifier (center) trained on vector pairs sampled from the source language (left) prunes the constraints in the noisy target network (right)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-47",
"text": "Step 3: the refined constraints are used to attract or repel the corresponding vectors (golden edges, left); this transformation is learned by a deep feed-forward network (center) and applied to the full target vector space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-48",
"text": "space (Glava\u0161 and Vuli\u0107, 2018b; Ponti et al., 2018) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-49",
"text": "In this work, we propose a different approach: we use a shared cross-lingual space to (noisily) translate lexical constraints from source to target language, and then use a relation-prediction model (trained on the source language constraints) to filter out the invalid target language constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-50",
"text": "This allows for monolingual application of retrofitting or post-specialization in the target language."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-51",
"text": "Our experiments show that the proposed specialization transfer via lexical relation induction (CLSRI) outperforms the previous state-of-the-art specialization transfer method of Ponti et al. (2018)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-52",
"text": "3 Methodology CLSRI in a Nutshell."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-53",
"text": "In cross-lingual semantic specialization our goal is to fine-tune the distributional vectors of a target language L t leveraging structured knowledge in the form of lexical constraints, available only for a resource-rich source language L s ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-187",
"text": "Several findings emerge from the results."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-54",
"text": "To this end, we propose a two-step translate-and-refine procedure for the induction of target language constraints, described in \u00a7 3.1."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-55",
"text": "We first translate words in each L s constraint by retrieving their nearest neighbour in L t from a shared cross-lingual L s -L t embedding space ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-56",
"text": "Such a translation procedure will generate noisy constraints in the target language due to (1) imperfect word translation via the cross-lingual embedding space and (2) polysemy in L s and translation of incorrect senses of L s words."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-57",
"text": "We thus subsequently refine the noisy set of target constraints by having a state-of-the-art neural model for lexico-semantic relation prediction (Glava\u0161 and Vuli\u0107, 2018a) , trained on the L s constraints, discern valid from invalid L t constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-58",
"text": "Following that, we perform monolingual retrofitting and post-specialization in the target language L t , as outlined in \u00a7 3.2."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-59",
"text": "The L t distributional vectors can be specialized with the cleaned L t constraints using any off-the-shelf retrofitting model (Faruqui et al., 2015; Mrk\u0161i\u0107 et al., 2016; Lengerich et al., 2018, inter alia) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-60",
"text": "In this work we opt for the best-performing retrofitting model ATTRACT-REPEL (AR) Vuli\u0107 et al., 2017b) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-61",
"text": "AR specializes only the words seen in the cleaned L t constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-62",
"text": "As the final step, we generalize AR's specialization to the entire target vocabulary with a post-specialization model (Ponti et al., 2018) that learns the global specialization function from pairs of distributional and ARspecialized vectors of words from L t constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-63",
"text": "A visual summary of our transfer model is presented in Figure 1 ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-64",
"text": "Our proposed CLSRI specialization conceptually differs from an existing cross-lingual specialization transfer methodology (Ponti et al., 2018; Glava\u0161 and Vuli\u0107, 2018b) , in which the global specialization function is learned in the source language L s and then transferred directly to the target language L s via a shared cross-lingual embedding space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-66",
"text": "**INDUCTION AND REFINEMENT OF CONSTRAINTS**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-67",
"text": "Step 1: Constraint Translation."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-68",
"text": "Following the established methodology of , constraints drawn from external resources are usually split into two broad sets: 1) ATTRACT constraints couple words that should have similar representations (e.g., synonyms like complicated and complex or direct hyponym-hypernym pairs like parrot and bird); and 2) REPEL constraints indicate which word pairs should appear far-flung in the space (e.g., antonyms like ancient and recent)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-69",
"text": "Given a set A s of ATTRACT word pairs and a set R s of REPEL word pairs, each word pair (w l s , w r s ) from the vocabulary of the source language V s is automatically translated into the target language with vocabulary V t using a shared cross-lingual L s -L t word embedding space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-70",
"text": "We create the crosslingual space X CL by learning a linear map W CL that projects the distributional space of the target language X t to the distributional space X s of the source language, i.e., X CL = X s \u222a X t W CL ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-71",
"text": "We translate each word w s from each linguistic constraint in L s by looking for the nearest neighbour of its vector x s in the projected target space X t W CL ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-72",
"text": "We employ recently proposed Relaxed Cross-domain Similarity Local Scaling (RCSLS) model of Joulin et al. (2018) to learn the projection matrix W CL and induce the bilingual space X CL ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-232",
"text": "**CONCLUSION**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-73",
"text": "1 1 RCSLS substantially outperforms competing models on the task of bilingual lexicon induction as shown in a recent comparative study , and has been designed to optimize performance exactly on the word translation task."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-74",
"text": "Step 2: Cleaning Noisy Constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-75",
"text": "The L t constraints we obtain by translating L s constraints via a cross-lingual L 1 -L 2 embedding space are expected to be noisy (as validated later in \u00a7 5), i.e., a shared cross-lingual space obtained via a linear projection matrix is far from ideal."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-76",
"text": "The translations are going to be particularly noisy for pairs of distant languages for which the projection-based methods for inducing cross-lingual embedding spaces (including RCSLS) generally yield lower bilingual lexicon induction (BLI) performance (S\u00f8gaard et al., 2018; Joulin et al., 2018; ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-77",
"text": "In the next step, we therefore clean the noisy L t constraints obtained via this imperfect translation procedure."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-78",
"text": "To this end, we leverage the state-ofthe-art model for lexical relation prediction: the Specialization Tensor Model (STM) (Glava\u0161 and Vuli\u0107, 2018a) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-79",
"text": "STM is a neural model that predicts lexical relations for pairs of input distributional vectors based on multi-view projections of those vectors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-80",
"text": "Each slice of the STM's central specialization tensor specifies a different projection."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-81",
"text": "We modify the original N -ary STM classifier to now model binary classification, and train two instances of the model: one that predicts whether a pair of words represents a valid ATTRACT constraint (A-STM), and another that predicts valid REPEL constraints (R-STM)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-82",
"text": "We train both models with the training instances created from the clean L s constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-83",
"text": "Given a pair of vectors (x l , x r ) that corresponds to a clean linguistic constraint (w l s , w r s ) from A s (or R s ), each vector is transformed with k feedforward networks (FFNs) of the STM model."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-84",
"text": "The paired projections of the two vectors resulting from each FFN are scored with a parameterized biaffine product, producing k latent scores describing the nature of the relation between the input vectors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-85",
"text": "The k-dimensional latent feature vector is finally passed to a FFN, which performs binary classification."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-86",
"text": "2 The complete objective is summarized in Equation (1):"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-87",
"text": "where \u2295 stands for concatenation, and the output layer activations are denoted as \u03c3 for sigmoid and \u03c4 for tanh."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-88",
"text": "The pairs (x l , x r ) created from A s and R s constitute positive training instances for A-STM and R-STM, respectively."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-89",
"text": "For each classifier we couple each positive training instance with two types of negative training instances: (1) we create a negative instance by substituting a member of the pair (x l or x r ) with a randomly sampled vector from one of the other pairs in the same training batch; (2) we create a negative instance by randomly sampling a constraint from the opposing set of constraints, that is, we turn a constraint from A s into a negative example for R-STM, and, conversely, a constraint from R s into a negative training instance for A-STM."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-90",
"text": "We train the A-STM and R-STM models with training instances created from L s constraints and then use the trained model to predict the validity of the translated L t constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-91",
"text": "We retain only the subsets of L t constraints A t and R t deemed valid by A-STM and R-STM, respectively."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-92",
"text": "Vectors of L s words (during training) and vectors of L t words (at inference) are taken from the induced bilingual"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-95",
"text": "We can now directly feed A t and R t to any retrofitting model and (monolingually) specialize any distributional space in the target language."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-96",
"text": "We first run the state-of-the-art retrofitting model ATTRACT-REPEL (AR) with A t and R t constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-97",
"text": "AR however, specializes only the words present in A t and R t ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-98",
"text": "In the next step, we generalize AR's specialization to the full vocabulary V t with the state-of-the-art postspecialization model ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-99",
"text": "For completeness, we briefly summarize AR and the postspecialization model of ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-100",
"text": "Retrofitting with ATTRACT-REPEL."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-101",
"text": "Each constraint from A t and R t is used to fine-tune the distance between their corresponding vectors (x l , x r ) in the target L t distributional space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-102",
"text": "Let B A be a batch of vector pairs created from ATTRACT constraints A t and B R the batch of vector pairs created from REPEL constraints R t ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-103",
"text": "For each batch B A and each batch B R , we construct batches of corresponding negative pairs T A (B A ) and T R (B R ), containing new pairs of words sampled among those present in the batch of positive pairs."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-104",
"text": "In particular, half of the negative examples t l and t r for ATTRACT (or RE-PEL) pairs are chosen by retrieving the nearest (or farthest) neighbours to x l and x r , respectively, in terms of cosine similarity."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-105",
"text": "Another half are random negative examples."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-106",
"text": "AR minimizes an objective based on max-margin loss between positive pairs and their corresponding negative pairs."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-107",
"text": "More precisely, its objective has three loss components:"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-108",
"text": "The first component ensures that word pairs from each B A are drawn closer together than those in the corresponding T A up to a certain \"attract\" margin \u03b4 A :"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-109",
"text": "where \u03c4 (z) = max(0, z) is ramp function."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-110",
"text": "Analogously, Rep(B R , T R ) forces the vectors of words in B R pairs to be further away than the vectors of their corresponding T R pairs by a margin \u03b4 R ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-111",
"text": "Finally, P re(B A , B R ) is the regularization objective that preserves the useful semantic information from the distributional space by minimizing the Euclidean distance between original and changed vectors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-112",
"text": "3"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-113",
"text": "Post-Specialization."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-114",
"text": "By virtue of AR retrofitting, only the subset of vectors of L t words observed in the refined L t constraints are specialized."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-115",
"text": "The specialized subspace, however, contains useful information for propagating the specialization to the rest of the vocabulary V t (i.e., to the vectors of L t words unseen in A t and R t )."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-116",
"text": "Post-specialization aims to learn a global specialization function G : X t \u2208 R d \u2192 X t \u2208 R d that approximates the perturbation patterns of AR as captured by changes in vectors of seen words from A t and R t ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-117",
"text": "G is learned as a non-linear mapping between pairs (x i , y i ), where x i \u2208 X t is the distributional vector of a constraint word (from A t or R t ) and y i is its corresponding AR-specialized vector."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-118",
"text": "In line with Ponti et al. (2018), we implement this function as a deep feed-forward neural network with l hidden layers of size h and a final linear layer with weight W \u2208 R h\u00d7d ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-119",
"text": "We optimize model parameters \u03b8 G by minimizing a contrastive margin ranking loss with random confounders (Weston et al., 2011, inter alia) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-120",
"text": "The cosine similarity between a distributional vector transformed with G and the corresponding \"gold\" vector (i.e., the AR-specialized vector) is forced to be larger than the similarity between the former and each of k randomly sampled confounders by a margin \u03b4 MM :"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-121",
"text": "Once the global specialization transformation G is learned, it is applied to the whole distributional space of our target language: Y t = G \u03b8 G (X t )."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-122",
"text": "Note that with our proposed specialization approach CLSRI, we execute the retrofitting and post-specialization completely monolingually in the target language L t , on the automatically induced constraints in the target language."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-123",
"text": "In contrast, existing work (Glava\u0161 and Vuli\u0107, 2018b; Ponti et al., 2018) transfers the post-specialization function learned for the source language L s to the target language L t via a cross-lingual vector space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-124",
"text": "This fundamental design difference is illustrated in Figure 1 and empirically validated in \u00a75."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-126",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-127",
"text": "Lexical Constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-128",
"text": "The assortment of English constraints for specialization is the same as in prior work (Zhang et al., 2014; Ono et al., 2015; Ponti et al., 2018) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-129",
"text": "These constraints concern the lexical relations documented in WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-130",
"text": "Initially, they amount to 1,023,082 synonymy/ATTRACT word pairs and 380,873 antonymy/REPEL pairs, which cover 14.6% of the 200K most frequent English words, as found in the vocabulary of FASTTEXT vectors (Bojanowski et al., 2017) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-131",
"text": "The number of constraints is substantially reduced in the target languages 4 after the induction process from \u00a7 3.1, both after the rough translation and after the refinement via relation prediction."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-132",
"text": "The actual numbers are reported in Figure 2 ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-133",
"text": "Relation Prediction."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-134",
"text": "The STM model is trained with the Adam optimizer (Kingma and Ba, 2015), a learning rate of 0.0001, and a batch size of 48 (including negative examples) for a maximum of 10 iterations."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-135",
"text": "Early stopping was implemented based on the F 1 score on a development set comprising 5% of the source language constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-136",
"text": "The hidden layer dimensionality is 300, and we use k = 5 specialization sub-tensors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-137",
"text": "Regarding the quality of the STM predictions, the best models achieve an F 1 score of 81.4 on ATTRACT constraints, and an F 1 score of 66.9 on REPEL constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-138",
"text": "7 AR and Post-Specialization."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-139",
"text": "We retain the exact hyper-parameter configuration for ATTRACT-REPEL from the original work: \u03b4 A = 0.6, \u03b4 R = 0.0, \u03bb P = 10^-9."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-140",
"text": "Adagrad (Duchi et al., 2011) is employed to optimize the model parameters for 5 epochs, feeding batches of size |B A | = |B R | = 50, again as in prior work."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-141",
"text": "FASTTEXT is a variant of Skip-Gram with Negative Sampling (SGNS) that builds representations for each word's constituent character n-grams and sums them up to obtain the entire word's representation."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-142",
"text": "6 https://github.com/facebookresearch/fastText/tree/master/alignment Owing to the difference in the amount of supervision, the post-specialization model has partially non-overlapping configurations for the baseline model of Ponti et al. (2018) and our CLSRI model."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-143",
"text": "For both models, each of the l = 3 hidden layers of the feed-forward network is composed of h = 2,048 hidden units and is non-linearly activated by LeakyReLU (Maas et al., 2013)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-144",
"text": "We apply a dropout of 0.2 both in input and between hidden layers."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-145",
"text": "In Eq. (3), the margin \u03b4 MM = 1, and the negative examples amount to k = 25."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-146",
"text": "We use SGD with the learning rate lr = 0.1."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-147",
"text": "For the baseline, post-specialization is trained for 10 epochs, each consisting of 1 million mini-batches of 32 randomly sampled pairs."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-148",
"text": "For our CLSRI model, it is limited to 2 epochs of 200K iterations each."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-149",
"text": "8 Models in Comparison."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-150",
"text": "Finally, we summarize the main models benchmarked in \u00a75."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-151",
"text": "First, we evaluate the original Distributional vectors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-152",
"text": "X-PS refers to the baseline model of Ponti et al. (2018) based on direct cross-lingual post-specialization."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-153",
"text": "CLSRI-AR denotes the variant of our model based on constraint induction in L t after running the initial AR retrofitting, without post-specialization; CLSRI-PS refers to our full model with the post-specialization step."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-155",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-156",
"text": "We evaluate different specialization models across several target languages on the intrinsic word similarity task and three downstream language understanding tasks where distinguishing between true semantic similarity and conceptual relatedness is crucial: dialog state tracking, lexical simplification, and semantic textual similarity."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-157",
"text": "The choice of tasks has also been driven by the availability of standardized evaluation data in different languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-159",
"text": "**WORD SIMILARITY**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-160",
"text": "Evaluation Setup."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-161",
"text": "The intrinsic evaluation is based on a set of (true) word similarity benchmarks manually translated from (subsets of) the English SimLex and re-scored in the target languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-162",
"text": "8 For both models, the hyper-parameters are chosen with a grid search over the intervals h = {1024, 2048, 4096}, l = {2, 3}, lr = {0.1, 0.01, 0.001}, and optimizers in {Adam, SGD}, using a held-out dev set (10% of the constraints)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-163",
"text": "9 In contrast to other datasets like WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014), SimLex encourages scores to distinguish between pure semantic similarity (actual synonyms) and broad topical relatedness."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-164",
"text": "In particular, the benchmarks are collected from the work of Leviant and Reichart (2015) for German, Italian, and Russian (999 pairs), 10 from for Hebrew and Croatian (999 pairs), 11 from Venekoski and Vankka (2017) for Finnish (300), 12 from Mykowiecka et al. (2018) for Polish (999), 13 and from Ercan and Y\u0131ld\u0131z (2018) for Turkish (500)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-165",
"text": "14 We measure Spearman's \u03c1 rank correlation between the gold human-elicited word pair similarity scores and the cosine similarity of the corresponding word vectors retrieved from each vector space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-166",
"text": "Results and Analysis."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-167",
"text": "We summarize the results for word similarity in Table 1 ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-168",
"text": "The full CLSRI-PS model outperforms both the distributional vectors and the baseline method for cross-lingual specialization (Ponti et al., 2018) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-169",
"text": "In all languages but two (DE and RU), even the CLSRI-AR model without post-specialization is superior to both baselines, and the post-specialization step additionally improves the results, supporting the findings from prior work."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-170",
"text": "Crucially, the performance of CLSRI-PS remains strong even for distant language pairs (e.g., for EN-HE, EN-TR or EN-FI), whereas the X-PS baseline shows a drop in performance for such cases."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-171",
"text": "We suspect this is because the success of our CLSRI-PS method depends less on the quality of the underlying shared cross-lingual vector space, which is known to deteriorate for more distant language pairs (S\u00f8gaard et al., 2018)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-173",
"text": "**DIALOG STATE TRACKING**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-174",
"text": "A standard language understanding evaluation task used in prior work on semantic specialization (Ponti et al., 2018, inter alia) is dialog state tracking (DST) (Henderson et al., 2014)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-175",
"text": "A DST model is a fundamental building block of statistical modular dialogue systems (Young, 2010) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-176",
"text": "Its task is to maintain an estimate of the user's goals during a multi-turn conversation by updating the dialog belief state at each turn."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-177",
"text": "Distinguishing true similarity as captured in specialized word vectors from broader relatedness is crucial for DST to succeed: e.g., a dialog system for restaurant bookings should not confuse the western and the eastern part of town, or Thai and Japanese cuisine."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-178",
"text": "Evaluation Setup."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-179",
"text": "To be directly comparable to prior work when evaluating the effects of specialized word embeddings on DST, we rely on the Neural Belief Tracker (NBT) v2: it is a fully statistical DST model that operates solely on the basis of pretrained word vectors, which are pivotal to its performance."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-180",
"text": "15 Again following prior work, our evaluation data come from the multilingual Wizard-of-Oz (WOZ) dataset, which is available in two target languages: German and Italian."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-181",
"text": "It contains 1,200 dialogues split into training (600 dialogues), development (200), and test data (400)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-182",
"text": "We report the standard DST metric of joint goal accuracy: it refers to the proportion of dialog turns in which all of the user's goals were correctly identified."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-188",
"text": "First, as already confirmed in prior work (Ponti et al., 2018), vectors specialized for semantic similarity are indeed important for DST: we observe improvements with all specialized vectors."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-189",
"text": "The highest gains are observed with the full CLSRI-PS model."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-190",
"text": "15 Note that the original NBT framework in the English DST task has recently been surpassed by more intricate task-specific architectures (Zhong et al., 2018; Ren et al., 2018), but its lightweight design, coupled with its strong dependence on input word vectors, still makes it a convenient means to evaluate the effects of different specialization methods."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-191",
"text": "This confirms two main intuitions: 1) our proposed specialization transfer via lexical induction in the target language is more robust than the previous X-PS method of Ponti et al. (2018), and 2) the full-vocabulary post-specialization step is again useful, as the initial CLSRI-AR model cannot match the performance of CLSRI-PS."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-193",
"text": "**LEXICAL SIMPLIFICATION**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-194",
"text": "Lexical simplification (LS) aims to automatically replace complex words (i.e., specialized terms, words used less frequently and known to fewer speakers) with their simpler in-context synonyms: the simplified text must be grammatical and retain the meaning of the original text."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-195",
"text": "Lexical simplification critically depends on discerning semantic similarity from other types of semantic relatedness, as the meaning of the original text might not be preserved otherwise (e.g., \"The orange automobile crashed.\" vs. \"The orange wheel crashed.\")."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-196",
"text": "Evaluation Setup."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-197",
"text": "To evaluate the effects of similarity-based specialization on LS, we employ Light-LS (Glava\u0161 and \u0160tajner, 2015), a language-agnostic LS tool that makes simplifications based on word similarities in a given vector space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-198",
"text": "The quality of the similarity-based information encoded in the vector space is thus expected to directly correlate with the performance of Light-LS."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-199",
"text": "We use LS datasets for Italian (IT) (Tonelli et al., 2016) , Spanish (ES) (Saggion et al., 2015; Saggion, 2017) , and Portuguese (PT) (Hartmann et al., 2018) to evaluate the specialized spaces in those languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-200",
"text": "We rely on the standard LS evaluation metric of Accuracy (Horn et al., 2014; Glava\u0161 and \u0160tajner, 2015): it quantifies both the quality and frequency of replacements, as the number of correct simplifications divided by the total number of complex words."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-201",
"text": "Results and Analysis."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-202",
"text": "The results are reported in Table 3 ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-203",
"text": "As shown in previous work (Ponti et al., 2018), retrofitting (CLSRI-AR) and the cross-lingual post-specialization transfer (X-PS) are substantially better in the LS task than the original distributional space."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-204",
"text": "However, our full CLSRI-PS model results in substantial boosts in the LS task (13-17%) over the previous best reported scores of X-PS as well as over CLSRI-AR."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-206",
"text": "**SEMANTIC TEXT SIMILARITY**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-207",
"text": "Evaluation Setup."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-208",
"text": "Finally, we also carry out a downstream evaluation on the semantic textual similarity (STS) task."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-209",
"text": "The Arabic dataset constructed for SemEval-2017 Track 1 16 (Cer et al., 2017) consists of sentence pairs scored from 0 (semantic independence) to 5 (semantic equivalence)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-210",
"text": "We augment the training set with all the data for English (translated with Google Translate) from previous editions of the shared task."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-211",
"text": "To classify sentence pairs, we employ the CNN-HTCI model (Shao, 2017) ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-212",
"text": "Each sentence is encoded with a convolutional network into a hidden representation."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-213",
"text": "Then, the interaction between the pair of representations is evaluated as their element-wise multiplication and absolute difference."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-214",
"text": "A fully connected network takes this interaction as input, and infers the similarity score."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-215",
"text": "Results and Analysis."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-216",
"text": "We report the accuracy scores for the Arabic STS in Table 3 ."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-217",
"text": "Interestingly, for STS both X-PS and CLSRI-AR hurt the performance of the distributional baseline."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-218",
"text": "However, the full CLSRI-PS model still shows a substantial improvement over all baselines."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-219",
"text": "This again suggests its broad stability and effectiveness."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-220",
"text": "To empirically validate the importance of noisy constraint refinement (see \u00a7 3.1), we have also evaluated an ablated variant of CLSRI-PS without the refinement step: this model variant relies only on noisy translations of L s lexical constraints."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-221",
"text": "While this variant leads to improvements over the X-PS baseline across the board, it is consistently outperformed by the full CLSRI-PS model in downstream tasks: e.g., the gains with the full model are 2-3% in the LS task and 2% in the Arabic STS task."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-222",
"text": "Since the full CLSRI-PS model does not require any additional input for the lexical prediction step (i.e., it operates with the same set of L s constraints as the translation step), these results suggest that both steps should be applied for improved specialization in the target languages. 16 http://alt.qcri.org/semeval2017/task1/"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-223",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-224",
"text": "**FUTURE WORK**"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-225",
"text": "As a supplemental benefit of CLSRI, the constraints induced by translation and pruning hold promise to create WordNet-style resources for languages that lack structured linguistic knowledge."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-226",
"text": "While the relations extracted in this proof-of-concept paper do not cover the rich and expressive set of WordNet relations in its entirety, they are nonetheless sufficient to create parts of the core WordNet structure with synsets (synonyms) and lexical relations across synsets (antonyms) from scratch."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-227",
"text": "Furthermore, our method is amenable to extension to contextualized embeddings (Lauscher et al., 2019) and/or other WordNet lexical relations such as hypernymy and hyponymy."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-228",
"text": "In recent work, retrofitting and post-specialization procedures have been developed for lexical entailment (Kamath et al., 2019)."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-229",
"text": "These procedures can be easily adapted to the semantic specialization step presented in \u00a7 3.2, whereas constraint translation and refinement ( \u00a7 3.1) are relation-agnostic."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-230",
"text": "We will explore these directions in future work."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-231",
"text": "----------------------------------"
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-233",
"text": "We have proposed a new method for cross-lingual transfer of semantic specialization via induction of lexical constraints in a resource-poor target language."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-234",
"text": "We have verified its usefulness in intrinsic and extrinsic language understanding tasks and across a spectrum of target languages."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-235",
"text": "We report consistent improvements over previous state-of-the-art specialization methods."
},
{
"sent_id": "d1dce63d89e8cfc73962413734bf7b-C001-236",
"text": "Crucially, our method is robust to target languages that are distant from source languages, as its performance is consistent across all considered language pairs."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d1dce63d89e8cfc73962413734bf7b-C001-15"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-35"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-41"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-174"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-188"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-203"
]
],
"cite_sentences": [
"d1dce63d89e8cfc73962413734bf7b-C001-15",
"d1dce63d89e8cfc73962413734bf7b-C001-35",
"d1dce63d89e8cfc73962413734bf7b-C001-41",
"d1dce63d89e8cfc73962413734bf7b-C001-174",
"d1dce63d89e8cfc73962413734bf7b-C001-188",
"d1dce63d89e8cfc73962413734bf7b-C001-203"
]
},
"@DIF@": {
"gold_contexts": [
[
"d1dce63d89e8cfc73962413734bf7b-C001-31"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-51"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-64"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-122",
"d1dce63d89e8cfc73962413734bf7b-C001-123"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-152"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-168"
]
],
"cite_sentences": [
"d1dce63d89e8cfc73962413734bf7b-C001-31",
"d1dce63d89e8cfc73962413734bf7b-C001-51",
"d1dce63d89e8cfc73962413734bf7b-C001-64",
"d1dce63d89e8cfc73962413734bf7b-C001-123",
"d1dce63d89e8cfc73962413734bf7b-C001-152",
"d1dce63d89e8cfc73962413734bf7b-C001-168"
]
},
"@USE@": {
"gold_contexts": [
[
"d1dce63d89e8cfc73962413734bf7b-C001-62"
]
],
"cite_sentences": [
"d1dce63d89e8cfc73962413734bf7b-C001-62"
]
},
"@SIM@": {
"gold_contexts": [
[
"d1dce63d89e8cfc73962413734bf7b-C001-118"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-128"
],
[
"d1dce63d89e8cfc73962413734bf7b-C001-142"
]
],
"cite_sentences": [
"d1dce63d89e8cfc73962413734bf7b-C001-118",
"d1dce63d89e8cfc73962413734bf7b-C001-128",
"d1dce63d89e8cfc73962413734bf7b-C001-142"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"d1dce63d89e8cfc73962413734bf7b-C001-191"
]
],
"cite_sentences": [
"d1dce63d89e8cfc73962413734bf7b-C001-191"
]
}
}
},
"ABC_25e03048cd34685cec34754bdade4e_2": {
"x": [
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-2",
"text": "Generative models defining joint distributions over parse trees and sentences are useful for parsing and language modeling, but impose restrictions on the scope of features and are often outperformed by discriminative models."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-3",
"text": "We propose a framework for parsing and language modeling which marries a generative model with a discriminative recognition model in an encoder-decoder setting."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-4",
"text": "We provide interpretations of the framework based on expectation maximization and variational inference, and show that it enables parsing and language modeling within a single implementation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-5",
"text": "On the English Penn Treebank, our framework obtains competitive performance on constituency parsing while matching the state-of-the-art single-model language modeling score."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-6",
"text": "1"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-9",
"text": "Generative models defining joint distributions over parse trees and sentences are good theoretical models for interpreting natural language data, and appealing tools for tasks such as parsing, grammar induction and language modeling (Collins, 1999; Henderson, 2003; Titov and Henderson, 2007; Petrov and Klein, 2007; Dyer et al., 2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-10",
"text": "However, they often impose strong independence assumptions which restrict the use of arbitrary features for effective disambiguation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-11",
"text": "Moreover, generative parsers are typically trained by maximizing the joint probability of the parse tree and the sentence-an objective that only indirectly relates to the goal of parsing."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-12",
"text": "1 Our code is available at https://github.com/cheng6076/virnng.git."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-13",
"text": "At test time, these models require a relatively expensive recognition algorithm (Collins, 1999; Titov and Henderson, 2007) to recover the parse tree, but their parsing performance consistently lags behind that of their discriminative competitors (Nivre et al., 2007; Huang, 2008; Goldberg and Elhadad, 2010), which are directly trained to maximize the conditional probability of the parse tree given the sentence and for which linear-time decoding algorithms exist (e.g., for transition-based parsers)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-14",
"text": "In this work, we propose a parsing and language modeling framework that marries a generative model with a discriminative recognition algorithm in order to have the best of both worlds."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-15",
"text": "The idea of combining these two types of models is not new."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-16",
"text": "For example, Collins and Koo (2005) propose to use a generative model to generate candidate constituency trees and a discriminative model to rank them."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-17",
"text": "Sangati et al. (2009) follow the opposite direction and employ a generative model to re-rank the dependency trees produced by a discriminative parser."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-18",
"text": "However, previous work combines the two types of models in a goal-oriented, pipeline fashion, which lacks model interpretations and focuses solely on parsing."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-19",
"text": "In comparison, our framework unifies generative and discriminative parsers with a single objective, which connects to expectation maximization and variational inference in grammar induction settings."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-20",
"text": "In a nutshell, we treat parse trees as latent factors generating natural language sentences and parsing as a posterior inference task."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-21",
"text": "We showcase the framework using Recurrent Neural Network Grammars (RNNGs; Dyer et al. 2016 ), a recently proposed probabilistic model of phrase-structure trees based on neural transition systems."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-22",
"text": "Different from this work which introduces separately trained discriminative and generative models, we integrate the two in an auto-encoder which fits our training objective."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-23",
"text": "We show how the framework enables grammar induction, parsing and language modeling within a single implementation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-24",
"text": "On the English Penn Treebank, we achieve competitive performance on constituency parsing and a state-of-the-art single-model language modeling score."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-26",
"text": "**PRELIMINARIES**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-27",
"text": "In this section we briefly describe Recurrent Neural Network Grammars (RNNGs; Dyer et al. 2016 ), a top-down transition-based algorithm for parsing and generation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-28",
"text": "There are two versions of RNNG, one discriminative, the other generative."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-29",
"text": "We follow the original paper in presenting the discriminative variant first."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-30",
"text": "The discriminative RNNG follows a shift-reduce parsing algorithm that converts a sequence of words into a parse tree."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-31",
"text": "As in standard shift-reduce parsers, the RNNG uses a buffer to store unprocessed terminal symbols and a stack to store partially completed syntactic constituents."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-32",
"text": "At each timestep, one of the following three operations 2 is performed:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-33",
"text": "\u2022 NT(X) introduces an open non-terminal X onto the top of the stack, represented as an open parenthesis followed by X, e.g., (NP."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-34",
"text": "\u2022 SHIFT fetches the terminal in the front of the buffer and pushes it onto the top of the stack."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-35",
"text": "\u2022 REDUCE completes a subtree by repeatedly popping the stack until an open non-terminal is encountered."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-36",
"text": "The non-terminal is popped as well, after which a composite term representing the entire subtree is pushed back onto the top of the stack, e.g., (NP the cat)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-37",
"text": "The above transition system can be adapted with minor modifications to an algorithm that generates trees and sentences."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-38",
"text": "In generator transitions, there is no input buffer of unprocessed words but there is an output buffer for storing words that have been generated."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-39",
"text": "To reflect the change, the previous SHIFT operation is modified into a GEN operation defined as follows:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-40",
"text": "\u2022 GEN generates a terminal symbol and add it to the stack and the output buffer."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-42",
"text": "**METHODOLOGY**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-43",
"text": "Our framework unifies generative and discriminative parsers within a single training objective."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-44",
"text": "For illustration, we adopt the two RNNG variants introduced above with our customized features."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-45",
"text": "Our starting point is the generative model ( \u00a7 3.1), which allows us to make explicit claims about the generative process of natural language sentences."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-46",
"text": "Since this model alone lacks a bottom-up recognition mechanism, we introduce a discriminative recognition model ( \u00a7 3.2) and connect it with the generative model in an encoder-decoder setting."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-47",
"text": "To offer a clear interpretation of the training objective ( \u00a7 3.3), we first consider the parse tree as latent and the sentence as observed."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-48",
"text": "We then discuss extensions that account for labeled parse trees."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-49",
"text": "Finally, we present various inference techniques for parsing and language modeling within the framework ( \u00a7 3.4)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-51",
"text": "**DECODER (GENERATIVE MODEL)**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-52",
"text": "The decoder is a generative RNNG that models the joint probability p(x, y) of a latent parse tree y and an observed sentence x. Since the parse tree is defined by a sequence of transition actions a, we write p(x, y) as p(x, a)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-53",
"text": "3 The joint distribution p(x, a) is factorized into a sequence of transition probabilities and terminal probabilities (when actions are GEN), which are parametrized by a transitional state embedding u:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-54",
"text": "where I is an indicator function and u t represents the state embedding at time step t. Specifically, the conditional probability of the next action is:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-55",
"text": "where a t represents the action embedding at time step t, A the action space and b a the bias."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-56",
"text": "Similarly, the next word probability (when GEN is invoked) is computed as:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-57",
"text": "where W denotes all words in the vocabulary."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-58",
"text": "To satisfy the independence assumptions imposed by the generative model, u t uses only a restricted set of features defined over the output buffer and the stack -we consider p(a) as a context insensitive prior distribution."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-59",
"text": "Specifically, we use the following features: 1) the stack embedding d t which encodes the stack of the decoder and is obtained with a stack-LSTM (Dyer et al., 2015 (Dyer et al., , 2016 ; 2) the output buffer embedding o t ; we use a standard LSTM to compose the output buffer and o t is represented as the most recent state of the LSTM; and 3) the parent non-terminal embedding n t which is accessible in the generative model because the RNNG employs a depth-first generation order."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-60",
"text": "Finally, u t is computed as:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-61",
"text": "where Ws are weight parameters and b d the bias."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-63",
"text": "**ENCODER (RECOGNITION MODEL)**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-64",
"text": "The encoder is a discriminative RNNG that computes the conditional probability q(a|x) of the transition action sequence a given an observed sentence x. This conditional probability is factorized over time steps as:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-65",
"text": "where v t is the transitional state embedding of the encoder at time step t. The next action is predicted similarly to Equation (2), but conditioned on v t ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-66",
"text": "Thanks to the discriminative property, v t has access to any contextual features defined over the entire sentence and the stack -q(a|x) acts as a context sensitive posterior approximation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-67",
"text": "Our features 4 are: 1) the stack embedding e t obtained with a stack-LSTM that encodes the stack of the encoder; 2) the input buffer embedding i t ; we use a bidirectional LSTM to compose the input buffer and represent each word as a concatenation of forward and backward LSTM states; i t is the representation of the word on top of the buffer; 3) to incorporate more global features and a more sophisticated look-ahead mechanism for the buffer, we also use an adaptive buffer embedding\u012b t ; the latter is computed by having the stack embedding e t attend to all remaining embeddings on the buffer with the attention function in Vinyals et al. (2015) ; and 4) the parent non-terminal embedding n t ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-68",
"text": "Finally, v t is computed as follows:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-69",
"text": "where Ws are weight parameters and b e the bias."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-71",
"text": "**TRAINING**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-72",
"text": "Consider an auto-encoder whose encoder infers the latent parse tree and the decoder generates the observed sentence from the parse tree."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-73",
"text": "5 The maximum likelihood estimate of the decoder parameters is determined by the log marginal likelihood of the sentence:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-74",
"text": "We follow expectation-maximization and variational inference techniques to construct an evidence lower bound of the above quantity (by Jensen's Inequality), denoted as follows:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-75",
"text": "where p(x, a) = p(x|a)p(a) comes from the decoder or the generative model, and q(a|x) comes from the encoder or the recognition model."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-76",
"text": "The objective function 6 in Equation (8), denoted by L x , is unsupervised and suited to a grammar induction task."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-77",
"text": "This objective can be optimized with the methods shown in ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-78",
"text": "Next, consider the case when the parse tree is observed."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-79",
"text": "We can directly maximize the log likelihood of the parse tree for the encoder output log q(a|x) and the decoder output log p(a):"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-80",
"text": "This supervised objective leverages extra information of labeled parse trees to regularize the distribution q(a|x) and p(a), and the final objective is:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-81",
"text": "where L x and L a can be balanced with the task focus (e.g, language modeling or parsing)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-82",
"text": "5 Here, GEN and SHIFT refer to the same action with different definitions for encoding and decoding."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-83",
"text": "6 See \u00a7 4 and Appendix A for comparison between this objective and the importance sampler of Dyer et al. (2016"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-85",
"text": "**INFERENCE**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-86",
"text": "We consider two inference tasks, namely parsing and language modeling."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-87",
"text": "Parsing In parsing, we are interested in the parse tree that maximizes the posterior p(a|x) (or the joint p(a, x))."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-88",
"text": "However, the decoder alone does not have a bottom-up recognition mechanism for computing the posterior."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-89",
"text": "Thanks to the encoder, we can compute an approximated posterior q(a|x) in linear time and select the parse tree that maximizes this approximation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-90",
"text": "An alternative is to generate candidate trees by sampling from q(a|x), re-rank them with respect to the joint p(x, a) (which is proportional to the true posterior), and select the sample that maximizes the true posterior."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-91",
"text": "Language Modeling In language modeling, our goal is to compute the marginal probability p(x) = a p(x, a), which is typically intractable."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-92",
"text": "To approximate this quantity, we can use Equation (8) to compute a lower bound of the log likelihood log p(x) and then exponentiate it to get a pessimistic approximation of p(x)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-93",
"text": "7 Another way of computing p(x) (without lower bounding) would be to use the variational approximation q(a|x) as the proposal distribution as in the importance sampler of Dyer et al. (2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-94",
"text": "We discuss details in Appendix A."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-96",
"text": "**RELATED WORK**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-97",
"text": "Our framework is related to a class of variational autoencoders , which use neural networks for posterior approximation in variational inference."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-98",
"text": "This technique has been previously used for topic modeling and sentence compression Discriminative parsers Socher et al. (2013) 90.4 Zhu et al. (2013) 90."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-99",
"text": "4 Dyer et al. (2016) 91.7 Cross and Huang (2016) 89.9 Vinyals et al. (2015) 92.8"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-100",
"text": "Generative parsers Petrov and Klein (2007) 90."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-101",
"text": "1 Shindo et al. (2012) 92."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-102",
"text": "4 Dyer et al. (2016) 93.3"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-103",
"text": "This work argmax a q(a|x) 89.3 argmax a p(a, x) 90.1 Table 2 : Parsing results (F1) on the PTB test set. ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-104",
"text": "Another interpretation of the proposed framework is from the perspective of guided policy search in reinforcement learning (Bachman and Precup, 2015) , where a generative parser is trained to imitate the trace of a discriminative parser."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-105",
"text": "Further connections can be drawn with the importance-sampling based inference of Dyer et al. (2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-106",
"text": "There, a generative RNNG and a discriminative RNNG are trained separately; during language modeling, the output of the discriminative model serves as the proposal distribution of an importance sampler p(x) = E q(a|x)"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-107",
"text": "p(x,a) q(a|x) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-108",
"text": "Compared to their work, we unify the generative and discriminative RNNGs in a single framework, and adopt a joint training objective."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-110",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-111",
"text": "We performed experiments on the English Penn Treebank dataset; we used sections 2-21 for training, 24 for validation, and 23 for testing."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-112",
"text": "Following Dyer et al. (2015) , we represent each word in three ways: as a learned vector, a pretrained vector, and a POS tag vector."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-113",
"text": "The encoder word embedding is the concatenation of all three vectors while the decoder uses only the first two since we do not consider POS tags in generation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-114",
"text": "Table 1 presents details on the hyper-parameters we used."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-115",
"text": "To find the MAP parse tree argmax a p(a, x) (where p(a, x) is used rank the output of q(a|x)) and to compute the language modeling perplexity (where a \u223c q(a|x)), we collect 100 samples from q(a|x), same as Dyer et al. (2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-116",
"text": "Experimental results for constituency parsing and language modeling are shown in Tables 2 and 3, respectively."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-117",
"text": "As can be seen, the single framework we propose obtains competitive parsing performance."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-118",
"text": "Comparing the two inference KN-5 255.2 LSTM 113."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-119",
"text": "4 Dyer et al. (2016) 102.4 This work: a \u223c q(a|x) 99.8 Table 3 : Language modeling results (perplexity)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-120",
"text": "methods for parsing, ranking approximated MAP trees from q(a|x) with respect to p(a, x) yields a small improvement, as in Dyer et al. (2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-121",
"text": "It is worth noting that our parsing performance lags behind Dyer et al. (2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-122",
"text": "We believe this is due to implementation disparities, such as the modeling of the reduce operation."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-123",
"text": "While Dyer et al. (2016) use an LSTM as the syntactic composition function of each subtree, we adopt a rather simple composition function based on embedding averaging, which gains computational efficiency but loses accuracy."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-124",
"text": "On language modeling, our framework achieves lower perplexity compared to Dyer et al. (2016) and baseline models."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-125",
"text": "This gain possibly comes from the joint optimization of both the generative and discriminative components towards a language modeling objective."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-126",
"text": "However, we acknowledge a subtle difference between Dyer et al. (2016) and our approach compared to baseline language models: while the latter incrementally estimate the next word probability, our approach (and Dyer et al. 2016 ) directly assigns probability to the entire sentence."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-127",
"text": "Overall, the advantage of our framework compared to Dyer et al. (2016) is that it opens an avenue to unsupervised training."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-129",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-130",
"text": "We proposed a framework that integrates a generative parser with a discriminative recognition model and showed how it can be instantiated with RNNGs."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-131",
"text": "We demonstrated that a unified framework, which relates to expectation maximization and variational inference, enables effective parsing and language modeling algorithms."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-132",
"text": "Evaluation on the English Penn Treebank, revealed that our framework obtains competitive performance on constituency parsing and state-of-the-art results on single-model language modeling."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-133",
"text": "In the future, we would like to perform grammar induction based on Equation (8), with gradient descent and posterior regularization techniques (Ganchev et al., 2010 A Comparison to Importance Sampling (Dyer et al., 2016) In this appendix we highlight the connections between importance sampling and variational inference, thereby comparing our method with Dyer et al. (2016) ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-134",
"text": "Consider a simple directed graphical model with discrete latent variables a (e.g., a is the transition action sequence) and observed variables x (e.g., x is the sentence)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-135",
"text": "The model evidence, or the marginal likelihood p(x) = a p(x, a) is often intractable to compute."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-136",
"text": "Importance sampling transforms the above quantity into an expectation over a distribution q(a), which is known and easy to sample from:"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-137",
"text": "where q(a) is the proposal distribution and w(x, a) = p(x,a) q(a) the importance weight."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-138",
"text": "The proposal distribution can potentially depend on the observations x, i.e., q(a) q(a|x)."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-139",
"text": "A challenge with importance sampling lies in choosing a proposal distribution which leads to low variance."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-140",
"text": "As shown in Rubinstein and Kroese (2008) , the optimal choice of the proposal distribution is in fact the true posterior p(a|x), in which case the importance weight p(a,x) p(a|x) = p(x) is constant with respect to a. In Dyer et al. (2016) , the proposal distribution depends on x, i.e., q(a) q(a|x), and is computed with a separately-trained, discriminative model."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-141",
"text": "This proposal choice is close to optimal, since in a fully supervised setting a is also observed and the discriminative model can be trained to approximate the true posterior well."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-142",
"text": "We hypothesize that the performance of their importance sampler is dependent on this specific proposal distribution."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-143",
"text": "Besides, their training strategy does not generalize to an unsupervised setting."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-144",
"text": "In comparison, variational inference approach approximates the log marginal likelihood log p(x) with the evidence lower bound."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-145",
"text": "It is a natural choice when one aims to optimize Equation (11) directly: log p(x) = log a p(x, a) q(a) q(a)"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-146",
"text": "\u2265 E q(a) log p(x, a) q(a)"
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-147",
"text": "where q(a) is the variational approximation of the true posterior."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-148",
"text": "Again, the variational approximation can potentially depend on the observation x (i.e., q(a) q(a|x)) and can be computed with a discriminative model."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-149",
"text": "Equation (12) is a well-defined, unsupervised training objective which allows us to jointly optimize generative (i.e., p(x, a)) and discriminative (i.e., q(a|x)) models."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-150",
"text": "To further support the observed variable a, we augment this objective with supervised terms shown in Equation (10), following and ."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-151",
"text": "Equation (12) can be also used to approximate the marginal likelihood p(x) (e.g., in language modeling) with its lower bound."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-152",
"text": "An alternative choice without lower bounding is to use the variational approximation q(a|x) as the proposal distribution in importance sampling (Equation (11))."
},
{
"sent_id": "25e03048cd34685cec34754bdade4e-C001-153",
"text": "Ghahramani and Beal (2000) show that this proposal distribution leads to improved results of importance samplers."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"25e03048cd34685cec34754bdade4e-C001-9"
],
[
"25e03048cd34685cec34754bdade4e-C001-140"
]
],
"cite_sentences": [
"25e03048cd34685cec34754bdade4e-C001-9",
"25e03048cd34685cec34754bdade4e-C001-140"
]
},
"@MOT@": {
"gold_contexts": [
[
"25e03048cd34685cec34754bdade4e-C001-10",
"25e03048cd34685cec34754bdade4e-C001-11",
"25e03048cd34685cec34754bdade4e-C001-14",
"25e03048cd34685cec34754bdade4e-C001-9"
],
[
"25e03048cd34685cec34754bdade4e-C001-140",
"25e03048cd34685cec34754bdade4e-C001-141",
"25e03048cd34685cec34754bdade4e-C001-142"
]
],
"cite_sentences": [
"25e03048cd34685cec34754bdade4e-C001-9",
"25e03048cd34685cec34754bdade4e-C001-140"
]
},
"@USE@": {
"gold_contexts": [
[
"25e03048cd34685cec34754bdade4e-C001-21"
],
[
"25e03048cd34685cec34754bdade4e-C001-27"
],
[
"25e03048cd34685cec34754bdade4e-C001-59"
],
[
"25e03048cd34685cec34754bdade4e-C001-83"
],
[
"25e03048cd34685cec34754bdade4e-C001-93"
],
[
"25e03048cd34685cec34754bdade4e-C001-105"
],
[
"25e03048cd34685cec34754bdade4e-C001-115"
],
[
"25e03048cd34685cec34754bdade4e-C001-120"
],
[
"25e03048cd34685cec34754bdade4e-C001-133"
]
],
"cite_sentences": [
"25e03048cd34685cec34754bdade4e-C001-21",
"25e03048cd34685cec34754bdade4e-C001-27",
"25e03048cd34685cec34754bdade4e-C001-59",
"25e03048cd34685cec34754bdade4e-C001-83",
"25e03048cd34685cec34754bdade4e-C001-93",
"25e03048cd34685cec34754bdade4e-C001-105",
"25e03048cd34685cec34754bdade4e-C001-115",
"25e03048cd34685cec34754bdade4e-C001-120",
"25e03048cd34685cec34754bdade4e-C001-133"
]
},
"@DIF@": {
"gold_contexts": [
[
"25e03048cd34685cec34754bdade4e-C001-121"
],
[
"25e03048cd34685cec34754bdade4e-C001-123"
],
[
"25e03048cd34685cec34754bdade4e-C001-124"
],
[
"25e03048cd34685cec34754bdade4e-C001-126"
],
[
"25e03048cd34685cec34754bdade4e-C001-127"
]
],
"cite_sentences": [
"25e03048cd34685cec34754bdade4e-C001-121",
"25e03048cd34685cec34754bdade4e-C001-123",
"25e03048cd34685cec34754bdade4e-C001-124",
"25e03048cd34685cec34754bdade4e-C001-126",
"25e03048cd34685cec34754bdade4e-C001-127"
]
},
"@SIM@": {
"gold_contexts": [
[
"25e03048cd34685cec34754bdade4e-C001-126"
]
],
"cite_sentences": [
"25e03048cd34685cec34754bdade4e-C001-126"
]
}
}
},
"ABC_c4cc8d4013b0259eb626d06750e4ab_2": {
"x": [
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-2",
"text": "Most deep learning approaches for text-to-SQL generation are limited to the WikiSQL dataset, which only supports very simple queries."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-3",
"text": "Recently, template-based and sequence-to-sequence approaches were proposed to support complex queries, which contain join queries, nested queries, and other types."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-4",
"text": "However, Finegan-Dollak et al. (2018) demonstrated that both the approaches lack the ability to generate SQL of unseen templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-5",
"text": "In this paper, we propose a template-based one-shot learning model for the text-to-SQL generation so that the model can generate SQL of an untrained template based on a single example."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-6",
"text": "First, we classify the SQL template using the Matching Network (Vinyals et al., 2016) that is augmented by our novel architecture Candidate Search Network."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-7",
"text": "Then, we fill the variable slots in the predicted template using the Pointer Network (Vinyals et al., 2015) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-8",
"text": "We show that our model outperforms stateof-the-art approaches for various text-to-SQL datasets in two aspects: 1) the SQL generation accuracy for the trained templates, and 2) the adaptability to the unseen SQL templates based on a single example without any additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-11",
"text": "We focus on a text-to-SQL generation, the task of translating a question in natural language into the corresponding SQL."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-12",
"text": "Recently, various deep learning approaches have been proposed for the task."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-13",
"text": "However, most of these approaches target the WikiSQL dataset (Zhong et al., 2017 ) that only contains very simple and constrained queries (Xu et al., 2017; Yu et al., 2018; Dong and Lapata, 2018; Huang et al., 2018) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-14",
"text": "These approaches cannot be applied directly to generate complex queries containing elements such as join, group by, and nested queries."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-15",
"text": "Finegan-Dollak et al. (2018) proposed two different approaches to support complex queries: a template-based model and a sequence-to-sequence model."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-16",
"text": "However, both of these models have limitations."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-17",
"text": "The template-based model cannot generate queries of unobserved templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-18",
"text": "It requires a lot of examples and additional training to support new templates of SQL."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-19",
"text": "On the other hand, the sequence-to-sequence model is unstable because of the large search space including outputs with SQL syntax errors."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-20",
"text": "Moreover, Finegan-Dollak et al. (2018) demonstrated that the sequence-tosequence model also lack the ability to generate SQL queries of unseen templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-46",
"text": "**RELATED WORK**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-21",
"text": "In this work, we propose an extension of a template-based model with one-shot learning, which can generate SQL queries of untrained templates based on a single example."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-22",
"text": "Our model works in two phases."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-23",
"text": "The first phase classifies an SQL template."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-24",
"text": "We applied Matching Network (Vinyals et al., 2016) for the classification since it is robust to adapt to new SQL templates without additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-25",
"text": "However, as most of the oneshot learning methods, including Matching Network, focus on n-way classification setting, it cannot be directly applied to classify a label from a large number of classes."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-26",
"text": "Therefore, we introduce a novel architecture Candidate Search Network that picks the top-n most relevant SQL templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-27",
"text": "It enables the Matching Network to be utilized to find the most appropriate template among all possible templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-28",
"text": "The second phase fills the variable slots of the predicted template using a Pointer Network (Vinyals et al., 2015) as these variables are chosen from the tokens in the input sentence."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-29",
"text": "The proposed model has three advantages."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-30",
"text": "on the WikiSQL dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-31",
"text": "2. It minimizes unnecessary search space, unlike sequence-to-sequence approaches (Iyer et al., 2017; Finegan-Dollak et al., 2018) ; thus, the model is guaranteed to be free of SQL syntax errors."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-32",
"text": "3."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-33",
"text": "The model not only generates SQL of trained templates, but it can also adapt to queries of unseen templates based on a single example without additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-34",
"text": "Our approach has great strengths in terms of practical application."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-35",
"text": "To support the SQL queries of new templates, previous approaches require a number of natural language examples for each template and the retraining of the model."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-36",
"text": "In contrast, our model just needs a single example and no retraining."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-37",
"text": "Moreover, our model is not merely limited to generating SQL but can also be applied to the other code generation tasks (Oda et al., 2015; Ling et al., 2016b; Lin et al., 2018) by defining templates of code and variables for each template."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-38",
"text": "We conducted experiments with four different text-to-SQL datasets on both the question-based split and the query-based split (Finegan-Dollak et al., 2018)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-39",
"text": "In the question-based split, SQL queries of the same template appear in both training dataset and test dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-40",
"text": "With the question-based split, we tested the effectiveness of the model at generating queries for the trained templates of SQL."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-41",
"text": "In contrast, query-based split ensures that queries of the same template only appear in either training or test dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-42",
"text": "With the query-based split, we studied how well the model can adapt to new templates of SQL."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-43",
"text": "The experimental result shows that our approach outperforms the state-of-the-art approach by 3-9% for the question-based split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-44",
"text": "In addition, we achieved up to 52% performance gain for the query-based split in the one-shot setting without additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-47",
"text": "Semantic Parsing Semantic parsing is the task of mapping natural language utterances onto machine-understandable representations of meaning."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-48",
"text": "As a sub-task of semantic parsing, natural language to code generation aims to convert a natural language description to the executable code (Oda et al., 2015; Ling et al., 2016b; Lin et al., 2018 )."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-49",
"text": "To solve this task, a variety of deep learning approaches have been proposed."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-50",
"text": "Early works applied a sequence-to-sequence architecture that directly maps a natural language description to a sequence of the target code (Ling et al., 2016a; Jia and Liang, 2016) , but this approach does not guarantee syntax correctness."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-51",
"text": "To overcome this limitation, tree-based approaches such as sequence-to-tree (Dong and Lapata, 2016) and Abstract Syntax Tree (AST)-based models (Rabinovich et al., 2017) have been proposed to ensure syntax correctness."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-52",
"text": "However, Finegan-Dollak et al. (2018) showed that the sequence-to-tree approach was inefficient when generating complex SQL queries from a natural language question."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-53",
"text": "Very recently, Hayati et al. (2018) proposed retrieval-based neural code generation (ReCode), sharing a similar idea with our template-based approach."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-54",
"text": "They searched for similar sentences in the training dataset using a sentence similarity score and then extracted n-grams to build subtrees of the AST to be used at the decoding step."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-55",
"text": "In contrast, we introduced an end-to-end learning architecture to retrieve similar sentences in terms of SQL generation."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-56",
"text": "In addition, we do not need a decoding step or subtrees since we generate the full template at once via classification."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-57",
"text": "Text-to-SQL A natural language interface to databases (NLIDB) is a topic that has been actively studied for decades."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-58",
"text": "Early works proposed several rule-based approaches to parse natural language into SQL (Popescu et al., 2003; Li and Jagadish, 2014)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-59",
"text": "Because rule-based systems suffer from variations in natural language input, enhanced methods leveraging user-interaction have been proposed (Li and Jagadish, 2014; Yaghmazadeh et al., 2017) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-60",
"text": "Recently, the WikiSQL dataset (Zhong et al., 2017), a large dataset of natural language and SQL pairs, was released, and a number of studies have proposed deep-learning-based text-to-SQL approaches (Xu et al., 2017; Yu et al., 2018; Dong and Lapata, 2018; Huang et al., 2018)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-61",
"text": "However, as WikiSQL only contains simple SQL queries, most of these approaches are restricted to simple queries."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-62",
"text": "Iyer et al. (2017) and Finegan-Dollak et al. (2018) focused on datasets that contain more complex queries, such as ATIS (Dahl et al., 1994) and GeoQuery (Zelle and Mooney, 1996)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-63",
"text": "To support complex queries, Iyer et al. (2017) applied a sequence-to-sequence approach with an attention mechanism, and Finegan-Dollak et al. (2018) proposed a template-based model and another sequence-to-sequence model with a copy mechanism."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-64",
"text": "Figure 1: The architecture of our SQL template classification model."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-65",
"text": "We propose a Candidate Search Network (CSN) for the selection of the top-n relevant SQL templates within the candidate set C to build a support set S. Then, we find the template \u0177 using the Matching Network based on the support set S."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-66",
"text": "However, Finegan-Dollak et al. (2018) showed that both approaches lack the ability to generate SQL for templates unseen during training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-67",
"text": "One-shot Learning/Matching Network Deep learning models usually require hundreds or thousands of examples in order to learn a class."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-68",
"text": "To overcome this limitation, one-shot learning aims to learn a class from a single labeled example."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-69",
"text": "We applied one-shot learning to the text-to-SQL task so that our model could learn a SQL template from just a few examples and adapt easily and promptly to the SQL of untrained templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-70",
"text": "Vinyals et al. (2016) proposed a Matching Network that aims to train an end-to-end k-nearest neighbor (kNN) by combining feature extraction and a differentiable distance metric with cosine similarity."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-71",
"text": "It enables the model to produce test labels for unobserved classes given only a few samples without any network tuning."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-72",
"text": "However, the n-way classification setting used in the Matching Network cannot be directly applied to the general classification problem, because it fixes the number of target classes to a small number n by sampling from all possible classes."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-73",
"text": "Therefore, we introduced a novel architecture, the Candidate Search Network, which chooses the top-n most relevant classes from all classes to support the Matching Network."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-74",
"text": "Pointer Network Pointer Network (Vinyals et al., 2015) aims to predict an output sequence as probability distributions over the tokens in the input sequence."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-75",
"text": "It has been successfully applied to question answering (Wang and Jiang, 2016) , abstractive summarization (Nallapati et al., 2016) , and code generation (Yin and Neubig, 2017) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-76",
"text": "We adapted the Pointer Network to fill the variables of the predicted SQL template as these variables are chosen from the tokens in the input sentence."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-78",
"text": "**APPROACH**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-79",
"text": "Our approach works in two phases."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-80",
"text": "We first classify an SQL template for a given natural language question and then, we fill the variable slots of the predicted template."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-81",
"text": "This architecture is based on an idea similar to the template-based model of Finegan-Dollak et al. (2018) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-82",
"text": "However, the previous model requires a number of examples for each template and needs retraining to support new templates of SQL."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-83",
"text": "Conversely, we applied one-shot learning so that our model could learn a template with just a single example."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-84",
"text": "Moreover, our model does not require any additional training to support new SQL templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-86",
"text": "**SQL TEMPLATE CLASSIFICATION**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-87",
"text": "The SQL template classification model consists of two networks."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-88",
"text": "First, the Candidate Search Network chooses the top-n most relevant templates from a candidate set C to build a support set S. Then, the Matching Network predicts the SQL template based on the support set S. The overall architecture is depicted in Figure 1 ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-89",
"text": "Candidate Search Network We propose the Candidate Search Network (CSN) to apply the n-way classification setting of the Matching Network to the general classification problem."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-90",
"text": "First, we build a candidate set C = {(x_i^c, y_i^c)}_{i=1}^N, which comprises sample pairs of natural language questions and their labels (SQL templates), by sampling one example pair from each of the N classes in the training dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-91",
"text": "For a given test sample x\u0302, the CSN chooses the top-n items most relevant to x\u0302 from the candidate set C to build a support set S = {(x_i^s, y_i^s)}_{i=1}^n."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-92",
"text": "Since the Matching Network assumes that the support set is given, the CSN plays a key role in finding a SQL template among all possible templates via the Matching Network."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-93",
"text": "To build the CSN, we first trained a convolutional neural network (CNN) text classification model (Kim, 2014) with the training dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-94",
"text": "From this network, we took features from the last layer before the final classification layer in order to obtain feature vectors g(x\u0302) and {g(x_i^c)}_{i=1}^N."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-95",
"text": "Then, we choose the top-n items most similar to x\u0302, using the cosine similarity of the feature vectors, to build a support set S for x\u0302."
},
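The candidate-search step above (extract a feature vector, rank candidates by cosine similarity, keep the top-n) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: feature vectors are assumed to be lists of floats already produced by the CNN feature extractor, and `cosine_sim` / `build_support_set` are illustrative names.

```python
import math

def cosine_sim(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_support_set(g_query, candidates, n):
    # candidates: list of (feature_vector, template_label) pairs, one per
    # SQL template in the candidate set C. Returns the top-n pairs whose
    # features are most similar to the query feature vector g(x^).
    ranked = sorted(candidates,
                    key=lambda c: cosine_sim(g_query, c[0]),
                    reverse=True)
    return ranked[:n]
```

Because each template contributes exactly one candidate pair, the support set always contains n distinct templates, matching the n-way setting the Matching Network expects.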
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-96",
"text": "Matching Network A Matching Network consists of an encoder and an augmented memory."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-97",
"text": "Encoder f (\u00b7) embeds the natural language question as a fixed-size vector."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-98",
"text": "We used a CNN as our encoder."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-99",
"text": "It consists of convolutional layers with different window sizes, and a max-pooling operation is applied over each feature map."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-100",
"text": "The output of the encoder is the concatenated vector of each pooled feature."
},
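The encoder described above (convolutions over token windows of several sizes, max-pooling over time per feature map, then concatenation) can be sketched as follows. This is an illustrative simplification, assuming token embeddings are plain lists and each filter is a flat weight list, with no bias or nonlinearity; `cnn_encode` is not the authors' code.

```python
def cnn_encode(embeddings, filters):
    # embeddings: list of token vectors (each a list of floats).
    # filters: {window_size: [filter, ...]} where each filter is a flat
    # weight list of length window_size * embedding_dim.
    pooled = []
    for w, fs in sorted(filters.items()):
        for f in fs:
            # Convolve the filter over every window of w consecutive tokens.
            acts = []
            for i in range(len(embeddings) - w + 1):
                window = [x for tok in embeddings[i:i + w] for x in tok]
                acts.append(sum(a * b for a, b in zip(f, window)))
            pooled.append(max(acts))  # max-pooling over time
    return pooled  # concatenation of the pooled features
```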
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-101",
"text": "The augmented memory stores a support set S that is generated by the CSN."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-102",
"text": "For a given test example x\u0302, our classifier predicts a label \u0177 based on the support set S = {(x_i^s, y_i^s)}_{i=1}^n as follows: \u0177 = \u2211_{i=1}^{n} a(x\u0302, x_i^s) y_i^s,"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-103",
"text": "where a(\u00b7, \u00b7) is an attention function defined as follows: a(x\u0302, x_i) = exp(c(f(x\u0302), f(x_i))) / \u2211_{j=1}^{n} exp(c(f(x\u0302), f(x_j))),"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-104",
"text": "where c(\u00b7, \u00b7) denotes cosine similarity."
},
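The prediction rule above (softmax attention over cosine similarities between the encoded query and the encoded support examples, with the attention mass accumulated per label) can be illustrated with a small Python sketch. The encodings are assumed to be precomputed lists of floats; `matching_predict` is an illustrative name.

```python
import math
from collections import defaultdict

def cosine_sim(u, v):
    # c(., .): cosine similarity between encoded vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def matching_predict(f_query, support):
    # support: list of (encoded_vector, label) pairs chosen by the CSN.
    # Attention a(x^, x_i): softmax over cosine similarities.
    sims = [cosine_sim(f_query, f) for f, _ in support]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    attn = [e / z for e in exps]
    # Accumulate attention mass per template label; predict the argmax.
    scores = defaultdict(float)
    for a, (_, label) in zip(attn, support):
        scores[label] += a
    return max(scores, key=scores.get)
```

This is exactly an end-to-end soft k-nearest-neighbor rule: labels that appear several times in the support set pool their attention weights.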
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-105",
"text": "For training, we followed the n-way 1-shot training strategy."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-106",
"text": "We first sampled a label set L of size n from all N possible labels."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-107",
"text": "Then we sampled one example for each label in L to build the support set S. Finally, we sampled a number of examples for each label in L to build a training batch T to train the model."
},
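The n-way 1-shot episode construction described above (sample a label set L, one support example per label, and a batch of further examples per label) can be sketched as follows. The grouping of the dataset by label and the `batch_per_label` parameter are illustrative assumptions, not details from the paper.

```python
import random

def sample_episode(dataset_by_label, n, batch_per_label=2, rng=None):
    # dataset_by_label: {template_label: [question, ...]}, each label with
    # at least 1 + batch_per_label examples.
    rng = rng or random.Random()
    labels = rng.sample(sorted(dataset_by_label), n)  # label set L
    support, batch = [], []
    for y in labels:
        xs = rng.sample(dataset_by_label[y], 1 + batch_per_label)
        support.append((xs[0], y))                    # support set S
        batch.extend((x, y) for x in xs[1:])          # training batch T
    return support, batch
```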
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-108",
"text": "The training objective is to maximize the log-likelihood of the predicted labels in batch T, based on the support set S, as follows: \u03b8 = argmax_\u03b8 E_{L,S,T} [ \u2211_{(x,y)\u2208T} log P_\u03b8(y | x, S) ]."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-110",
"text": "**SLOT-FILLING**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-111",
"text": "We applied the Pointer Network (Vinyals et al., 2015) to fill the variable slots of the predicted SQL template, as described in Figure 2 ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-112",
"text": "We used a bi-directional LSTM as an input encoder and a uni-directional LSTM as an output decoder."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-113",
"text": "Let (x_1, ..., x_n) denote the tokens in a natural language question and (v_1, ..., v_m) denote the variables of the SQL template."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-114",
"text": "Then the encoder hidden states are (e_1, ..., e_n), while the decoder hidden states are (d_1, ..., d_m)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-115",
"text": "At each time step t in the decoding phase (for each variable v t ), we computed the attention vector as follows:"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-116",
"text": "u_t^i = v^T tanh(W_1 e_i + W_2 d_t), i \u2208 {1, ..., n}, (4) where W_1 and W_2 are trainable parameters."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-117",
"text": "Then we applied softmax to obtain the likelihood over the tokens in the input sentence as follows:"
},
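The pointer attention in equation (4), followed by the softmax over input tokens, can be sketched in plain Python. Matrices are lists of rows and vectors are lists of floats; `pointer_attention` and the helper names are illustrative, and the trained LSTM states are assumed given.

```python
import math

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pointer_attention(enc_states, dec_state, W1, W2, v):
    # u_t^i = v^T tanh(W1 e_i + W2 d_t), then softmax over input tokens.
    wd = matvec(W2, dec_state)
    scores = []
    for e in enc_states:
        h = [math.tanh(a + b) for a, b in zip(matvec(W1, e), wd)]
        scores.append(sum(vi * hi for vi, hi in zip(v, h)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]  # distribution over the input tokens
```

At each decoding step the variable slot is filled with the input token that receives the highest probability mass.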
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-118",
"text": "where y = (y_1, ..., y_m) is a sequence of indices, each between 1 and n. For the parameter set \u03c6 of the Pointer Network, the training objective is to maximize the log-likelihood of the predicted tokens for the given natural language input and the list of variables in the SQL template as follows:"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-119",
"text": "where D denotes training dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-121",
"text": "**Table 1 (columns): # questions, # vocabularies, # SQL templates, avg. # of variables**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-123",
"text": "**ADAPTATION**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-124",
"text": "Our model can adapt to a new SQL template with a single example, without additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-125",
"text": "Assume that there is a natural language to SQL template pair (x', y'), and that y' is a template unseen during training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-126",
"text": "We merely need to add the example pair (x', y') to the candidate set C to make our model applicable to the new template y'."
},
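Since adaptation is just an append to the candidate set held in memory, it can be shown in a few lines. This is an illustrative sketch: `adapt_to_new_template` is a hypothetical name, and the feature/label representation follows the earlier retrieval description.

```python
def adapt_to_new_template(candidate_set, example_question, new_template):
    # One-shot adaptation: no gradient update is performed. The new
    # template becomes reachable by the Candidate Search Network simply
    # because its single example pair (x', y') now lives in the set C.
    candidate_set.append((example_question, new_template))
    return candidate_set
```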
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-128",
"text": "**TRAINING AND INFERENCE**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-129",
"text": "Algorithm 1: Inference steps."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-130",
"text": "Required: the candidate set C, where N is the number of SQL templates; n, a hyperparameter for the support set size."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-131",
"text": "Input: a natural language question x\u0302. Output: the generated SQL query."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-132",
"text": "Our approach has three module parameter sets, \u03c0 (CSN), \u03b8 (Matching Network), and \u03c6 (Pointer Network)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-133",
"text": "We train these three modules independently."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-134",
"text": "As our Matching Network uses the same CNN architecture as the CSN, we initialize the Matching Network with the trained parameters of the CSN for efficient training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-135",
"text": "At the inference stage, we apply the three modules consecutively as described in Algorithm 1."
},
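The consecutive application of the three modules at inference time can be sketched as a small pipeline. The three callables are illustrative stand-ins for the trained CSN, Matching Network, and Pointer Network, and the slot-replacement scheme is an assumption about how the predicted variables fill the template.

```python
def generate_sql(question, candidate_set, n,
                 csn_search, match_template, fill_slots):
    # Algorithm 1 sketch: CSN -> Matching Network -> Pointer Network.
    support = csn_search(question, candidate_set, n)   # top-n support set S
    template = match_template(question, support)       # predicted template
    slot_values = fill_slots(question, template)       # variables by pointing
    sql = template
    for slot, value in slot_values.items():
        sql = sql.replace(slot, value)                 # fill variable slots
    return sql
```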
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-137",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-139",
"text": "**DATASET**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-140",
"text": "We used four different text-to-SQL datasets for experiments."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-141",
"text": "Advising (Finegan-Dollak et al., 2018) Collection of questions on a course information database at a university."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-142",
"text": "Questions with corresponding SQL were collected from a web page and by students, and augmented by paraphrasing with manual inspection."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-143",
"text": "Atis (Price, 1990; Dahl et al., 1994) Collection of questions on a flight booking system."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-144",
"text": "We used a SQL version of the dataset processed by Finegan-Dollak et al. (2018)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-145",
"text": "GeoQuery (Zelle and Mooney, 1996) Collection of questions on a US geography database."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-146",
"text": "We used a SQL version of the dataset processed by Finegan-Dollak et al. (2018)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-147",
"text": "Scholar (Iyer et al., 2017) Collection of questions on an academic publication database."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-148",
"text": "Questions were collected by the crowd, and initial corresponding SQL were automatically generated by the system and augmented with manual inspection."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-149",
"text": "The number of natural language questions, the vocabulary size, the number of SQL templates, and the average number of variables per SQL template for each dataset are listed in Table 1."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-150",
"text": "We used a template and variables for each SQL from the preprocessed versions provided by Finegan-Dollak et al. (2018) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-151",
"text": "For the question-based split, we used a 2:1:1 ratio for the train:dev:test split and ensured that every SQL template in the test set appeared at least once in the training set."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-152",
"text": "For the query-based split, we used the same split as in Finegan-Dollak et al. (2018) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-154",
"text": "**MODEL CONFIGURATION**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-155",
"text": "We used the same hyperparameters for every dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-156",
"text": "For the word embedding, we"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-157",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-158",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-159",
"text": "We evaluated the query generation accuracy for both the question-based split and query-based split (Finegan-Dollak et al., 2018) ."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-160",
"text": "In the question-based split, SQL queries of the same template appear in both train and test sets."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-161",
"text": "Through the question-based split, we tested how well the model could generate SQL of trained templates from natural language questions."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-162",
"text": "On the contrary, the query-based split ensures that SQL queries of the same template only appear in either train or test set."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-163",
"text": "Through the query-based split, we evaluated how well the model can generalize to unseen query templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-164",
"text": "For the query-based split, we studied our model in two scenarios: zero-shot and one-shot."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-165",
"text": "In the zero-shot scenario, we trained the model with the training dataset and evaluated it with the test dataset, which is the same setting used by previous approaches."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-166",
"text": "In the one-shot scenario, we first trained the model with the training dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-167",
"text": "Then, we sampled a single example from each SQL template in the test dataset for adaptation."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-168",
"text": "Finally, we evaluated our adapted model with the remaining test dataset."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-169",
"text": "Through the one-shot scenario, we examined how well our model adapts to the unseen templates of SQL from a single example."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-171",
"text": "**BASELINES**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-172",
"text": "We compare our results with three previous approaches: a sequence-to-sequence model from Iyer et al. (2017), and the template-based model and another sequence-to-sequence model from Finegan-Dollak et al. (2018)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-173",
"text": "Iyer et al. (2017) proposed an encoder-decoder model with global attention (Luong et al., 2015) to directly generate a sequence of SQL tokens from a natural language question."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-174",
"text": "Finegan-Dollak et al. (2018) proposed a template based model using a bi-directional LSTM."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-175",
"text": "The LSTM output for each word was used to predict whether the word is one of the variables or not, and the last hidden state of the LSTM was used to predict the template of SQL."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-176",
"text": "They also proposed a sequence-to-sequence model with attention (Bahdanau et al., 2014) and copying mechanism (Gu et al., 2016) to copy variables in the natural language tokens to the SQL output."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-177",
"text": "To test the one-shot scenario, because previous approaches cannot perform adaptation without retraining, we added the one-shot examples to the training dataset and retrained each of the previous models."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-178",
"text": "On the contrary, we did not retrain our model but just added one-shot examples to the candidate set as mentioned in Section 3.3."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-179",
"text": "Table 2 shows the results of the query generation accuracy for the question-based split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-180",
"text": "Table 3 shows the query generation accuracy for the query-based split both in a zero-shot setting and a one-shot setting."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-182",
"text": "**RESULTS AND ANALYSIS**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-184",
"text": "**COMPARISON TO PREVIOUS APPROACHES**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-185",
"text": "Question-based Split For the question-based split, our model outperformed the state-of-the-art approaches in every benchmark."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-186",
"text": "Our model shows a 3-27% gain in query generation accuracy compared to the sequence-to-sequence model, a 5-9% gain compared to the template-based model (Finegan-Dollak et al., 2018), and a 15-56% gain compared to Iyer et al. (2017)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-187",
"text": "The result demonstrates that our model is more efficient in generating SQL of the trained templates than the previous approaches."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-188",
"text": "Table 3: SQL generation accuracy on Advising, ATIS, GeoQuery, and Scholar for the query-based split in a zero-shot setting (\"0\" column) and a one-shot setting (\"1\" column)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-189",
"text": "Query-based Split Although our approach cannot generate SQL for unseen templates in the zero-shot setting, we observed that it could adapt well to new SQL templates given just a single example, without additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-190",
"text": "Sequence-to-sequence models (Iyer et al., 2017; Finegan-Dollak et al., 2018) , as shown in the Table 3 , showed poor performance for the query-based split in the zero-shot setting."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-191",
"text": "The model from Finegan-Dollak et al. (2018) showed accuracies of 0%, 32%, 20%, and 5% on the four benchmarks, and the model of Iyer et al. (2017) showed 1%, 17%, 40%, and 3%, meaning that they also lack the capability to generate SQL for unseen templates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-192",
"text": "In the one-shot setting, where one example is added for each new template, our approach outperformed the previous ones on every benchmark."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-193",
"text": "Our model outperforms the sequence-to-sequence model (Finegan-Dollak et al., 2018) by 1-60%, the template-based model (Finegan-Dollak et al., 2018) by 17-52%, and the model of Iyer et al. (2017) by 14-62%."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-194",
"text": "It should be noted that previous models were retrained with one-shot examples as they cannot adapt to unseen templates of SQL without additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-195",
"text": "In contrast, we did not retrain our model but merely added one-shot examples to the candidate set in memory."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-196",
"text": "This result demonstrates that our model is considerably more efficient at adapting to new SQL templates than the previous approaches, even in the absence of any additional training."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-197",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-198",
"text": "**ABLATION ANALYSIS**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-199",
"text": "To examine how effectively the Candidate Search Network (CSN) and the Matching Network perform together, we conducted an ablation analysis for the query-based split in the one-shot setting."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-200",
"text": "The results are shown in Table 4."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-201",
"text": "We report the classification accuracy for each of the following configurations:"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-202",
"text": "1) the Matching Network with the Candidate Search Network (CSN), as described in this paper; 2) only the Matching Network, using the full candidate set as the support set instead of the n-way support set; and 3) only the CSN, taking the top-1 most relevant template as the predicted template. By combining the Matching Network with the CSN, we achieved a 19.0-42.0% performance gain compared to when only the Matching Network was used."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-203",
"text": "This result demonstrates that our proposed CSN plays a key role in enabling the Matching Network to be utilized for classifying templates from a large number of possibilities."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-204",
"text": "Compared to when only the CSN was used, we achieved 9.9-15.3% performance gain."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-205",
"text": "Table 5: Breakdown of results for our approach for both the question-based split ('?' column) and the query-based split ('Q' column)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-206",
"text": "For the query-based split, we show the result from the one-shot setting."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-207",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-208",
"text": "**BREAKDOWN ANALYSIS**"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-209",
"text": "We performed a breakdown analysis for both the question-based split and the query-based split with a one-shot setting."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-210",
"text": "Table 5 shows the accuracy of each module of our model."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-211",
"text": "Our approach consists of two parts: SQL template classification and slot filling."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-212",
"text": "In addition, our template classification model consists of a Candidate Search Network (CSN) and a Matching Network."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-213",
"text": "For the CSN, accuracy is determined based on the inclusion of the actual label among the n candidates."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-214",
"text": "For the Matching Network, we only report classification accuracy when CSN chooses the n candidates correctly."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-215",
"text": "Regarding the slot filling model, we only count it as correct when all the variables in the template are chosen correctly."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-216",
"text": "In every benchmark, template classification was a more difficult part than slot filling in both the question-based split and the query-based split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-217",
"text": "The performance of the slot filling model did not degrade significantly from the question-based to the query-based split (2-12%)."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-218",
"text": "By contrast, the template classification performance dropped by 9-35%."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-219",
"text": "CSN was able to find the top-15 most relevant templates almost perfectly (94-100%) in the question-based split but the accuracy dropped by 6-19% in the query-based split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-220",
"text": "Finally, the Matching Network showed 71-91% accuracy in the question-based split and 48-76% accuracy in the query-based split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-223",
"text": "In this paper, we proposed a one-shot learning model for the text-to-SQL generation that enables the model to adapt to the new template of SQL based on a single example."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-224",
"text": "Our approach works in two phases: 1) SQL template classification and 2) slot-filling."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-225",
"text": "For template classification, we proposed a novel Candidate Search Network that chooses the top-n most relevant SQL templates from the entire templates to build a support set."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-226",
"text": "Subsequently, we applied a Matching Network to classify the template based on the support set."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-227",
"text": "For the slot-filling, we applied a Pointer Network to fill the variable slots of the predicted template."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-228",
"text": "We evaluated our model in two aspects."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-229",
"text": "We tested the SQL generation accuracy for the trained templates with question-based split and the adaptability to the SQL of new templates with querybased split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-230",
"text": "Experimental results showed that our approach outperforms state-of-the-art models for the question-based split."
},
{
"sent_id": "c4cc8d4013b0259eb626d06750e4ab-C001-231",
"text": "In addition, we demonstrated that our model could efficiently generate SQL of untrained templates from a single example, without any additional training."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"c4cc8d4013b0259eb626d06750e4ab-C001-4"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-20"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-52"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-62"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-65"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-190",
"c4cc8d4013b0259eb626d06750e4ab-C001-191"
]
],
"cite_sentences": [
"c4cc8d4013b0259eb626d06750e4ab-C001-4",
"c4cc8d4013b0259eb626d06750e4ab-C001-20",
"c4cc8d4013b0259eb626d06750e4ab-C001-52",
"c4cc8d4013b0259eb626d06750e4ab-C001-62",
"c4cc8d4013b0259eb626d06750e4ab-C001-65",
"c4cc8d4013b0259eb626d06750e4ab-C001-190",
"c4cc8d4013b0259eb626d06750e4ab-C001-191"
]
},
"@MOT@": {
"gold_contexts": [
[
"c4cc8d4013b0259eb626d06750e4ab-C001-3",
"c4cc8d4013b0259eb626d06750e4ab-C001-4",
"c4cc8d4013b0259eb626d06750e4ab-C001-5"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-65",
"c4cc8d4013b0259eb626d06750e4ab-C001-66",
"c4cc8d4013b0259eb626d06750e4ab-C001-67",
"c4cc8d4013b0259eb626d06750e4ab-C001-68",
"c4cc8d4013b0259eb626d06750e4ab-C001-69"
]
],
"cite_sentences": [
"c4cc8d4013b0259eb626d06750e4ab-C001-4",
"c4cc8d4013b0259eb626d06750e4ab-C001-65",
"c4cc8d4013b0259eb626d06750e4ab-C001-66"
]
},
"@EXT@": {
"gold_contexts": [
[
"c4cc8d4013b0259eb626d06750e4ab-C001-20",
"c4cc8d4013b0259eb626d06750e4ab-C001-21"
]
],
"cite_sentences": [
"c4cc8d4013b0259eb626d06750e4ab-C001-20"
]
},
"@DIF@": {
"gold_contexts": [
[
"c4cc8d4013b0259eb626d06750e4ab-C001-29",
"c4cc8d4013b0259eb626d06750e4ab-C001-31"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-65",
"c4cc8d4013b0259eb626d06750e4ab-C001-66",
"c4cc8d4013b0259eb626d06750e4ab-C001-67",
"c4cc8d4013b0259eb626d06750e4ab-C001-68",
"c4cc8d4013b0259eb626d06750e4ab-C001-69"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-81",
"c4cc8d4013b0259eb626d06750e4ab-C001-82",
"c4cc8d4013b0259eb626d06750e4ab-C001-83"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-186"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-190",
"c4cc8d4013b0259eb626d06750e4ab-C001-191",
"c4cc8d4013b0259eb626d06750e4ab-C001-192",
"c4cc8d4013b0259eb626d06750e4ab-C001-193"
]
],
"cite_sentences": [
"c4cc8d4013b0259eb626d06750e4ab-C001-31",
"c4cc8d4013b0259eb626d06750e4ab-C001-65",
"c4cc8d4013b0259eb626d06750e4ab-C001-66",
"c4cc8d4013b0259eb626d06750e4ab-C001-81",
"c4cc8d4013b0259eb626d06750e4ab-C001-186",
"c4cc8d4013b0259eb626d06750e4ab-C001-190",
"c4cc8d4013b0259eb626d06750e4ab-C001-191",
"c4cc8d4013b0259eb626d06750e4ab-C001-193"
]
},
"@USE@": {
"gold_contexts": [
[
"c4cc8d4013b0259eb626d06750e4ab-C001-38"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-146"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-150"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-152"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-159"
],
[
"c4cc8d4013b0259eb626d06750e4ab-C001-172"
]
],
"cite_sentences": [
"c4cc8d4013b0259eb626d06750e4ab-C001-38",
"c4cc8d4013b0259eb626d06750e4ab-C001-146",
"c4cc8d4013b0259eb626d06750e4ab-C001-150",
"c4cc8d4013b0259eb626d06750e4ab-C001-152",
"c4cc8d4013b0259eb626d06750e4ab-C001-159",
"c4cc8d4013b0259eb626d06750e4ab-C001-172"
]
},
"@SIM@": {
"gold_contexts": [
[
"c4cc8d4013b0259eb626d06750e4ab-C001-81"
]
],
"cite_sentences": [
"c4cc8d4013b0259eb626d06750e4ab-C001-81"
]
}
}
},
"ABC_7700b6c3c096d5cd7999c34e7614f7_2": {
"x": [
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-2",
"text": "We study the problem of automatically building hypernym taxonomies from textual and visual data."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-46",
"text": "In Kiela et al. (2015) , both textual and visual evidences are exploited to detect pairwise lexical entailments."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-122",
"text": "Moreover, we leverage the distributed representations of images and words to construct compact and effective features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-221",
"text": "Second, their method is designed to induce only pairwise relations."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-3",
"text": "Previous works in taxonomy induction generally ignore the increasingly prominent visual data, which encode important perceptual semantics."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-4",
"text": "Instead, we propose a probabilistic model for taxonomy induction by jointly leveraging text and images."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-5",
"text": "To avoid hand-crafted feature engineering, we design end-to-end features based on distributed representations of images and words."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-6",
"text": "The model is discriminatively trained given a small set of existing ontologies and is capable of building full taxonomies from scratch for a collection of unseen conceptual label items with associated images."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-7",
"text": "We evaluate our model and features on the WordNet hierarchies, where our system outperforms previous approaches by a large gap."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-10",
"text": "Human knowledge is naturally organized as semantic hierarchies."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-11",
"text": "For example, in WordNet (Miller, 1995) , specific concepts are categorized and assigned to more general ones, leading to a semantic hierarchical structure (a.k.a taxonomy)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-12",
"text": "A variety of NLP tasks, such as question answering (Harabagiu et al., 2003) , document clustering (Hotho et al., 2002) and text generation (Biran and McKeown, 2013) can benefit from the conceptual relationship present in these hierarchies."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-13",
"text": "Traditional methods of manually constructing taxonomies by experts (e.g. WordNet) and interest communities (e.g. Wikipedia) are either knowledge or time intensive, and the results have limited coverage."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-14",
"text": "Therefore, automatic induction of taxonomies is drawing increasing attention in both NLP and computer vision."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-15",
"text": "On one hand, a number of methods have been developed to build hierarchies based on lexical patterns in text (Yang and Callan, 2009; Snow et al., 2006; Kozareva and Hovy, 2010; Navigli et al., 2011; Fu et al., 2014; Bansal et al., 2014; Tuan et al., 2015) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-16",
"text": "These works generally ignore the rich visual data which encode important perceptual semantics (Bruni et al., 2014) and have proven to be complementary to linguistic information and helpful for many tasks (Silberer and Lapata, 2014; Kiela and Bottou, 2014; ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-17",
"text": "On the other hand, researchers have built visual hierarchies by utilizing only visual features (Griffin and Perona, 2008; Yan et al., 2015; Sivic et al., 2008) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-18",
"text": "The resulting hierarchies are limited in interpretability and usability for knowledge transfer."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-19",
"text": "Hence, we propose to combine both visual and textual knowledge to automatically build taxonomies."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-20",
"text": "We induce is-a taxonomies by supervised learning from existing entity ontologies where each concept category (entity) is associated with images, either from existing dataset (e.g. ImageNet (Deng et al., 2009)) or retrieved from the web using search engines, as illustrated in Fig 1."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-21",
"text": "Such a scenario is realistic and can be extended to a variety of tasks; for example, in knowledge base construction ), text and image collections are readily available but label relations among categories are to be uncovered."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-22",
"text": "In largescale object recognition, automatically learning relations between labels can be quite useful (Deng et al., 2014; Zhao et al., 2011) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-23",
"text": "Both textual and visual information provide important cues for taxonomy induction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-24",
"text": "Fig 1 il lustrates this via an example."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-47",
"text": "Our work is significantly different as our model is optimized over the whole taxonomy space rather than considering only word pairs separately."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-25",
"text": "The parent category seafish and its two child categories shark and ray are closely related as: (1) there is a hypernym-hyponym (is-a) relation between the words \"seafish\" and \"shark\"/\"ray\" through text descriptions like \"...seafish, such as shark and ray...\", \"...shark and ray are a group of seafish...\"; (2) images of the close neighbors, e.g., shark and ray are usually visually similar and images of the child, e.g. shark/ray are similar to a subset of images of seafish."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-26",
"text": "To effectively capture these patterns, in contrast to previous works that rely on various hand-crafted features Bansal et al., 2014) , we extract features by leveraging the distributed representations that embed images (Simonyan and Zisserman, 2014) and words as compact vectors, based on which the semantic closeness is directly measured in vector space."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-27",
"text": "Further, we develop a probabilistic framework that integrates the rich multi-modal features to induce \"is-a\" relations between categories, encouraging local semantic consistency that each category should be visually and textually close to its parent and siblings."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-28",
"text": "In summary, this paper has the following contributions: (1) We propose a novel probabilistic Bayesian model (Section 3) for taxonomy induction by jointly leveraging textual and visual data."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-29",
"text": "The model is discriminatively trained and can be directly applied to build a taxonomy from scratch for a collection of semantic labels."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-30",
"text": "(2) We design novel features (Section 4) based on generalpurpose distributed representations of text and images to capture both textual and visual relations between labels."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-31",
"text": "(3) We evaluate our model and features on the ImageNet hierarchies with two different taxonomy induction tasks (Section 5)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-32",
"text": "We achieve superior performance on both tasks and improve the F 1 score by 2x in the taxonomy construction task, compared to previous approaches."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-33",
"text": "Extensive comparisons demonstrate the effectiveness of integrating visual features with language features for taxonomy induction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-34",
"text": "We also provide qualitative analysis on our features, the learned model, and the taxonomies induced to provide further insights (Section 5.3)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-36",
"text": "**RELATED WORK**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-37",
"text": "Many approaches have been recently developed that build hierarchies purely by identifying either lexical patterns or statistical features in text corpora (Yang and Callan, 2009; Snow et al., 2006; Kozareva and Hovy, 2010; Navigli et al., 2011; Zhu et al., 2013; Fu et al., 2014; Bansal et al., 2014; Tuan et al., 2014; Tuan et al., 2015; Kiela et al., 2015) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-38",
"text": "The approaches in Yang and Callan (2009) and Snow et al. (2006) assume a starting incomplete hierarchy and try to extend it by inserting new terms."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-39",
"text": "Kozareva and Hovy (2010) and Navigli et al. (2011) first find leaf nodes and then use lexical patterns to find intermediate terms and all the attested hypernymy links between them."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-40",
"text": "In (Tuan et al., 2014) , syntactic contextual similarity is exploited to construct the taxonomy, while Tuan et al. (2015) go one step further to consider trustiness and collective synonym/contrastive evidence."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-41",
"text": "Different from them, our model is discriminatively trained with multi-modal data."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-42",
"text": "The works of Fu et al. (2014) and Bansal et al. (2014) use similar language-based features as ours."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-43",
"text": "Specifically, in (Fu et al., 2014) , linguistic regularities between pretrained word vectors are modeled as projection mappings."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-44",
"text": "The trained projection matrix is then used to induce pairwise hypernym-hyponym relations between words."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-45",
"text": "Our features are partially motivated by Fu et al. (2014) , but we jointly leverage both textual and visual information."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-48",
"text": "In (Bansal et al., 2014) , a structural learning model is developed to induce a globally optimal hierarchy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-49",
"text": "Compared with this work, we exploit much richer features from both text and images, and leverage distributed representations instead of hand-crafted features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-50",
"text": "Several approaches (Griffin and Perona, 2008; Bart et al., 2008; Marsza\u0142ek and Schmid, 2008) have also been proposed to construct visual hierarchies from image collections."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-51",
"text": "In (Bart et al., 2008) , a nonparametric Bayesian model is developed to group images based on low-level features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-52",
"text": "In (Griffin and Perona, 2008) and (Marsza\u0142ek and Schmid, 2008) , a visual taxonomy is built to accelerate image categorization."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-53",
"text": "In , only binary object-object relations are extracted using co-detection matrices."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-54",
"text": "Our work differs from all of these as we integrate textual with visual information to construct taxonomies."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-55",
"text": "Also of note are several works that integrate text and images as evidence for knowledge base autocompletion (Bordes et al., 2011) and zeroshot recognition (Gan et al., 2015; Socher et al., 2013) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-56",
"text": "Our work is different because our task is to accurately construct multilevel hyponym-hypernym hierarchies from a set of (seen or unseen) categories."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-58",
"text": "**TAXONOMY INDUCTION MODEL**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-59",
"text": "Our model is motivated by the key observation that in a semantically meaningful taxonomy, a category tends to be closely related to its children as well as its siblings."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-60",
"text": "For instance, there exists a hypernym-hyponym relation between the name of category shark and that of its parent seafish."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-61",
"text": "Besides, images of shark tend to be visually similar to those of ray, both of which are seafishes."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-62",
"text": "Our model is thus designed to encourage such local semantic consistency; and by jointly considering all categories in the inference, a globally optimal structure is achieved."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-63",
"text": "A key advantage of the model is that we incorporate both visual and textual features induced from distributed representations of images and text (Section 4)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-64",
"text": "These features capture the rich underlying semantics and facilitate taxonomy induction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-65",
"text": "We further distinguish the relative importance of visual and textual features that could vary in different layers of a taxonomy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-66",
"text": "Intuitively, visual features would be increasingly indicative in the deeper layers, as sub-categories under the same category of specific objects tend to be visually similar."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-67",
"text": "In contrast, textual features would be more important when inducing hierarchical relations between the categories of general concepts (i.e. in the near-root layers) where visual characteristics are not necessarily similar."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-69",
"text": "**THE PROBLEM**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-70",
"text": "Assume a set of N categories x = {x 1 , x 2 , . . . , x N }, where each category x n consists of a text term t n as its name, as well as a set of images i n = {i 1 , i 2 , . . . }."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-71",
"text": "Our goal is to construct a taxonomy tree T over these categories 1 , such that categories of specific object types (e.g. shark) are grouped and assigned to those of general concepts (e.g. seafish)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-72",
"text": "As the categories in x may be from multiple disjoint taxonomy trees, we add a pseudo category x 0 as the hyper-root so that the optimal taxonomy is ensured to be a single tree."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-73",
"text": "Let z n \u2208 {1, . . . , N } be the index of the parent of category x n , i.e. x zn is the hypernymic category of x n ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-74",
"text": "Thus the problem of inducing a taxonomy structure is equivalent to inferring the conditional distribution p(z|x) over the set of (latent) indices z = {z 1 , . . . , z n }, based on the images and text."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-76",
"text": "**MODEL**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-77",
"text": "We formulate the distribution p(z|x) through a model which leverages rich multi-modal features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-78",
"text": "Specifically, let c n be the set of child nodes of category x n in a taxonomy encoded by z. Our model is defined as"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-79",
"text": "where g w (x n , x n , c n \\x n ), defined as"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-80",
"text": "measures the semantic consistency between category x n , its parent x n as well as its siblings indexed by c n \\x n ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-81",
"text": "The function g w (\u00b7) is loglinear with respect to f n,n ,cn\\x n , which is the feature vector defined over the set of relevant categories (x n , x n , c n \\x n ), with c n \\x n being the set of child categories excluding x n (Section 4)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-82",
"text": "The simple exponential formulation can effectively encourage close relations among nearby categories in the induced taxonomy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-83",
"text": "The function has combination weights w = {w 1 , . . . , w L }, where L is the maximum depth of the taxonomy, to capture the importance of different features, and the function d(x n ) to return the depth of x n in the current taxonomy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-84",
"text": "Each layer l (1 \u2264 l \u2264 L) of the taxonomy has a specific w l thereby allowing varying weights of the same features in different layers."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-85",
"text": "The parameters are learned in a supervised manner."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-86",
"text": "In eq 1, we also introduce a weight \u03c0 n for each node x n , in order to capture the varying popularity of different categories (in terms of being a parent category)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-87",
"text": "For example, some categories like plant can have a large number of sub-categories, while others such as stone have less."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-88",
"text": "We model \u03c0 as a multinomial distribution with Dirichlet prior \u03b1 = (\u03b1 1 , . . . , \u03b1 N ) to encode any prior knowledge of the category popularity 2 ; and the conjugacy allows us to marginalize out \u03c0 analytically to get"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-89",
"text": "where q n is the number of children of category x n ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-90",
"text": "Next, we describe our approach to infer the expectation for each z n , and based on that select a particular taxonomy structure for the category nodes x. As z is constrained to be a tree (i.e. cycle without loops), we include with eq 2, an indicator factor 1(z) that takes 1 if z corresponds a tree and 0 otherwise."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-91",
"text": "We modify the inference algorithm appropriately to incorporate this constraint."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-92",
"text": "Inference."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-93",
"text": "Exact inference is computationally intractable due to the normalization constant of eq 2."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-94",
"text": "We therefore use Gibbs Sampling, a procedure for approximate inference."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-95",
"text": "Here we present the sampling formula for each z n directly, and defer the details to the supplementary material."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-96",
"text": "The sampling procedure is highly efficient because the normalization term and the factors that are irrelevant to z n are cancelled out."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-97",
"text": "The formula is"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-98",
"text": "where q m is the number of children of category m; the superscript \u2212n denotes the number excluding x n ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-99",
"text": "Examining the validity of the taxonomy structure (i.e. the tree indicator) in each sampling step can be computationally prohibitive."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-100",
"text": "To handle this, we restrict the candidate value of z n in eq 3, ensuring that the new z n is always a tree."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-101",
"text": "Specifically, given a tree T , we define a structure operation as the procedure of detaching one node x n in T from its parent and appending it to another node x m which is not a descendant of x n ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-102",
"text": "Proposition 1."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-103",
"text": "(1) Applying a structure operation on a tree T will result in a structure that is still a tree."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-104",
"text": "(2) Any tree structure over the node set x that has the same root node with tree T can be achieved by applying structure operation on T a finite number of times."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-105",
"text": "2 \u03b1 could be estimated using training data."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-106",
"text": "The proof is straightforward and we omit it due to space limitations."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-107",
"text": "We also add a pseudo node x 0 as the fixed root of the taxonomy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-108",
"text": "Hence by initializing a tree-structured state rooted at x 0 and restricting each updating step as a structure operation, our sampling procedure is able to explore the whole valid tree space."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-109",
"text": "Output taxonomy selection."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-110",
"text": "To apply the model to discover the underlying taxonomy from a given set of categories, we first obtain the marginals of z by averaging over the samples generated through eq 3, then output the optimal taxonomy z * by finding the maximum spanning tree (MST) using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Bansal et al., 2014) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-111",
"text": "Training."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-112",
"text": "We need to learn the model parameters w l of each layer l, which capture the relative importance of different features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-113",
"text": "The model is trained using the EM algorithm."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-114",
"text": "Let (x n ) be the depth (layer) of category x n ; andz (siblings c n ) denote the gold structure in training data."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-115",
"text": "Our training algorithm updates w through maximum likelihood estimation, wherein the gradient of w l is (see the supplementary materials for details):"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-116",
"text": "which is the net difference between gold feature vectors and expected feature vectors as per the model."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-117",
"text": "The expectation is approximated by collecting samples using the sampler described above and averaging them."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-119",
"text": "**FEATURES**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-120",
"text": "In this section, we describe the feature vector f used in our model, and defer more details in the supplementary material."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-121",
"text": "Compared to previous taxonomy induction works which rely purely on linguistic information, we exploit both perceptual and textual features to capture the rich spectrum of semantics encoded in images and text."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-123",
"text": "Specifically, each image i is represented as an embedding vector v i \u2208 R a extracted by deep convolutional neural networks."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-124",
"text": "Such image representation has been successfully applied in various vision tasks."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-125",
"text": "On the other hand, the category name t is represented by its word embedding v t \u2208 R b , a low-dimensional dense vector induced by the Skip-gram model which is widely used in diverse NLP applications too."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-126",
"text": "Then we design f (x n , x n , c n \\x n ) based on the above image and text representations."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-127",
"text": "The feature vector f is used to measure the local semantic consistency between category x n and its parent category x n as well as its siblings c n \\x n ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-129",
"text": "**IMAGE FEATURES**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-130",
"text": "Sibling similarity."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-131",
"text": "As mentioned above, close neighbors in a taxonomy tend to be visually similar, indicating that the embedding of images of sibling categories should be close to each other in the vector space R a ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-132",
"text": "For a category x n and its image set i n , we fit a Gaussian distribution N (v in , \u03a3 n ) to the image vectors, where v in \u2208 R a is the mean vector and \u03a3 n \u2208 R a\u00d7a is the covariance matrix."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-133",
"text": "For a sibling category x m of x n , we define the visual similarity between x n and x m as"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-134",
"text": "which is the average probability of the mean image vector of one category under the Gaussian distribution of the other."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-135",
"text": "This takes into account not only the distance between the mean images, but also the closeness of the images of each category."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-136",
"text": "Accordingly, we compute the visual similarity between x n and the set c n \\x n by averaging:"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-137",
"text": "We then bin the values of vissim(x n , c n \\x n ) and represent it as an one-hot vector, which constitutes f as a component named as siblings imageimage relation feature (denoted as S-V1 3 )."
},
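The sibling visual-similarity feature above can be sketched as follows. This is a minimal illustration with hypothetical helper names; it assumes a diagonal covariance (which the paper's implementation details also adopt to prevent overfitting) and illustrative bin edges, neither of which is specified for this feature in the text.

```python
import numpy as np

def fit_gaussian(image_vecs):
    # Fit a diagonal Gaussian to a category's image embeddings.
    mu = image_vecs.mean(axis=0)
    var = image_vecs.var(axis=0) + 1e-6  # diagonal covariance, regularized
    return mu, var

def log_pdf(x, mu, var):
    # Log density of x under the diagonal Gaussian N(mu, diag(var)).
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def vissim(mu_n, var_n, mu_m, var_m):
    # Average probability of the mean image vector of one category
    # under the Gaussian distribution of the other (symmetric by design).
    return 0.5 * (np.exp(log_pdf(mu_n, mu_m, var_m)) +
                  np.exp(log_pdf(mu_m, mu_n, var_n)))

def one_hot_bin(value, edges):
    # Bin a scalar feature value into a one-hot vector (S-V1 component).
    vec = np.zeros(len(edges) + 1)
    vec[np.searchsorted(edges, value)] = 1.0
    return vec
```

Averaging `vissim` over all siblings in c_n\x_n and binning the result yields the S-V1 block of f.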
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-138",
"text": "Parent prediction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-139",
"text": "Similar to feature S-V1, we also create the similarity feature between the image vectors of the parent and child, to measure their visual similarity."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-140",
"text": "However, the parent node is usually a more general concept than the child, and it usually consists of images that are not necessarily similar to its child."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-141",
"text": "Intuitively, by narrowing the set of images to those that are most similar to its child improves the feature."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-142",
"text": "Therefore, different from S-V1, when estimating the Gaussian distribution of the parent node, we only use the top K images with highest probabilities under the Gaussian distribution of the child node."
},
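The top-K filtering step for PC-V1 can be sketched as below. Function names are illustrative; it reuses the diagonal-Gaussian convention from the implementation details and simply scores each parent image under the child's Gaussian before refitting.

```python
import numpy as np

def topk_parent_gaussian(parent_imgs, child_mu, child_var, k):
    # Score each parent image under the child's diagonal Gaussian,
    # keep only the top-K highest-probability images, and fit the
    # parent's Gaussian to those images alone.
    logp = -0.5 * np.sum(
        np.log(2 * np.pi * child_var)
        + (parent_imgs - child_mu) ** 2 / child_var, axis=1)
    top = parent_imgs[np.argsort(logp)[-k:]]
    mu = top.mean(axis=0)
    var = top.var(axis=0) + 1e-6  # diagonal covariance, regularized
    return mu, var
```

The resulting parent Gaussian then enters the same visual-similarity computation used for S-V1.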
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-143",
"text": "We empirically show in section 5.3 that choosing an appropriate K consistently boosts the performance."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-144",
"text": "We name this feature as parent-child image-image relation feature (denoted as PC-V1)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-145",
"text": "Further, inspired by the linguistic regularities of word embedding, i.e. the hypernym-hyponym relationship between words can be approximated by a linear projection operator between word vectors Fu et al., 2014) , we design a similar strategy to (Fu et al., 2014) between images and words so that the parent can be \"predicted\" given the image embedding of its child category and the projection matrix."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-146",
"text": "Specifically, let (x n , x n ) be a parent-child pair in the training data, we learn a projection matrix \u03a6 which minimizes the distance between \u03a6v i n (i.e. the projected mean image vector v i n of the child) and v tn (i.e. the word embedding of the parent):"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-147",
"text": "where N is the number of parent-child pairs in the training data."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-148",
"text": "Once the projection matrix has been learned, the similarity between a child node x n and its parent x n is computed as \u03a6v i n \u2212 v tn , and we also create an one-hot vector by binning the feature value."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-149",
"text": "We call this feature as parentchild image-word relation feature (PC-V2)."
},
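The projection objective above admits a closed-form least-squares solution, sketched here as an illustrative stand-in: the paper only specifies the objective, so the ridge regularizer and the function names are assumptions.

```python
import numpy as np

def learn_projection(child_img_means, parent_word_vecs, reg=1e-3):
    # Learn Phi minimizing sum_n ||Phi v_{i_n} - v_{t_n}||^2 over the N
    # parent-child training pairs, via ridge-regularized least squares:
    #   Phi = (Y^T X)(X^T X + reg * I)^{-1}
    X = child_img_means          # (N, a): mean image vectors of children
    Y = parent_word_vecs         # (N, b): word embeddings of parents
    a = X.shape[1]
    return Y.T @ X @ np.linalg.inv(X.T @ X + reg * np.eye(a))  # (b, a)

def pc_v2_score(Phi, child_img_mean, parent_word_vec):
    # Distance between the projected child image vector and the parent
    # word embedding; smaller means a better parent fit (binned for PC-V2).
    return np.linalg.norm(Phi @ child_img_mean - parent_word_vec)
```

In practice the scalar `pc_v2_score` would be binned into a one-hot vector exactly as for PC-V1.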
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-151",
"text": "**WORD FEATURES**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-152",
"text": "We briefly introduce the text features employed."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-153",
"text": "More details about the text feature extraction could be found in the supplementary material."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-154",
"text": "Word embedding features.d PC-V1, We induce features using word vectors to measure both sibling-sibling and parent-child closeness in text domain (Fu et al., 2014) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-155",
"text": "One exception is that, as each category has only one word, the sibling similarity is computed as the cosine distance between two word vectors (instead of mean vectors)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-156",
"text": "This will produce another two parts of features, parentchild word-word relation feature (PC-T1) and siblings word-word relation feature (S-T1)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-157",
"text": "Word surface features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-158",
"text": "In addition to the embedding-based features, we further leverage lexical features based on the surface forms of child/parent category names."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-159",
"text": "Specifically, we employ the Capitalization, Ends with, Contains, Suffix match, LCS and Length different features, which are commonly used in previous works in taxonomy induction (Yang and Callan, 2009; Bansal et al., 2014) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-161",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-162",
"text": "We first disclose our implementation details in section 5.1 and the supplementary material for better reproducibility."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-163",
"text": "We then compare our model with previous state-of-the-art methods (Fu et al., 2014; Bansal et al., 2014) with two taxonomy induction tasks."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-164",
"text": "Finally, we provide analysis on the weights and taxonomies induced."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-166",
"text": "**IMPLEMENTATION DETAILS**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-167",
"text": "Dataset."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-168",
"text": "We conduct our experiments on the ImageNet2011 dataset (Deng et al., 2009 ), which provides a large collection of category items (synsets), with associated images and a label hierarchy (sampled from WordNet) over them."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-169",
"text": "The original ImageNet taxonomy is preprocessed, resulting in a tree structure with 28231 nodes."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-170",
"text": "Word embedding training."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-171",
"text": "We train word embedding for synsets by replacing each word/phrase in a synset with a unique token and then using Google's word2vec tool ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-172",
"text": "We combine three public available corpora together, including the latest Wikipedia dump (Wikipedia, 2014) , the One Billion Word Language Modeling Benchmark (Chelba et al., 2013) and the UMBC webbase corpus (Han et al., 2013) , resulting in a corpus with total 6 billion tokens."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-173",
"text": "The dimension of the embedding is set to 200."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-174",
"text": "Image processing."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-175",
"text": "we employ the ILSVRC12 pre-trained convolutional neural networks (Simonyan and Zisserman, 2014) to embed each image into the vector space."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-176",
"text": "Then, for each category x n with images, we estimate a multivariate Gaussian parameterized by N xn = (\u00b5 xn , \u03a3 xn ), and constrain \u03a3 xn to be diagonal to prevent overfitting."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-177",
"text": "For categories with very few images, we only estimate a mean vector \u00b5 xn ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-178",
"text": "For nodes that do not have images, we ignore the visual feature."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-179",
"text": "Training configuration."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-180",
"text": "The feature vector is a concatenation of 6 parts, as detailed in section 4."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-181",
"text": "All pairwise distances are precomputed and stored in memory to accelerate Gibbs sampling."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-182",
"text": "The initial learning rate for gradient descent in the M step is set to 0.1, and is decreased by a fraction of 10 every 100 EM iterations."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-184",
"text": "**EVALUATION**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-185",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-186",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-187",
"text": "We evaluate our model on three subtrees sampled from the ImageNet taxonomy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-188",
"text": "To collect the subtrees, we start from a given root (e.g. consumer goods) and traverse the full taxonomy using BFS, and collect all descendant nodes within a depth h (number of nodes in the longest path)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-189",
"text": "We vary h Table 1 : Statistics of our evaluation set."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-190",
"text": "The bottom 4 rows give the number of nodes within each height h \u2208 {4, 5, 6, 7}. The scale of the threes range from small to large, and there is no overlapping among them."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-191",
"text": "to get a series of subtrees with increasing heights h \u2208 {4, 5, 6, 7} and various scales (maximally 1326 nodes) in different domains."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-192",
"text": "The statistics of the evaluation sets are provided in Table 1 ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-193",
"text": "To avoid ambiguity, all nodes used in ILSVRC 2012 are removed as the CNN feature extractor is trained on them."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-194",
"text": "We design two different tasks to evaluate our model."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-195",
"text": "(1) In the hierarchy completion task, we randomly remove some nodes from a tree and use the remaining hierarchy for training."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-196",
"text": "In the test phase, we infer the parent of each removed node and compare it with groundtruth."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-197",
"text": "This task is designed to figure out whether our model can successfully induce hierarchical relations after learning from within-domain parent-child pairs."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-220",
"text": "First, we take into account not only parent-child relations but also siblings."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-198",
"text": "(2) Different from the previous one, the hierarchy construction task is designed to test the generalization ability of our model, i.e. whether our model can learn statistical patterns from one hierarchy and transfer the knowledge to build a taxonomy for another collection of out-of-domain labels."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-199",
"text": "Specifically, we select two trees as the training set to learn w. In the test phase, the model is required to build the full taxonomy from scratch for the third tree."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-200",
"text": "We use Ancestor F 1 as our evaluation metric (Kozareva and Hovy, 2010; Navigli et al., 2011; Bansal et al., 2014) ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-201",
"text": "Specifically, we measure F 1 = 2P R/(P + R) values of predicted \"is-a\" relations where the precision (P) and recall (R) are:"
},
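The Ancestor-F1 computation can be sketched directly from this definition; the representation of "is-a" relations as (ancestor, descendant) pairs is an assumption of this sketch.

```python
def ancestor_f1(predicted, gold):
    # predicted, gold: sets of (ancestor, descendant) "is-a" pairs.
    # Precision P = |pred ∩ gold| / |pred|, recall R = |pred ∩ gold| / |gold|,
    # F1 = 2PR / (P + R).
    if not predicted or not gold:
        return 0.0
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    p = correct / len(predicted)
    r = correct / len(gold)
    return 2 * p * r / (p + r)
```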
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-202",
"text": "We compare our method to two previously state-of-the-art models by Fu et al. (2014) and Bansal et al. (2014) , which are closest to ours."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-203",
"text": "Table 2 : Comparisons among different variants of our model, Fu et al. (2014) and Bansal et al. (2014) on two tasks."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-204",
"text": "The ancestor-F 1 scores are reported."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-206",
"text": "**RESULTS**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-207",
"text": "Hierarchy completion."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-208",
"text": "In the hierarchy completion task, we split each tree into 70% nodes for training and 30% for test, and experiment with different h. We compare the following three systems: (1) Fu2014 4 (Fu et al., 2014) ; (2) Ours (L):"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-209",
"text": "Our model with only language features enabled (i.e. surface features, parent-child word-word relation feature and siblings word-word relation feature); (3) Ours (LV): Our model with both language features and visual features 5 ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-210",
"text": "The average performance on three trees are reported at Table 2."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-211",
"text": "We observe that the performance gradually drops when h increases, as more nodes are inserted when the tree grows higher, leading to a more complex and difficult taxonomy to be accurately constructed."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-212",
"text": "Overall, our model outperforms Fu2014 in terms of the F 1 score, even without visual features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-213",
"text": "In the most difficult case with h = 7, our model still holds an F 1 score of 0.42 (2\u00d7 of Fu2014), demonstrating the superiority of our model."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-214",
"text": "Hierarchy construction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-215",
"text": "The hierarchy construction task is much more difficult than hierarchy completion task because we need to build a taxonomy from scratch given only a hyper-root."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-216",
"text": "For this task, we use a leave-one-out strategy, i.e. we train our model on every two trees and test on the third, and report the average performance in Table 2 ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-217",
"text": "We compare the following methods: (1) Fu2014, (2) Ours (L), and (3) Ours (LV), as described above; (4) Bansal2014: The model by Bansal et al. (2014) retrained using our dataset; (5) Ours (LB): By excluding visual features, but including other language features from Bansal et al. (2014) ; (6) Ours (LVB): Our full model further enhanced with all semantic features from Bansal et al. (2014) ; (7) Ours (LVB -E): By excluding word embeddingbased language features from Ours (LVB)."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-218",
"text": "As shown, on the hierarchy construction task, our model with only language features still outperforms Fu2014 with a large gap (0.30 compared to 0.18 when h = 7), which uses similar embeddingbased features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-219",
"text": "The potential reasons are two-fold."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-222",
"text": "To build the full taxonomy, they first identify all possible pairwise relations using a simple thresholding strategy and then eliminate conflicted relations to obtain a legitimate tree hierarchy."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-223",
"text": "In contrast, our model is optimized over the full space of all legitimate taxonomies by taking the structure operation in account during Gibbs sampling."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-224",
"text": "When comparing to Bansal2014, our model with only word embedding-based features underperforms theirs."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-225",
"text": "However, when introducing visual features, our performance is comparable (pvalue = 0.058).Furthermore, if we discard visual features but add semantic features from Bansal et al. (2014) , we achieve a slight improvement of 0.02 over Bansal2014 (p-value = 0.016), which is largely attributed to the incorporation of word embedding-based features that encode high-level linguistic regularity."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-226",
"text": "Finally, if we enhance our full model with all semantic features from Bansal et al. (2014) , our model outperforms theirs by a gap of 0.04 (p-value < 0.01), which justifies our intuition that perceptual semantics underneath visual contents are quite helpful."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-227",
"text": "----------------------------------"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-228",
"text": "**QUALITATIVE ANALYSIS**"
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-229",
"text": "In this section, we conduct qualitative studies to investigate how and when the visual information helps the taxonomy induction task."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-230",
"text": "Contributions of visual features."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-231",
"text": "To evaluate the contribution of each part of the visual features to the final performance, we train our model jointly with textual features and different combinations of visual features, and report the ancestor-F 1 scores."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-232",
"text": "As shown in Table 3 ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-233",
"text": "When incorporating the feature S-V1, the performance is substantially boosted by a large gap at all heights, show-S-V1 PC-V1 PC-V2 h = 4 h = 5 h = 6 h = 7 0. ing that visual similarity between sibling nodes is a strong evidence for taxonomy induction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-234",
"text": "It is intuitively plausible, as it is highly likely that two specific categories share a common (and more general) parent category if similar visual contents are observed between them."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-235",
"text": "Further, adding the PC-V1 feature gains us a better improvement than adding PC-V2, but both minor than S-V1."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-236",
"text": "Compared to that of siblings, the visual similarity between parents and children does not strongly holds all the time."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-237",
"text": "For example, images of Terrestrial animal are only partially similar to those of Feline, because the former one contains the later one as a subset."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-238",
"text": "Our feature captures this type of \"contain\" relation between parents and children by considering only the top-K images from the parent category that have highest probabilities under the Gaussian distribution of the child category."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-239",
"text": "To see this, we vary K while keep all other settings, and plot the F 1 scores in Fig 2."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-240",
"text": "We observe a trend that when we gradually increase K, the performance goes up until reaching some maximal; It then slightly drops (or oscillates) even when more images are available, which confirms with our feature design that only top images should be considered in parent-child visual similarity."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-241",
"text": "Overall, the three visual features complement each other, and achieve the highest performance when combined."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-242",
"text": "Visual representations."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-243",
"text": "To investigate how the image representations affect the final performance, we compare the ancestor-F1 score when different pre-trained CNNs are used for visual feature extraction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-244",
"text": "Specifically, we employ both the CNN-128 model (128 dimensional feature with 15.6% top-5 error on ILSVRC12) and the VGG-16 model (4096 dimensional feature with 7.5% top-5 error) by Simonyan and Zisserman (2014) , but only observe a slight improvement of 0.01 on the ancestor-F1 score for the later one."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-245",
"text": "Relevance of textual and visual features v.s. depth of tree."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-246",
"text": "Compared to Bansal et al. (2014) , a major difference of our model is that different layers of the taxonomy correspond to different weights w l , while in (Bansal et al., 2014) all layers share the same weights."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-247",
"text": "Intuitively, introducing layer-wise w not only extends the model capacity, but also differentiates the importance of each feature at different layers."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-248",
"text": "For example, the images of two specific categories, such as shark and ray, are very likely to be visually similar."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-249",
"text": "However, when the taxonomy goes from bottom to up (specific to general), the visual similarity is gradually undermined -images of fish and terrestrial animal are not necessarily similar any more."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-250",
"text": "Hence, it is necessary to privatize the weights w for different layers to capture such variations, i.e. the visual features become more and more evident from shallow to deep layers, while the textual counterparts, which capture more abstract concepts, relatively grow more indicative oppositely from specific to general."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-251",
"text": "To visualize the variations across layers, for each feature component, we fetch its correspond- ing block in w as V ."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-252",
"text": "Then, we average |V | and observe how its values change with the layer depth h. For example, for the parent-child word-word relation feature, we first fetch its corresponding weights V from w as a 20 \u00d7 6 matrix, where 20 is the feature dimension and 6 is the number of layers."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-253",
"text": "We then average its absolute values 6 in column and get a vector v with length 6."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-254",
"text": "After 2 normalization, the magnitude of each entry in v directly reflects the relative importance of the feature as an evidence for taxonomy induction."
},
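The per-layer importance computation just described can be sketched in a few lines; the function name is illustrative.

```python
import numpy as np

def layer_importance(V):
    # V: (feature_dim, num_layers) weight block for one feature component,
    # e.g. 20 x 6 for the parent-child word-word relation feature.
    # Average absolute values column-wise, then L2-normalize, so each
    # entry reflects the feature's relative importance at that layer depth.
    v = np.abs(V).mean(axis=0)
    return v / np.linalg.norm(v)
```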
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-255",
"text": "Fig 3(b) plots how their magnitudes change with h for every feature component averaged on three train/test splits."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-256",
"text": "It is noticeable that for both wordword relations (S-T1, PC-T1), their corresponding weights slightly decrease as h increases."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-257",
"text": "On the contrary, the image-image relation features (S-V1, PC-V1) grows relatively more prominent."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-258",
"text": "The results verify our conjecture that when the category hierarchy goes deeper into more specific classes, the visual similarity becomes relatively more indicative as an evidence for taxonomy induction."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-259",
"text": "Visualizing results."
},
{
"sent_id": "7700b6c3c096d5cd7999c34e7614f7-C001-260",
"text": "Finally, we visualize some excerpts of our predicted taxonomies, as compared to the groundtruth in Fig 4."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7700b6c3c096d5cd7999c34e7614f7-C001-15"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-37"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-48"
]
],
"cite_sentences": [
"7700b6c3c096d5cd7999c34e7614f7-C001-15",
"7700b6c3c096d5cd7999c34e7614f7-C001-37",
"7700b6c3c096d5cd7999c34e7614f7-C001-48"
]
},
"@MOT@": {
"gold_contexts": [
[
"7700b6c3c096d5cd7999c34e7614f7-C001-15",
"7700b6c3c096d5cd7999c34e7614f7-C001-16",
"7700b6c3c096d5cd7999c34e7614f7-C001-17",
"7700b6c3c096d5cd7999c34e7614f7-C001-18",
"7700b6c3c096d5cd7999c34e7614f7-C001-19"
]
],
"cite_sentences": [
"7700b6c3c096d5cd7999c34e7614f7-C001-15"
]
},
"@EXT@": {
"gold_contexts": [
[
"7700b6c3c096d5cd7999c34e7614f7-C001-15",
"7700b6c3c096d5cd7999c34e7614f7-C001-16",
"7700b6c3c096d5cd7999c34e7614f7-C001-17",
"7700b6c3c096d5cd7999c34e7614f7-C001-18",
"7700b6c3c096d5cd7999c34e7614f7-C001-19"
]
],
"cite_sentences": [
"7700b6c3c096d5cd7999c34e7614f7-C001-15"
]
},
"@DIF@": {
"gold_contexts": [
[
"7700b6c3c096d5cd7999c34e7614f7-C001-26"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-48",
"7700b6c3c096d5cd7999c34e7614f7-C001-49"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-217"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-225",
"7700b6c3c096d5cd7999c34e7614f7-C001-226"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-246"
]
],
"cite_sentences": [
"7700b6c3c096d5cd7999c34e7614f7-C001-26",
"7700b6c3c096d5cd7999c34e7614f7-C001-48",
"7700b6c3c096d5cd7999c34e7614f7-C001-217",
"7700b6c3c096d5cd7999c34e7614f7-C001-225",
"7700b6c3c096d5cd7999c34e7614f7-C001-226",
"7700b6c3c096d5cd7999c34e7614f7-C001-246"
]
},
"@SIM@": {
"gold_contexts": [
[
"7700b6c3c096d5cd7999c34e7614f7-C001-42"
]
],
"cite_sentences": [
"7700b6c3c096d5cd7999c34e7614f7-C001-42"
]
},
"@USE@": {
"gold_contexts": [
[
"7700b6c3c096d5cd7999c34e7614f7-C001-110"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-159"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-163"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-200"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-202"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-203"
],
[
"7700b6c3c096d5cd7999c34e7614f7-C001-217"
]
],
"cite_sentences": [
"7700b6c3c096d5cd7999c34e7614f7-C001-110",
"7700b6c3c096d5cd7999c34e7614f7-C001-159",
"7700b6c3c096d5cd7999c34e7614f7-C001-163",
"7700b6c3c096d5cd7999c34e7614f7-C001-200",
"7700b6c3c096d5cd7999c34e7614f7-C001-202",
"7700b6c3c096d5cd7999c34e7614f7-C001-203",
"7700b6c3c096d5cd7999c34e7614f7-C001-217"
]
}
}
},
"ABC_a8ba807b94f6f7ff4f7e77a9fcde35_2": {
"x": [
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-97",
"text": "The deletion model (cf."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-120",
"text": "**ESTIMATING THE PARAMETERS**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-190",
"text": "to fluency (Is the simplified output fluent and grammatical?)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-85",
"text": "In 1964 Given the input DRS shown in Figure 1 , simplification proceeds as follows."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-86",
"text": "Splitting."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-87",
"text": "The splitting candidates of a DRS are event pairs contained in that DRS."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-121",
"text": "We use the EM algorithm (Dempster et al., 1977) to estimate our split and deletion model parameters."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-2",
"text": "We present a hybrid approach to sentence simplification which combines deep semantics and monolingual machine translation to derive simple sentences from complex ones."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-3",
"text": "The approach differs from previous work in two main ways."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-4",
"text": "First, it is semantic based in that it takes as input a deep semantic representation rather than e.g., a sentence or a parse tree."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-5",
"text": "Second, it combines a simplification model for splitting and deletion with a monolingual translation model for phrase substitution and reordering."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-6",
"text": "When compared against current state of the art methods, our model yields significantly simpler output that is both grammatical and meaning preserving."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-9",
"text": "Sentence simplification maps a sentence to a simpler, more readable one approximating its content."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-10",
"text": "Typically, a simplified sentence differs from a complex one in that it involves simpler, more usual and often shorter, words (e.g., use instead of exploit); simpler syntactic constructions (e.g., no relative clauses or apposition); and fewer modifiers (e.g., He slept vs. He also slept)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-11",
"text": "In practice, simplification is thus often modeled using four main operations: splitting a complex sentence into several simpler sentences; dropping and reordering phrases or constituents; substituting words/phrases with simpler ones."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-12",
"text": "As has been argued in previous work, sentence simplification has many potential applications."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-13",
"text": "It is useful as a preprocessing step for a variety of NLP systems such as parsers and machine translation systems (Chandrasekar et al., 1996) , summarisation (Knight and Marcu, 2000) , sentence fusion (Filippova and Strube, 2008 ) and semantic role labelling (Vickrey and Koller, 2008) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-14",
"text": "It also has wide ranging potential societal application as a reading aid for people with aphasis (Carroll et al., 1999) , for low literacy readers (Watanabe et al., 2009 ) and for non native speakers (Siddharthan, 2002) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-15",
"text": "There has been much work recently on developing computational frameworks for sentence simplification."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-16",
"text": "Synchronous grammars have been used in combination with linear integer programming to generate and rank all possible rewrites of an input sentence (Dras, 1999; Woodsend and Lapata, 2011) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-17",
"text": "Machine Translation systems have been adapted to translate complex sentences into simple ones (Zhu et al., 2010; Wubben et al., 2012; Coster and Kauchak, 2011) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-18",
"text": "And handcrafted rules have been proposed to model the syntactic transformations involved in simplifications (Siddharthan et al., 2004; Siddharthan, 2011; Chandrasekar et al., 1996) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-19",
"text": "In this paper, we present a hybrid approach to sentence simplification which departs from this previous work in two main ways."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-20",
"text": "First, it combines a model encoding probabilities for splitting and deletion with a monolingual machine translation module which handles reordering and substitution."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-88",
"text": "More precisely, the splitting candidates are pairs 4 of event variables associated with at least one of the core thematic roles (e.g., agent and patient)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-89",
"text": "The features conditioning a split are the set of thematic roles associated with each event variable."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-90",
"text": "The DRS shown in Figure 1 contains three such event variables X 3 , X 11 and X 10 with associated thematic role sets {agent, in, in, patient}, {agent, patient} and {agent, for, patient} respectively."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-91",
"text": "Hence, there are 3 splitting candidates (X 3 -X 11 , X 3 -X 10 and X 10 -X 11 ) and 4 split options: no split or split at one of the splitting candidates."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-92",
"text": "Here the split with highest probability (cf. Table 1) is chosen and the DRS is split into two sub-DRSs, one containing X 3, and the other containing X 10."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-94",
"text": "After splitting, dangling subgraphs are attached to the root of the new subgraph maximizing either proximity or position overlap."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-95",
"text": "Here the graph rooted in X 11 is attached to the root dominating X 3 and the orphan word O 1 to the root dominating X 10 ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-96",
"text": "Deletion."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-21",
"text": "In this way, we exploit the ability of statistical machine translation (SMT) systems to capture phrasal/lexical substitution and reordering while relying on a dedicated probabilistic module to capture the splitting and deletion operations which are less well (deletion) or not at all (splitting) captured by SMT approaches."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-22",
"text": "Second, our approach is semantic based."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-23",
"text": "While previous simplification approaches start from either the input sentence or its parse tree, our model takes as input a deep semantic representation, namely the Discourse Representation Structure (DRS, (Kamp, 1981)) assigned by Boxer (Curran et al., 2007) to the input complex sentence."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-24",
"text": "As we shall see in Section 4, this permits a linguistically principled account of the splitting operation in that semantically shared elements are taken to be the basis for splitting a complex sentence into several simpler ones; this facilitates completion (the re-creation of the shared element in the split sentences); and this provides a natural means to avoid deleting obligatory arguments."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-25",
"text": "When compared against current state of the art methods (Zhu et al., 2010; Woodsend and Lapata, 2011; Wubben et al., 2012) , our model yields significantly simpler output that is both grammatical and meaning preserving."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-27",
"text": "**RELATED WORK**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-28",
"text": "Earlier work on sentence simplification relied on handcrafted rules to capture syntactic simplification e.g., to split coordinated and subordinated sentences into several, simpler clauses or to model active/passive transformations (Siddharthan, 2002; Chandrasekar and Srinivas, 1997; Bott et al., 2012; Canning, 2002; Siddharthan, 2011; Siddharthan, 2010) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-29",
"text": "While these handcrafted approaches can encode precise and linguistically well-informed syntactic transformations (using e.g., detailed morphological and syntactic information), they are limited in scope to purely syntactic rules and do not account for lexical simplifications and their interaction with the sentential context."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-30",
"text": "Using the parallel dataset formed by Simple English Wikipedia (SWKP) 1 and traditional English Wikipedia (EWKP) 2 , more recent work has focused on developing machine learning approaches to sentence simplification."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-31",
"text": "Zhu et al. (2010) constructed a parallel corpus (PWKP) of 108,016/114,924 complex/simple sentences by aligning sentences from EWKP and SWKP and used the resulting bitext to train a simplification model inspired by syntax-based machine translation (Yamada and Knight, 2001 )."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-32",
"text": "Their simplification model encodes the probabilities for four rewriting operations on the parse tree of an input sentence, namely substitution, reordering, splitting and deletion."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-33",
"text": "It is combined with a language model to improve grammaticality and the decoder translates sentences into simpler ones by greedily selecting the output sentence with highest probability."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-34",
"text": "Using both the PWKP corpus developed by Zhu et al. (2010) and the edit history of Simple Wikipedia, Woodsend and Lapata (2011) learn a quasi synchronous grammar (Smith and Eisner, 2006) describing a loose alignment between parse trees of complex and of simple sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-35",
"text": "Following Dras (1999) , they then generate all possible rewrites for a source tree and use integer linear programming to select the most appropriate simplification."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-36",
"text": "They evaluate their model on the same dataset used by Zhu et al. (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-37",
"text": "They also conducted a human evaluation on 64 of the 100 test sentences and showed again a better performance in terms of simplicity, grammaticality and meaning preservation."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-38",
"text": "In (Wubben et al., 2012; Coster and Kauchak, 2011) , simplification is viewed as a monolingual translation task where the complex sentence is the source and the simpler one is the target."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-39",
"text": "To account for deletions, reordering and substitution, Coster and Kauchak (2011) trained a phrase based machine translation system on the PWKP corpus while modifying the word alignment output by GIZA++ in Moses to allow for null phrasal alignments."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-40",
"text": "In this way, they allow for phrases to be deleted during translation."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-41",
"text": "No human evaluation is provided but the approach is shown to result in statistically significant improvements over a traditional phrase based approach."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-42",
"text": "Similarly, Wubben et al. (2012) use Moses and the PWKP data to train a phrase based machine translation system augmented with a post-hoc reranking procedure designed to rank the outputs based on their dissimilarity from the source."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-43",
"text": "A human evaluation on 20 sentences randomly selected from the test data indicates that, in terms of fluency and adequacy, their system is judged to outperform both Zhu et al. (2010) and Woodsend and Lapata (2011) systems."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-45",
"text": "**SIMPLIFICATION FRAMEWORK**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-46",
"text": "We start by motivating our approach and explaining how it relates to previous proposals w.r.t. the four main operations involved in simplification, namely splitting, deletion, substitution and reordering."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-47",
"text": "We then introduce our framework."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-48",
"text": "Sentence Splitting."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-49",
"text": "Sentence splitting is arguably semantic based in that in many cases, splitting occurs when the same semantic entity participates in two distinct eventualities."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-50",
"text": "For instance, in example (1) below, the split is on the noun bricks which is involved in two eventualities namely, \"being resistant to cold\" and \"enabling the construction of permanent buildings\"."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-51",
"text": "(1) C. Being more resistant to cold, bricks enabled the construction of permanent buildings."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-52",
"text": "S. Bricks were more resistant to cold."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-53",
"text": "Bricks enabled the construction of permanent buildings."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-54",
"text": "While splitting opportunities have a clear counterpart in syntax (i.e., splitting often occurs whenever a relative, a subordinate or an appositive clause occurs in the complex sentence), completion i.e., the reconstruction of the shared element in the second simpler clause, is arguably semantically governed in that the reconstructed element corefers with its matching phrase in the first simpler clause."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-55",
"text": "While our semantic based approach naturally accounts for this by copying the phrase corresponding to the shared entity in both phrases, syntax based approaches such as Zhu et al. (2010) and Woodsend and Lapata (2011) will often fail to appropriately reconstruct the shared phrase and introduce agreement mismatches, because the alignments or rules they learn are based on syntax alone."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-56",
"text": "For instance, in example (2), Zhu et al. (2010) fails to copy the shared argument \"The judge\" to the second clause whereas Woodsend and Lapata (2011) learns a synchronous rule matching (VP and VP) to (VP."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-57",
"text": "NP(It) VP) thereby failing to produce the correct subject pronoun (\"he\" or \"she\") for the antecedent \"The judge\"."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-58",
"text": "(2) C. The judge ordered that Chapman should receive psychiatric treatment in prison and sentenced him to twenty years to life."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-59",
"text": "S1."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-60",
"text": "The judge ordered that Chapman should get psychiatric treatment."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-61",
"text": "In prison and sentenced him to twenty years to life."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-62",
"text": "(Zhu et al., 2010) S2."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-63",
"text": "The judge ordered that Chapman should receive psychiatric treatment in prison."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-64",
"text": "It sentenced him to twenty years to life."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-65",
"text": "(Woodsend and Lapata, 2011) Deletion."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-66",
"text": "By handling deletion using a probabilistic model trained on semantic representations, we can avoid deleting obligatory arguments."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-67",
"text": "Thus in our approach, semantic subformulae which are related to a predicate by a core thematic role (e.g., agent and patient) are never considered for deletion."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-68",
"text": "By contrast, syntax based approaches (Zhu et al., 2010; Woodsend and Lapata, 2011) do not distinguish between optional and obligatory arguments."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-69",
"text": "For instance Zhu et al. (2010) simplifies (3C) to (3S) thereby incorrectly deleting the obligatory theme (gifts) of the complex sentence and modifying its meaning to giving knights and warriors (instead of giving gifts to knights and warriors)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-70",
"text": "(3) C. Women would also often give knights and warriors gifts that included thyme leaves as it was believed to bring courage to the bearer."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-71",
"text": "S. Women also often give knights and warriors."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-72",
"text": "Gifts included thyme leaves as it was thought to bring courage to the saint."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-73",
"text": "(Zhu et al., 2010) We also depart from Coster and Kauchak (2011) who rely on null phrasal alignments for deletion during phrase based machine translation."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-74",
"text": "In their approach, deletion is constrained by the training data and the possible alignments, independent of any linguistic knowledge."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-75",
"text": "Substitution and Reordering SMT based approaches to paraphrasing (Barzilay and Elhadad, 2003; Bannard and Callison-Burch, 2005) and to sentence simplification (Wubben et al., 2012) have shown that by utilising knowledge about alignment and translation probabilities, SMT systems can account for the substitutions and the reorderings occurring in sentence simplification."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-76",
"text": "Following these approaches, we therefore rely on phrase based SMT to learn substitutions and reorderings."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-77",
"text": "In addition, the language model we integrate in the SMT module helps ensure better fluency and grammaticality."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-79",
"text": "**AN EXAMPLE**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-80",
"text": "Figure 1 shows how our approach simplifies (4C) into (4S)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-81",
"text": "The DRS for (4C) produced using Boxer (Curran et al., 2007 ) is shown at the top of the Figure and a graph representation 3 of the dependencies between its variables is shown immediately below."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-82",
"text": "Each DRS variable labels a node in the graph and each edge is labelled with the relation holding between the variables labelling its end vertices."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-83",
"text": "[Figure 1 (partial): the graph lists DRS conditions over the variables, e.g., new(X9), massive(X9), spin-zero(X9), boson(X9), predict(X10), event(X10), describe(X11), event(X11), first(X12), time(X12) and agent(X10, X8).]"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-98",
"text": "A deletion model (cf. Table 2) regulates the deletion of relations and their associated subgraphs; of adjectives and adverbs; and of orphan words."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-99",
"text": "Here, the relations in (between X 3 and X 4) and for (between X 10 and X 12) are deleted, resulting in the deletion of the phrases \"in Physical Review Letters\" and \"for the first time\", as well as the adjectives second, massive, spin-zero and the orphan word which."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-100",
"text": "(Footnote 4: The splitting candidates could be sets of event variables depending on the number of splits required; here, we consider pairs for 2 splits.)"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-102",
"text": "Substitution and Reordering."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-103",
"text": "Finally, the translation and language models ensure that published, describing and boson are simplified to wrote, explaining and elementary particle respectively; and that the phrase \"In 1964\" is moved from the beginning of the sentence to its end."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-105",
"text": "**THE SIMPLIFICATION MODEL**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-106",
"text": "Our simplification framework consists of a probabilistic model for splitting and dropping which we call DRS simplification model (DRS-SM); a phrase based translation model for substitution and reordering (PBMT); and a language model learned on Simple English Wikipedia (LM) for fluency and grammaticality."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-107",
"text": "Given a complex sentence c, we split the simplification process into two steps."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-108",
"text": "First, DRS-SM is applied to D_c (the DRS representation of the complex sentence c) to produce one or more (in case of splitting) intermediate simplified sentence(s) s\u2032."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-109",
"text": "Second, the simplified sentence(s) s\u2032 is further simplified to s using a phrase based machine translation system (PBMT+LM)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-110",
"text": "Hence, our model can be formally defined as: \u015d = argmax_s p(s\u2032|D_c) p(s\u2032|s) p(s),"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-111",
"text": "where the probabilities p(s\u2032|D_c), p(s\u2032|s) and p(s) are given by the DRS simplification model, the phrase based machine translation model and the language model respectively."
},
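{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-111a",
"text": "[Illustrative sketch in our own notation, not from the original text: for a complex input c, each candidate output s is scored by the product p(s\u2032|D_c) \u00d7 p(s\u2032|s) \u00d7 p(s), so a candidate requiring an implausible split or deletion (low p(s\u2032|D_c)) or yielding disfluent output (low p(s)) is ranked below a fluent candidate derived through a probable split and deletion sequence.]"
},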
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-112",
"text": "To get the DRS simplification model, we combine the probability of splitting with the probability of deletion: p(s\u2032|D_c) = \u2211_{\u03b8 : str(\u03b8(D_c)) = s\u2032} p(\u03b8|D_c),"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-113",
"text": "where \u03b8 is a sequence of simplification operations and str(\u03b8(D_c)) is the sequence of words associated with a DRS resulting from simplifying D_c using \u03b8."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-114",
"text": "The probability of a splitting operation for a given DRS D_c is: SPLIT(true, sp_cand) if D_c is split at the splitting candidate sp_cand, and \u03a0_cand SPLIT(false, cand) over all splitting candidates of D_c otherwise. That is, if the DRS is split on the splitting candidate sp_cand, the probability of the split is given by the SPLIT table (Table 1) for the isSplit value \"true\" and the split candidate sp_cand; else it is the product of the probabilities given by the SPLIT table for the isSplit value \"false\" for all split candidates considered for D_c."
},
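{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-114a",
"text": "[Worked example, recomputed from the example of Section 4 rather than taken from the original text: for the DRS of Figure 1 with splitting candidates X 3-X 11, X 3-X 10 and X 10-X 11, a split at X 3-X 10 receives probability SPLIT(true, X 3-X 10), while the no-split option receives SPLIT(false, X 3-X 11) \u00d7 SPLIT(false, X 3-X 10) \u00d7 SPLIT(false, X 10-X 11).]"
},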
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-115",
"text": "As mentioned above, the features used for determining the split operation are the role sets associated with pairs of event variables (cf. Table 1 )."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-116",
"text": "The deletion probability is given by three models: a model for relations determining the deletion of prepositional phrases; a model for modifiers (adjectives and adverbs) and a model for orphan words (Table 2 )."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-117",
"text": "All three deletion models use the associated word itself as a feature."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-118",
"text": "In addition, the model for relations uses the PP length-range as a feature, while the model for orphan words relies on boundary information, i.e., whether or not the orphan word occurs at the associated sentence boundary."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-122",
"text": "For an efficient implementation of the EM algorithm, we follow the work of Yamada and Knight (2001) and Zhu et al. (2010), and build training graphs (Figure 2) from the complex-simple sentence pairs in the training data."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-123",
"text": "Each training graph represents a complex-simple sentence pair and consists of two types of nodes: major nodes (M-nodes) and operation nodes (O-nodes)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-124",
"text": "Each deletion candidate creates a deletion O-node marking successful or failed deletion of the candidate and a result M-node."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-125",
"text": "The deletion process continues on the result M-node until there is no deletion candidate left to process."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-126",
"text": "The governing criterion for the construction of the training graph is that, at each step, it tries to minimize the Levenshtein edit distance between the complex and the simple sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-127",
"text": "Moreover, for the splitting operation, we introduce a split only if the reference sentence consists of several sentences (i.e., there is a split in the training data); and we only consider splits which maximise the overlap between the split sentences and the simple reference sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-128",
"text": "We initialize our probability tables (Table 1 and Table 2) with the uniform distribution, i.e., 0.5, because all our features are binary."
},
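{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-128a",
"text": "[Hypothetical illustration, not from the original text: starting from SPLIT(true, sp_cand) = 0.5, if an EM iteration observes a given role-set feature on successful-split O-nodes in 30 training graphs and on failed-split O-nodes in 10, the re-estimated probability for that feature becomes 30/(30+10) = 0.75.]"
},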
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-129",
"text": "The EM algorithm iterates over training graphs counting model features from O-nodes and updating our probability tables."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-130",
"text": "Because of space constraints, we do not describe our algorithm in detail."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-131",
"text": "We refer the reader to (Yamada and Knight, 2001 ) for more details."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-132",
"text": "Our phrase based translation model is trained using the Moses toolkit 5 with its default command line options on the PWKP corpus (except the sentences from the test set) considering the complex sentence as the source and the simpler one as the target."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-133",
"text": "Our trigram language model is trained using the SRILM toolkit 6 on the SWKP corpus 7 ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-134",
"text": "Decoding."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-135",
"text": "We explore the decoding graph in the same way as the training graph, but in a greedy fashion, always picking the choice with maximal probability."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-136",
"text": "Given a complex input sentence c, a split O-node will be selected, corresponding to the decision of whether to split and where to split."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-137",
"text": "Next, deletion O-nodes are selected indicating whether or not to drop each of the deletion candidates."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-138",
"text": "The DRS associated with the final M-node D_fin is then mapped to a simplified sentence s\u2032_fin which is further simplified using the phrase-based machine translation system to produce the final simplified sentence s_simple."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-140",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-141",
"text": "We trained our simplification and translation models on the PWKP corpus."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-142",
"text": "To evaluate performance, we compare our approach with three other state of the art systems using the test set provided by Zhu et al. (2010) and relying both on automatic metrics and on human judgments."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-144",
"text": "**TRAINING AND TEST DATA**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-145",
"text": "The DRS-Based simplification model is trained on PWKP, a bi-text of complex and simple sentences provided by Zhu et al. (2010) ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-146",
"text": "To construct this bi-text, Zhu et al. (2010) extracted complex and simple sentences from EWKP and SWKP respectively and automatically aligned them using TF*IDF as a similarity measure."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-147",
"text": "PWKP contains 108,016/114,924 complex/simple sentence pairs."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-148",
"text": "We tokenize PWKP using Stanford CoreNLP toolkit 8 ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-149",
"text": "We then parse all complex sentences in PWKP using Boxer 9 to produce their DRSs."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-150",
"text": "Finally, our DRS-Based simplification model is trained on 97.75% of PWKP; we leave out the 2.25% of complex sentences in PWKP which are repeated in the test set or for which Boxer fails to produce DRSs."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-151",
"text": "We evaluate our model on the test set used by Zhu et al. (2010) namely, an aligned corpus of 100/131 EWKP/SWKP sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-152",
"text": "Boxer produces a DRS for 96 of the 100 input sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-153",
"text": "These inputs are simplified using our simplification system, namely the DRS-SM model and the phrase-based machine translation system (Section 3.2)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-154",
"text": "For the remaining four complex sentences, Boxer fails to produce DRSs."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-155",
"text": "These four sentences are directly sent to the phrase-based machine translation system to produce simplified sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-157",
"text": "**AUTOMATIC EVALUATION METRICS**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-158",
"text": "To assess and compare simplification systems, two main automatic metrics have been used in previous work namely, BLEU and the Flesch-Kincaid Grade Level Index (FKG)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-159",
"text": "The FKG index is a readability metric taking into account the average sentence length in words and the average word length in syllables."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-160",
"text": "In its original context (language learning), it was applied to well formed text and thus measured the simplicity of a well formed sentence."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-161",
"text": "In the context of the simplification task however, the automatically generated sentences are not necessarily well formed so that the FKG index reduces to a measure of the sentence length (in terms of words and syllables) approximating the simplicity level of an output sentence irrespective of the length of the corresponding input."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-162",
"text": "To assess simplification, we instead use metrics that are directly related to the simplification task, namely: the number of splits in the overall (test and training) data and on average per sentence; the number of generated sentences with no edits, i.e., which are identical to the original complex one; and the average Levenshtein distance between the system's output and both the complex and the simple reference sentences."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-163",
"text": "BLEU gives a measure of how close a system's output is to the gold standard simple sentence."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-164",
"text": "Because there are many possible ways of simplifying a sentence, BLEU alone fails to correctly assess the appropriateness of a simplification."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-165",
"text": "Moreover BLEU does not capture the degree to which the system's output differs from the complex sentence input."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-189",
"text": "They were also asked to rate the second (simplified) sentence(s) of the pair w.r.t."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-166",
"text": "We therefore use BLEU as a means to evaluate how close the systems' outputs are to the reference corpus, but complement it with further manual metrics capturing other factors important in evaluating simplifications, such as the fluency and the adequacy of the output sentences and the degree to which the output sentence simplifies the input."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-168",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-169",
"text": "Number of Splits. Table 3 shows the proportion of inputs whose simplification involved a splitting operation."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-170",
"text": "While our system splits in proportion similar to that observed in the training data, the other systems either split very often (80% of the time for Zhu and 63% of the time for Woodsend) or not at all (Wubben)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-171",
"text": "In other words, when compared to the other systems, our system performs splits in proportion closest to the reference both in terms of total number of splits and of average number of splits per sentence."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-172",
"text": "Zhu, Woodsend and Wubben denote the systems of Zhu et al. (2010), Woodsend and Lapata (2011) and Wubben et al. (2012) respectively; Hybrid is our model."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-173",
"text": "Table 4 indicates the edit distance of the output sentences w.r.t. both the complex and the simple reference sentences, as well as the number of inputs for which no simplification occurs."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-175",
"text": "The right part of the table shows that our system generate simplifications which are closest to the reference sentence (in terms of edits) compared to those output by the other systems."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-176",
"text": "It also produces the highest number of simplifications which are identical to the reference."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-177",
"text": "Conversely our system only ranks third in terms of dissimilarity with the input complex sentences (6.32 edits away from the input sentence) behind the Woodsend (8.63 edits) and the Zhu (7.87 edits) system."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-178",
"text": "This is in part due to the difference in splitting strategies noted above : the many splits applied by these latter two systems correlate with a high number of edits."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-179",
"text": "Table 4 show that our system produces simplifications that are closest to the reference."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-180",
"text": "In sum, the automatic metrics indicate that our system produces simplification that are consistently closest to the reference in terms of edit distance, number of splits and BLEU score."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-182",
"text": "**NUMBER OF EDITS**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-184",
"text": "**HUMAN EVALUATION**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-185",
"text": "The human evaluation was done online using the LG-Eval toolkit (Kow and Belz, 2012) 11 ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-186",
"text": "The evaluators were allocated a trial set using a Latin Square Experimental Design (LSED) such that each evaluator sees the same number of output from each system and for each test set item."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-187",
"text": "During the experiment, the evaluators were presented with a pair of a complex and a simple sentence(s) and asked to rate this pair w.r.t."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-188",
"text": "to adequacy (Does the simplified sentence(s) preserve the meaning of the input?) and simplification (Does the generated sentence(s) simplify the complex input?)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-191",
"text": "Similar to the Wubben's human evaluation setup, we randomly selected 20 complex sentences from Zhu's test corpus and included in the evaluation corpus: the corresponding simple (Gold) sentence from Zhu's test corpus, the output of our system (Hybrid) and the output of the other three systems (Zhu, Woodsend and Wubben) which were provided to us by the system authors."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-192",
"text": "The evaluation data thus consisted of 100 complex/simple pairs."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-193",
"text": "We collected ratings from 27 participants."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-194",
"text": "Table 5 : Average Human Ratings for simplicity, fluency and adequacy Table 5 shows the average ratings of the human evaluation on a slider scale from 0 to 5."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-195",
"text": "Pairwise comparisons between all models and their statistical significance were carried out using a one-way ANOVA with post-hoc Tukey HSD tests and are shown in Table 6 ."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-196",
"text": "With regard to simplification, our system ranks first and is very close to the manually simplified input (the difference is not statistically significant)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-197",
"text": "The low rating for Woodsend reflects the high number of unsimplified sentences (24/100 in the test data used for the automatic evaluation and 6/20 in the evaluation data used for human judgments)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-198",
"text": "Our system data is not significantly different from the manually simplified data for simplicity whereas all other systems are."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-199",
"text": "For fluency, our system rates second behind Wubben and before Woodsend and Zhu."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-200",
"text": "The difference between our system and both Zhu and Woodsend system is significant."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-201",
"text": "In particular, Zhu's output is judged less fluent probably because of the many incorrect splits it licenses."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-202",
"text": "Manual examination of the data shows that Woodsend's system also produces incorrect splits."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-203",
"text": "For this system however, the high proportion of non simplified sentences probably counterbalances these incorrect splits, allowing for a good fluency score overall."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-204",
"text": "Regarding adequacy, our system is against closest to the reference (3.50 for our system vs. 3.66 for manual simplification)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-205",
"text": "Our system, the Wubben system and the manual simplifications are in the same group (the differences between these systems are not significant)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-206",
"text": "The Woodsend system comes second and the Zhu system third (the difference between the two is significant)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-207",
"text": "Wubben's high fluency, high adequacy but low simplicity could be explained with their minimal number of edit (3.33 edits) from the source sentence."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-208",
"text": "In sum, if we group together systems for which there is no significant difference, our system ranks first (together with GOLD) for simplicity; first for fluency (together with GOLD and Wubben); and first for adequacy (together with GOLD and Wubben)."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-210",
"text": "**CONCLUSION**"
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-211",
"text": "A key feature of our approach is that it is semantically based."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-212",
"text": "Typically, discourse level simplification operations such as sentence splitting, sentence reordering, cue word selection, referring expression generation and determiner choice are semantically constrained."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-213",
"text": "As argued by Siddharthan (2006) , correctly capturing the interactions between these phenomena is essential to ensuring text cohesion."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-214",
"text": "In the future, we would like to investigate how our framework deals with such discourse level simplifications i.e., simplifications which involves manipulation of the coreference and of the discourse structure."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-215",
"text": "In the PWKP data, the proportion of split sentences is rather low (6.1 %) and many of the split sentences are simple sentence coordination splits."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-216",
"text": "A more adequate but small corpus is that used in (Siddharthan, 2006) which consists of 95 cases of discourse simplification."
},
{
"sent_id": "a8ba807b94f6f7ff4f7e77a9fcde35-C001-217",
"text": "Using data from the language learning or the children reading community, it would be interesting to first construct a similar, larger scale corpus; and to then train and test our approach on more complex cases of sentence splitting."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-17"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-34"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-36"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-43"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-146"
]
],
"cite_sentences": [
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-17",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-34",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-36",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-43",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-146"
]
},
"@EXT@": {
"gold_contexts": [
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-17",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-18",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-19"
]
],
"cite_sentences": [
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-17"
]
},
"@DIF@": {
"gold_contexts": [
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-25"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-55",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-56"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-68"
]
],
"cite_sentences": [
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-25",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-55",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-56",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-68"
]
},
"@USE@": {
"gold_contexts": [
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-36"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-122"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-142"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-145"
],
[
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-151"
]
],
"cite_sentences": [
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-36",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-122",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-142",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-145",
"a8ba807b94f6f7ff4f7e77a9fcde35-C001-151"
]
}
}
},
"ABC_366231b855f226f63d637e6b2e1667_2": {
"x": [
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-120",
"text": "**DEPENDENCY EVALUATION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-2",
"text": "Recent parsing research has started addressing the questions a) how parsers trained on different syntactic resources differ in their performance and b) how to conduct a meaningful evaluation of the parsing results across such a range of syntactic representations."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-3",
"text": "Two German treebanks, Negra and T\u00fcBa-D/Z, constitute an interesting testing ground for such research given that the two treebanks make very different representational choices for this language, which also is of general interest given that German is situated between the extremes of fixed and free word order."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-4",
"text": "We show that previous work comparing PCFG parsing with these two treebanks employed PARSEVAL and grammatical function comparisons which were skewed by differences between the two corpus annotation schemes."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-5",
"text": "Focusing on the grammatical dependency triples as an essential dimension of comparison, we show that the two very distinct corpora result in comparable parsing performance."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-8",
"text": "Syntactically annotated corpora have been produced for a range of languages and they differ significantly regarding which language properties are encoded and how they are represented."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-9",
"text": "Between the two extremes of constituency treebanks for English and dependency treebanks for free word order languages such as Czech lie languages such as German, for which two different treebanks have explored different options for encoding topology and dependency, Negra (Brants et al., 1999) and T\u00fcBa-D/Z (Telljohann et al., 2005) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-10",
"text": "Recent research has started addressing the question of how parsers trained on these different syntactic resources differ in their performance."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-11",
"text": "Such work must also address the question of how to conduct a meaningful evaluation of the parsing results across such a range of syntactic representations."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-12",
"text": "In this paper, we show that previous work comparing PCFG parsing for the two German treebanks used representations which cannot adequately be compared using the given PARSEVAL measures and that a grammatical dependency evaluation is more meaningful than the grammatical function evaluation provided."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-13",
"text": "We present the first comparison of Negra and T\u00fcBa-D/Z using a labeled dependency evaluation based on the grammatical function labels provided in the corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-14",
"text": "We show that, in contrast to previous literature, a labeled dependency evaluation establishes that PCFG parsers trained on the two corpora give similar parsing performance."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-15",
"text": "The focus on labeled dependencies also provides a direct link to recent work on dependency-based evaluation (e.g., Clark and Curran, 2007) and dependency parsing (e.g., CoNLL shared tasks 2006, 2007)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-17",
"text": "**PREVIOUS WORK**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-18",
"text": "The question of how to evaluate parser output has naturally already arisen in earlier work on parsing English."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-19",
"text": "As discussed by Lin (1995) and others, the PARSEVAL evaluation typically used to analyze the performance of statistical parsing models has many drawbacks."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-20",
"text": "Bracketing evaluation may count a single error multiple times and does not differentiate between errors that significantly affect the interpretation of the sentence and those that are less crucial."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-21",
"text": "It also does not allow for evaluation of particular syntactic structures or provide meaningful information about where the parser is failing."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-22",
"text": "In addition, and most directly relevant for this paper, PARSE-VAL scores are difficult to compare across syntactic annotation schemes (Carroll et al., 2003) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-23",
"text": "At the same time, previous research on PCFG parsing using treebank training data present PAR-SEVAL measures in comparing the parsing performance for different languages and annotation schemes, reporting a number of striking differences."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-24",
"text": "For example, Levy and Manning (2003) , K\u00fcbler (2005) , and K\u00fcbler et al. (2006) highlight the significant effect of language properties and annotation schemes for German and Chinese treebanks."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-25",
"text": "In related work, parser enhancements that provide a significant performance boost for English, such as head lexicalization, are reported not to provide the same kind of improvement, if any, for German (Dubey and Keller, 2003; Dubey, 2004; K\u00fcbler et al., 2006) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-26",
"text": "Previous work has compared the similar Negra and Tiger corpora of German to the very different T\u00fcBa-D/Z corpus."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-27",
"text": "K\u00fcbler et al. (2006) compares the Negra and T\u00fcBa-D/Z corpora of German using a PARSEVAL evaluation and an evaluation on core grammatical function labels that is included to address concerns about the PARSEVAL measure."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-28",
"text": "1 Using the Stanford Parser (Klein and Manning, 2002) , which employs a factored PCFG and dependency model, they claim that the model trained on T\u00fcBa-D/Z consistently outperforms that trained on Negra in PARSEVAL and grammatical function evaluations."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-29",
"text": "Dubey (2004) also includes an evaluation on grammatical function for statistical models trained on Negra, but obtains very different results from K\u00fcbler et al. (2006) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-30",
"text": "2 In recent related work, Rehbein and van Genabith (2007a) demonstrate using the Tiger and T\u00fcBa-D/Z 1 The evaluation is based only on the grammatical function; it does not identify the dependency pair that it labels."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-31",
"text": "2 While the focus of K\u00fcbler et al. (2006) is on comparing parsing results across corpora, Dubey (2004) focuses on improving parsing for Negra, including corpus-specific enhancements leading to better results."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-32",
"text": "This difference in focus and additional differences in experimental setup mean that a finegrained comparison of the results is inappropriate -the relevant point here is that the gap between the results (23% for subjects, 35% for accusative objects) warrants further attention in the context of comparing parsing results across corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-33",
"text": "corpora of German that PARSEVAL is inappropriate for comparisons of the output of PCFG parsers trained on different treebank annotation schemes because PARSEVAL scores are affected by the ratio of terminal to non-terminal nodes."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-34",
"text": "A dependencybased evaluation on triples of the form word-POShead shows better results for the parser trained on Tiger even though the much lower PARSEVAL scores, if meaningful, would predict that the output for Tiger is of lower quality."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-35",
"text": "However, their dependency-based evaluation does not make use of the grammatical function labels, which are provided in the corpora and closely correspond to the representations used in recent work on formalismindependent evaluation of parsers (e.g., Clark and Curran, 2007) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-36",
"text": "3 Addressing these issues, we resolve the apparent discrepancy between K\u00fcbler et al. (2006) and Dubey (2004) and establish a firm grammatical function comparison of Negra and T\u00fcBa-D/Z. We also extend the evaluation to a labeled dependency evaluation based on grammatical relations for both corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-37",
"text": "Such an evaluation, which abstracts away from the specifics of the annotation schemes, shows that, in contrast to the claims made in K\u00fcbler et al. (2006) , the parsing results for PCFG parsers trained on these heterogeneous corpora are very similar."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-39",
"text": "**THE CORPORA USED**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-40",
"text": "As motivated in the introduction, the work discussed in this paper is based on two German corpora, Negra and T\u00fcBa-D/Z, which differ significantly in the syntactic representations used -thereby offering an interesting test bed for investigating the influence of an annotation scheme on the parsers trained."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-42",
"text": "**NEGRA**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-43",
"text": "The Negra corpus (Brants et al., 1999) consists of newspaper text from the Frankfurter Rundschau, a German newspaper."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-44",
"text": "Version 2 of the corpus contains 20,602 sentences."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-45",
"text": "It uses the STTS tag set (Schiller et al., 1995) for part-of-speech annotation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-46",
"text": "There are 25 non-terminal node labels and 46 edge labels."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-47",
"text": "The syntactic annotation of Negra combines features from phrase structure grammar and depen-dency grammar using a tree-like syntactic structure with grammatical functions labeled on the edges of the tree."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-48",
"text": "Flat sentence structures are used in many places to avoid attachment ambiguities and nonbranching phrases are not used."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-49",
"text": "The annotation scheme emphasizes the use of the tree structure to encode grammatical dependencies, representing a head and all its dependents within a local tree regardless of whether a dependent is realized near its head or not, e.g., because it has been extraposed or fronted."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-50",
"text": "Since traditional syntax trees do not permit the crossing branches needed to license discontinuous constituents, Negra uses a \"syntax graph\" data structure to represent the annotation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-51",
"text": "An example of a syntax graph with a discontinuous constituent (VP) due to a fronted dative object (NP) is shown in Figure 1 ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-52",
"text": "Negra uses flat NP and PP annotation with no marked heads."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-53",
"text": "For example, both Dieser and Meinung in Figure 1 have the grammatical function label \"NK\"."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-54",
"text": "Since unary branching is not used in Negra, a bare noun or pronoun argument is not dominated by an NP node, as shown by the pronoun ich above."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-55",
"text": "A verbal head in Negra is always marked with the edge label \"HD\" and its arguments are its sisters in the local tree."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-56",
"text": "The subject is always the sister of the finite verb, which is a daughter of S. If the finite verb is the main verb in the clause, the objects are also its sisters, i.e., the finite verb, subject and objects are all daughters of S. If the main verb is an auxiliary governing a non-finite main verb, the non-finite verb and its objects and modifiers form a VP where the objects are sisters of the non-finite verb as in Fig- ure 1."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-57",
"text": "The VP is then a sister of the finite verb."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-58",
"text": "The finite verb in a German declarative clause appears in the so-called verb-second position, immediately following the fronted constituent."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-59",
"text": "As a result, the VP in Negra is discontinuous whenever one of its children has been fronted, as in the common word orders exemplified in (1a) and (1b)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-60",
"text": "(1) The sentence we saw in Figure 1 contains a discontinuous VP with a fronted dative object (Dieser Meinung)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-61",
"text": "The dative object and a modifier (voll) form a VP with the non-finite verb (zustimmen)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-63",
"text": "**T\u00dcBA-D/Z**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-64",
"text": "The T\u00fcBa-D/Z corpus, version 2, (Telljohann et al., 2005) consists of 22,091 sentences of newspaper text from the German newspaper die tageszeitung."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-65",
"text": "Like Negra, it uses the STTS tag set (Schiller et al., 1995) for part-of-speech annotation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-66",
"text": "Syntactically it uses 27 non-terminal node labels and 47 edge labels."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-67",
"text": "The syntactic annotation incorporates a topological field analysis of the German clause (Reis, 1980; H\u00f6hle, 1986) , which segments a sentence into topological units depending on the position of the finite verb (verb-first, verb-second, verb-last) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-68",
"text": "In a verbfirst and verb-second sentence, the finite verb is the left bracket (LK), whereas in a verb-last subordinate clause, the subordinating conjunction occupies that field."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-69",
"text": "In all clauses, the non-finite verb cluster forms the right bracket (VC), and arguments and modifiers can appear in the middle field (MF) between the two brackets."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-70",
"text": "Extraposed material is found to the right of the right bracket, and in a verb-second sentence one constituent appears in the fronted field (VF) preceding the finite verb."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-71",
"text": "By specifying constraints on the elements that can occur in the different fields, the word order in any type of German clause can be concisely characterized."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-72",
"text": "Each clause in the T\u00fcBa-D/Z corpus is divided into topological fields at the top level, and each topological field contains phrase-level annotation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-73",
"text": "An example sentence from T\u00fcBa-D/Z is shown in Figure Edge labels are used to mark heads and grammatical functions, even though it can be nontrivial to figure out which grammatical function belongs to which head given that heads and their arguments often are in separate topological fields."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-74",
"text": "For example, in Figure 2 the subject noun chunk (NX) has the edge label ON (object -nominative) and the object noun chunk has the edge label OA (object -accusative); both are realized within the middle field (MF), while the finite verb (VXFIN) marked as HD (head) is in the left sentence bracket (LK)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-75",
"text": "This issue becomes relevant in section 3.4.2, discussing an evaluation based on labeled dependency triples."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-76",
"text": "Where Negra uses discontinuous constituents, T\u00fcBa-D/Z uses special edge labels to annotate grammatical relations which are not locally realized."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-77",
"text": "For example, the fronted prepositional phrase (PX) in Figure 2 has the edge label OA-MOD which needs to be matched with the noun phrase (NX) with label OA that is found in the MF field."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-79",
"text": "**COMPARING NEGRA AND T\u00dcBA-D/Z**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-80",
"text": "To give an impression of how the different annotation schemes affect the appearance of a typical tree in the two corpora, Table 1 While the sentences in Negra and T\u00fcBa-D/Z on average have the same number of words, the average T\u00fcBa-D/Z sentence has nearly three times as many non-terminal nodes as the average Negra sentence."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-81",
"text": "This difference is mainly due to the extra level of topological fields annotation and the use of more contoured structures in many places where Negra uses flatter structures."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-83",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-84",
"text": "The goal of the following experiments is a comparison of parsing performance across different types of evaluation metrics for parsers trained on Negra (Ver. 2) and T\u00fcBa-D/Z (Ver. 2)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-86",
"text": "**DATA PREPARATION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-87",
"text": "Following K\u00fcbler et al. (2006) , only sentences with fewer than 35 words were used, which results in 20,002 sentences for Negra and 21,365 sentences for T\u00fcBa-D/Z. Because punctuation is not attached within the sentence in the corpus annotation, punctuation was removed."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-88",
"text": "To be able to train PCFG parsing models, it is necessary to convert the syntax graphs encoding trees with discontinuities in Negra into traditional syntax trees."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-89",
"text": "Around 30% of sentences in Negra contain at least one discontinuity."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-90",
"text": "To remove discontinuities, we used the conversion program included with the Negra corpus annotation tools (Brants and Plaehn, 2000) , the same tool used in K\u00fcbler et al. (2006) , which raises non-head elements to a higher tree until there are no more discontinuities."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-91",
"text": "For example, for the discontinuous tree with a fronted object we saw in Figure 1 , the PP containing the fronted NP Dieser Meinung is raised to become a daughter of the top S node."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-92",
"text": "4 Additionally, the edge labels used in both corpora need to be folded into the node labels to become a part of context-free grammar rules used by a PCFG parser."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-93",
"text": "In the Penn Treebank-style versions of the corpora appropriate for training a PCFG parser, each edge label is joined with the phrase or POS label on the phrase or word immediately below it."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-94",
"text": "Both corpora include edge labels above all phrases and words."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-95",
"text": "However the flatter structures in Negra result in 39 different edge labels on words while T\u00fcBa-D/Z has only 5."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-96",
"text": "Unlike K\u00fcbler et al. (2006) , which ignored edge labels on words, we incorporate all edge labels present in both corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-97",
"text": "As a consequence of this, providing a parser with perfect lexical tags would also provide the edge label for that word."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-98",
"text": "T\u00fcBa-D/Z does not annotate grammatical functions other than HD on words, but Negra includes many grammatical functions on words."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-99",
"text": "Including edge labels in the perfect lexical tags would artificially boost the results of a grammatical function evaluation for Negra since it amounts to providing the correct grammatical function for the 38% of arguments in Negra that are single words."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-100",
"text": "To avoid this problem, we introduced nonbranching phrasal nodes into Negra to prevent the correct grammatical function label from being provided with the perfect lexical tag in the cases of single-word arguments, which are mostly bare nouns and pronouns."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-101",
"text": "We added phrasal nodes above all single-word subject, accusative object, dative object, and genitive object 5 arguments, with the category of the inserted phrase depending on the POS tag on the word."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-102",
"text": "The introduced phrasal node is given the word's original grammatical function label; the grammatical function label of the word itself becomes NK for NPs and HD for APs and VPs."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-103",
"text": "In total, 14,580 nodes were inserted into Negra in this way."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-104",
"text": "T\u00fcBa-D/Z has non-branching phrases above all single-word arguments, so that no such modification was needed."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-107",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-108",
"text": "We trained unlexicalized PCFG parsing models using LoPar (Schmid, 2000) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-109",
"text": "Unlexicalized models 5 Genitive objects are modified for the sake of consistency among arguments even though there are too few genitive objects to provide reliable results in the evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-110",
"text": "6 The addition of edge labels to terminal POS labels results in 337 lexical tags for Negra and 91 for T\u00fcBa-D/Z. were used to minimize the impact of other corpus differences on parsing."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-111",
"text": "A ten-fold cross validation was performed for all experiments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-114",
"text": "**PARSEVAL EVALUATION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-115",
"text": "As a reference point for comparison with previous work, the PARSEVAL results 8 are given in Table 2 The parser trained on T\u00fcBa-D/Z performs much better than the one trained on Negra on all labeled and unlabeled bracketing scores."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-116",
"text": "As we saw in section 2, Negra and T\u00fcBa-D/Z use very different syntactic annotation schemes, resulting in over 2.5 times as many non-terminals per sentence in T\u00fcBa-D/Z as in Negra with the additional unary nodes."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-117",
"text": "As mentioned previously, Rehbein and van Genabith (2007a) showed that PARSEVAL is affected by the ratio of terminal to non-terminal nodes, so these results are not expected to indicate the quality of the parses."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-118",
"text": "The comparison with grammatical function and dependency evaluations we turn to next showcases that PARSEVAL does not provide a meaningful evaluation metric across annotation schemes."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-121",
"text": "Complementing the issue of the ratio of terminals to non-terminals raised in the last section, one can question whether counting all brackets in the sentence equally, as done by the PARSEVAL metric, provides a good measure of how accurately the basic functor-argument structure of the sentence has been captured in a parse."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-122",
"text": "Thus, it is useful to per-7 Our experimental setup is designed to support a comparison between Negra and T\u00fcBa-D/Z for the three evaluation metrics and is intended to be comparable to the setup of K\u00fcbler et al. (2006) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-123",
"text": "For Negra, Dubey (2004) explores a range of parsing models and the corpus preparation he uses differs from the one discussed in this paper so that a discussion of his results is beyond the scope of the corpus comparison in this paper."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-124",
"text": "8 Scores were calculated using evalb."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-125",
"text": "form an evaluation based on the grammatical function labels that are important for determining the functor-argument structure of the sentence: subjects, accusative objects, and dative objects."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-126",
"text": "9 The first step in an evaluation of functor-argument structure is to identify whether an argument bears the correct grammatical function label."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-128",
"text": "**GRAMMATICAL FUNCTION LABEL EVALUATION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-129",
"text": "K\u00fcbler et al. (2006) present the results shown in Table 3 for the parsing performance of the unlexicalized model of the Stanford Parser (Klein and Manning, 2002) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-130",
"text": "In this grammatical function label evaluation, T\u00fcBa-D/Z outperforms Negra for subjects, accusative objects, and dative objects based on an evaluation of phrasal arguments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-131",
"text": "K\u00fcbler et al. (2006) Note that this grammatical function label evaluation is restricted to labels on phrases; grammatical function labels on words are ignored in training and testing."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-132",
"text": "This results in an unbalanced comparison between Negra and T\u00fcBa-D/Z since, as discussed in section 2, T\u00fcBa-D/Z includes unary-branching phrases above all single-word arguments whereas Negra does not."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-133",
"text": "In effect, single-word arguments in Negra -mainly pronouns and bare nouns -are not considered in the evaluation from K\u00fcbler et al. (2006) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-134",
"text": "The result is thus a comparison of multiword arguments in Negra to both single-and multiword arguments in T\u00fcBa-D/Z. Recall from section 3.1 that this is not a minor difference: single-word arguments account for 38% of subjects, accusative objects, and dative objects in Negra."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-135",
"text": "As discussed in the data preparation section, Negra was modified for our experiment so as not to provide the parser with the grammatical function labels for single word phrases as part of the perfect tags provided."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-136",
"text": "This evaluation handles multiple categories of arguments, not just NPs, so it focuses solely on the grammatical function labels, ignoring the phrasal categories."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-137",
"text": "For example, in Negra an NP-OA in a parse is considered a correct accusative object even if the OA label in the gold standard has the category MPN."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-158",
"text": "It is possible to automatically find an appropriate head verb for all but 2.7% of subjects, accusative objects, and dative objects."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-138",
"text": "The results are shown in Table 4 In contrast to the results for NP grammatical functions of K\u00fcbler et al. (2006) we saw in Table 3 , Negra and T\u00fcBa-D/Z perform quite similarly overall, with Negra slightly outperforming T\u00fcBa-D/Z for all types of arguments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-139",
"text": "These results also form a clear contrast to the PARSEVAL results we saw in Table 2 ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-140",
"text": "Contrary to the finding in K\u00fcbler et al. (2006) , the PAR-SEVAL evaluation does not echo the grammatical function label evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-141",
"text": "In keeping with the results from Rehbein and van Genabith (2007a) , we find that PARSEVAL is not an adequate predictor of performance in an evaluation targeting the functorargument structure of the sentence for comparisons between PCFG parsers trained on corpora with different annotation schemes."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-142",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-143",
"text": "**LABELED DEPENDENCY TRIPLE EVALUATION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-144",
"text": "While determining the grammatical function of an element is an important part of determining the functor-argument structure of a sentence, the other necessary component is determining the head of each function."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-145",
"text": "To evaluate whether both the functor and the argument have been correctly found, an evaluation of labeled dependency triples is needed."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-146",
"text": "As in the previous section, we focus on the grammatical function labels for arguments of verbs."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-147",
"text": "To complete a labeled dependency triple for each argument, we additionally need to locate the lexical verbal head."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-148",
"text": "In Negra, the head is the sister of an argument marked with the function label \"HD\", however heads are only marked for a subset of the phrase categories: S, VP, AP, and AVP."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-149",
"text": "10 This subset includes the phrase categories that contain verbs and their arguments, S and VP."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-150",
"text": "In our experiment, the parser finds the HD grammatical function labels with a very high f-score: 99.5% precision and 96.5% recall."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-151",
"text": "If the sister with the label HD is a word, then that word is the lexical head for the purposes of this dependency evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-152",
"text": "If the sister with the label HD is a phrase, then a recursive search for heads within that phrase finds a lexical head."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-153",
"text": "In 3.2% of cases in the gold standard, it is not possible to find a lexical head for an argument."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-154",
"text": "Further methods could be applied to find the remaining heads heuristically, but we avoid the additional parameters this introduces for this evaluation by ignoring these cases."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-155",
"text": "For T\u00fcBa-D/Z, finding the head is not as simple because the verbal head and its arguments are in different topological fields."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-156",
"text": "To create a parallel comparison to Negra, the finite verb from the local clause is chosen as the head for all subjects."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-157",
"text": "The (finite or non-finite) main full verb is designated as the head for the accusative and dative objects."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-159",
"text": "11 As with Negra, only cases where a head verb can be found in the gold standard are considered in the evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-160",
"text": "As in the grammatical function evaluation in the previous section, only the grammatical function label, not the phrase category is considered in the evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-161",
"text": "The results for the labeled dependency evaluation are shown in Table 5 ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-162",
"text": "The parser trained on Negra outperforms the one trained on T\u00fcBa-D/Z for all types of arguments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-164",
"text": "**DISCUSSION OF RESULTS**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-165",
"text": "Comparing PARSEVAL scores for a parser trained on the Negra and the T\u00fcBa-D/Z corpus with a grammatical function and a labeled dependency evalua-10 However, some strings labeled as S and VP do not contain a head and thus lack a daughter with a HD function label."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-166",
"text": "11 The relative numbers of instances where a lexical head is not found are comparable for Negra and T\u00fcBa-D/Z. Heads are not found for approximately 4% of subjects, 1% of accusative objects, and 1% of dative objects."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-167",
"text": "These instances are frequently due to elision of the verb in headlines and coordinated clauses."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-168",
"text": "Table 5 : Labeled Dependency Evaluation tion, we confirm that the PARSEVAL scores do not correlate with the scores in the other two evaluations, which given their closeness to the semantic functor argument structure make meaningful targets for evaluating parsers."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-169",
"text": "Shifting the focus to the grammatical function evaluation, we showed that a grammatical function evaluation based on phrasal arguments as provided by K\u00fcbler et al. (2006) is inadequate for comparing parsers trained on the Negra and T\u00fcBa-D/Z corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-170",
"text": "By introducing non-branching phrase nodes above single-word arguments in Negra, it is possible to provide a balanced comparison for the grammatical function label evaluation between Negra and T\u00fcBa-D/Z on both phrasal and single-word arguments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-171",
"text": "The models trained on both corpora perform very similarly in the grammatical function evaluation, in contrast to the claims in K\u00fcbler et al. (2006) ."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-172",
"text": "When the grammatical function label evaluation is extended into a labeled dependency evaluation by finding the verbal head to complete the labeled dependency triple, the parser trained on Negra outperforms that trained on T\u00fcBa-D/Z. The more significant drop in results for T\u00fcBa-D/Z compared to the grammatical function label evaluation may be due to the fact that a verbal lexical head in T\u00fcBa-D/Z is not in the same local tree as its dependents, whereas it is in Negra."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-173",
"text": "The presence of intervening topological field nodes in T\u00fcBa-D/Z may make it difficult for the parser to consistently identify the elements of the dependency triple across several subtrees."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-174",
"text": "The Negra corpus annotation scheme makes it simple to identify the heads of verb arguments, but the flat NP and PP structures make it difficult to extend a labeled dependency analysis beyond verb arguments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-175",
"text": "On the other hand, T\u00fcBa-D/Z has marked heads in NPs and PPs, but it is not as easy to pair verb arguments with their heads because the verbs are in separate topological fields from their argu-ments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-176",
"text": "For a constituent-based corpus annotation scheme to lend itself to a thorough labeled dependency evaluation, heads should be marked clearly for all phrase categories and all non-head elements need to have marked grammatical functions."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-177",
"text": "The presence of topological field nodes in T\u00fcBa-D/Z deserves more discussion in relation to a grammatical dependency evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-178",
"text": "The corpus contains two very different types of nodes in its syntactic trees: nodes such as NP and PP that correspond to constituents and nodes such as VF (Vorfeld) and MF (Mittelfeld) that correspond to word order domains."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-179",
"text": "Constituents such as NP have grammatical relations to other elements in the sentence and have identifiable heads within them, whereas nodes encoding word order domains have neither."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-180",
"text": "12 While constituents and word order domains sometimes coincide, such as the Vorfeld normally consisting of a single constituent, this is not the general case."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-181",
"text": "For example, the Mittelfeld often contains multiple constituents which each stand in different grammatical relations to the verb(s) in the left and right sentence brackets (LK and VC)."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-182",
"text": "Returning to the issue of finding dependencies between constituents, the intervening word order domain nodes can make it non-trivial to determine these relations in T\u00fcBa-D/Z. For example, word order domain nodes will always intervene between a verb and its arguments."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-183",
"text": "In order to have all grammatical dependencies directly encoded in the treebank, it would be preferable for corpus annotation schemes to ensure that a homogeneous constituency representation can be easily obtained."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-185",
"text": "**FUTURE WORK**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-186",
"text": "An evaluation on arguments of verbs is just a first step in working towards a more complete labeled dependency evaluation."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-187",
"text": "Because Negra and T\u00fcBa-D/Z do not have parallel uses of many grammatical function labels beyond arguments of verbs, a more detailed evaluation on more types of dependency relations will require a complex dependency conversion method to provide comparable results."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-188",
"text": "Since previous work on head-lexicalized parsing models for German has focused on PARSEVAL evaluations, it would also be useful to perform a labeled dependency evaluation to determine what effect head lexicalization has on particular constructions for the parsers."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-189",
"text": "Because of the concerns discussed in the previous section and the difference in which types of clauses have marked heads in Negra and T\u00fcBa-D/Z, the effect of head lexicalization on the parsing results may differ for the two corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-190",
"text": "----------------------------------"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-191",
"text": "**CONCLUSION**"
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-192",
"text": "Addressing the general question of how to compare parsing results for different annotation schemes, we revisited the comparison of PCFG parsing results for the Negra and T\u00fcBa-D/Z corpora."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-193",
"text": "We show that these different annotation schemes lead to very significant differences in PARSEVAL scores for unlexicalized PCFG parsing models, but grammatical function label and labeled dependency evaluations for arguments of verbs show that this difference does not carry over to measures which are relevant to the semantic functor-argument structure."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-194",
"text": "In contrast to K\u00fcbler et al. (2006) a grammatical function evaluation on subjects, accusative objects, and dative objects establishes that Negra and T\u00fcBa-D/Z perform similarly when all types of words and phrases appearing as arguments are taken into consideration."
},
{
"sent_id": "366231b855f226f63d637e6b2e1667-C001-195",
"text": "A labeled dependency evaluation based on grammatical relations, which links this work to current work on formalism-independent parser evaluation (e.g., Clark and Curran, 2007) , shows that the parsing performance for Negra and T\u00fcBa-D/Z is comparable."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"366231b855f226f63d637e6b2e1667-C001-23",
"366231b855f226f63d637e6b2e1667-C001-24"
],
[
"366231b855f226f63d637e6b2e1667-C001-25"
],
[
"366231b855f226f63d637e6b2e1667-C001-27"
],
[
"366231b855f226f63d637e6b2e1667-C001-29"
],
[
"366231b855f226f63d637e6b2e1667-C001-31"
],
[
"366231b855f226f63d637e6b2e1667-C001-133"
]
],
"cite_sentences": [
"366231b855f226f63d637e6b2e1667-C001-24",
"366231b855f226f63d637e6b2e1667-C001-25",
"366231b855f226f63d637e6b2e1667-C001-27",
"366231b855f226f63d637e6b2e1667-C001-29",
"366231b855f226f63d637e6b2e1667-C001-31",
"366231b855f226f63d637e6b2e1667-C001-133"
]
},
"@MOT@": {
"gold_contexts": [
[
"366231b855f226f63d637e6b2e1667-C001-31",
"366231b855f226f63d637e6b2e1667-C001-32",
"366231b855f226f63d637e6b2e1667-C001-33",
"366231b855f226f63d637e6b2e1667-C001-34",
"366231b855f226f63d637e6b2e1667-C001-35",
"366231b855f226f63d637e6b2e1667-C001-36"
]
],
"cite_sentences": [
"366231b855f226f63d637e6b2e1667-C001-31",
"366231b855f226f63d637e6b2e1667-C001-36"
]
},
"@EXT@": {
"gold_contexts": [
[
"366231b855f226f63d637e6b2e1667-C001-36"
]
],
"cite_sentences": [
"366231b855f226f63d637e6b2e1667-C001-36"
]
},
"@DIF@": {
"gold_contexts": [
[
"366231b855f226f63d637e6b2e1667-C001-37"
],
[
"366231b855f226f63d637e6b2e1667-C001-96"
],
[
"366231b855f226f63d637e6b2e1667-C001-140"
],
[
"366231b855f226f63d637e6b2e1667-C001-169"
],
[
"366231b855f226f63d637e6b2e1667-C001-171"
],
[
"366231b855f226f63d637e6b2e1667-C001-194"
]
],
"cite_sentences": [
"366231b855f226f63d637e6b2e1667-C001-37",
"366231b855f226f63d637e6b2e1667-C001-96",
"366231b855f226f63d637e6b2e1667-C001-140",
"366231b855f226f63d637e6b2e1667-C001-169",
"366231b855f226f63d637e6b2e1667-C001-171",
"366231b855f226f63d637e6b2e1667-C001-194"
]
},
"@USE@": {
"gold_contexts": [
[
"366231b855f226f63d637e6b2e1667-C001-87"
],
[
"366231b855f226f63d637e6b2e1667-C001-90"
],
[
"366231b855f226f63d637e6b2e1667-C001-129"
],
[
"366231b855f226f63d637e6b2e1667-C001-138"
]
],
"cite_sentences": [
"366231b855f226f63d637e6b2e1667-C001-87",
"366231b855f226f63d637e6b2e1667-C001-90",
"366231b855f226f63d637e6b2e1667-C001-129",
"366231b855f226f63d637e6b2e1667-C001-138"
]
},
"@SIM@": {
"gold_contexts": [
[
"366231b855f226f63d637e6b2e1667-C001-122"
]
],
"cite_sentences": [
"366231b855f226f63d637e6b2e1667-C001-122"
]
}
}
},
"ABC_b56e408c53636ac5fbf5149226319f_3": {
"x": [
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-2",
"text": "We present a novel attention-based recurrent neural network for joint extraction of entity mentions and relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-3",
"text": "We show that attention along with long short term memory (LSTM) network can extract semantic relations between entity mentions without having access to dependency trees."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-4",
"text": "Experiments on Automatic Content Extraction (ACE) corpora show that our model significantly outperforms featurebased joint model by Li and Ji (2014) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-5",
"text": "We also compare our model with an end-toend tree-based LSTM model (SPTree) by Miwa and Bansal (2016) and show that our model performs within 1% on entity mentions and 2% on relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-6",
"text": "Our finegrained analysis also shows that our model performs significantly better on AGENT-ARTIFACT relations, while SPTree performs better on PHYSICAL and PART-WHOLE relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-9",
"text": "Extraction of entities and their relations from text belongs to a very well-studied family of structured prediction tasks in NLP."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-10",
"text": "There are several NLP tasks such as fine-grained opinion mining (Choi et al., 2006) , semantic role labeling (Gildea and Jurafsky, 2002) , etc., which have a similar structure; thus making it an important and a challenging task."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-11",
"text": "Several methods have been proposed for entity mention and relation extraction at the sentencelevel."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-12",
"text": "These can be broadly categorized into -1) pipeline models that treat the identification of entity mentions (Nadeau and Sekine, 2007) and relation classification (Zhou et al., 2005) as two separate tasks; and 2) joint models, also the more recent, which simultaneously identify the entity mention and relations (Li and Ji, 2014; Miwa and Sasaki, 2014) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-13",
"text": "Joint models have been argued to perform better than the pipeline models as knowledge of the typed relation can increase the confidence of the model on entity extraction and vice versa."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-14",
"text": "Recurrent networks (RNNs) (Elman, 1990 ) have recently become very popular for sequence tagging tasks such as entity extraction that involves a set of contiguous tokens."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-15",
"text": "However, their ability to identify relations between non-adjacent tokens in a sequence, e.g., the head nouns of two entities, is less explored."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-16",
"text": "For these tasks, RNNs that make use of tree structures have been deemed more suitable."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-17",
"text": "Miwa and Bansal (2016) , for example, propose an RNN comprised of a sequencebased long short term memory (LSTM) for entity identification and a separate tree-based dependency LSTM layer for relation classification using shared parameters between the two components."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-18",
"text": "As a result, their model depends critically on access to dependency trees, restricting it to sentencelevel extraction and to languages for which (good) dependency parsers exist."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-19",
"text": "Also, their model does not jointly extract entities and relations; they first extract all entities and then perform relation classification on all pairs of entities in a sentence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-20",
"text": "In our previous work (Katiyar and Cardie, 2016) , we address the same task in an opinion extraction context."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-21",
"text": "Our LSTM-based formulation explicitly encodes distance between the head of entities into opinion relation labels."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-22",
"text": "The output space of our model is quadratic in size of the entity and relation label set and we do not specifically identify the relation type."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-23",
"text": "Unfortunately, adding relation type makes the output label space very sparse, making it difficult for the model to learn."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-24",
"text": "In this paper, we propose a novel RNN-based model for the joint extraction of entity mentions and relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-25",
"text": "Unlike other models, our model does not depend on any dependency tree information."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-26",
"text": "Our RNN-based model is a multi-layer bidirectional LSTM over a sequence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-27",
"text": "We encode the output sequence from left-to-right."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-28",
"text": "At each time step, we use an attention-like model on the previously decoded time steps, to identify the tokens in a specified relation with the current token."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-29",
"text": "We also add an additional layer to our network to encode the output sequence from right-to-left and find significant improvement on the performance of relation identification using bi-directional encoding."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-30",
"text": "Our model significantly outperforms the feature-based structured perceptron model of Li and Ji (2014) , showing improvements on both entity and relation extraction on the ACE05 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-31",
"text": "In comparison to the dependency treebased LSTM model of Miwa and Bansal (2016) , our model performs within 1% on entities and 2% on relations on ACE05 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-32",
"text": "We also find that our model performs significantly better than their tree-based model on the AGENT-ARTIFACT relation, while their tree-based model performs better on PHYSICAL and PART-WHOLE relations; the two models perform comparably on all other relation types."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-33",
"text": "The very competitive performance of our non-tree-based model bodes well for relation extraction of non-adjacent entities in low-resource languages that lack good parsers."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-34",
"text": "In the sections that follow, we describe related work (Section 2); our bi-directional LSTM model with attention (Section 3); the training (Section 4); the experiments on ACE dataset (Section 5); results (Section 6); error analysis (Section 7) and conclusion (Section 8)."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-36",
"text": "**RELATED WORK**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-37",
"text": "RNNs (Hochreiter and Schmidhuber, 1997) have been recently applied to many sequential modeling and prediction tasks, such as machine translation (Bahdanau et al., 2015; , named entity recognition (NER) (Hammerton, 2003) , opinion mining (Irsoy and Cardie, 2014) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-38",
"text": "Variants such as adding CRF-like objective on top of LSTMs have been found to produce state-of-the-art results on several sequence prediction NLP tasks (Collobert et al., 2011; Huang et al., 2015; Katiyar and Cardie, 2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-39",
"text": "These models assume conditional independence at the output layer whereas the model we propose in this paper does not assume any conditional independence at the output layer, allowing it to model an arbitrary distribution over output sequences."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-40",
"text": "Relation classification has been widely studied as a stand-alone task, assuming that the arguments of the relations are known in advance."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-41",
"text": "There have been several models proposed including featurebased models (Bunescu and Mooney, 2005; Zelenko et al., 2003) and neural network based models (Socher et al., 2012; dos Santos et al., 2015; Hashimoto et al., 2015; Xu et al., 2015a,b) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-42",
"text": "For joint-extraction of entities and relations, feature-based structured prediction models (Li and Ji, 2014; Miwa and Sasaki, 2014) , joint inference integer linear programming models (Yih and Roth, 2007; Yang and Cardie, 2013) , card-pyramid parsing (Kate and Mooney, 2010) and probabilistic graphical models (Yu and Lam, 2010; Singh et al., 2013) have been proposed."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-43",
"text": "In contrast, we propose a neural network model which does not depend on the availability of any features such as part of speech (POS) tags, dependency trees, etc."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-44",
"text": "Recently, Miwa and Bansal (2016) proposed an end-to-end LSTM based sequence and treestructured model."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-45",
"text": "They extract entities via a sequence layer and relations between the entities via the shortest path dependency tree network."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-46",
"text": "In this paper, we try to investigate recurrent neural networks with attention for extracting semantic relations between entity mentions without using any dependency parse tree features."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-47",
"text": "We also present the first neural network based joint model that can extract entity mentions and relations along with the relation type."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-48",
"text": "In our previous work (Katiyar and Cardie, 2016) , as explained earlier, we proposed a LSTM-based model for joint extraction of opinion entities and relations, but no relation types."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-49",
"text": "This model cannot be directly extended to include relation types as the output space becomes sparse making it difficult for the model to learn."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-50",
"text": "Recent advances in recurrent neural network has seen the application of attention on recurrent neural networks to obtain a representation weighted by the importance of tokens in the sequence model."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-51",
"text": "Such models have been very frequently used in question-answering tasks (for recent examples, see Chen et al. (2016) and Lee et al. (2016) ), machine translation (Luong et al., 2015; Bahdanau et al., 2015) , and many other NLP applications."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-52",
"text": "Pointer networks (Vinyals et al., 2015) , an adaptation of attention models, use these tokenlevel weights as pointers to the input elements."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-53",
"text": "Zhai et al. (2017) , for example, have used these for neural chunking, and Nallapati et al. (2016) and Cheng and Lapata (2016) , for summarization."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-54",
"text": "However, to the best of our knowledge, these networks have not been used for joint extraction of entity mentions and relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-55",
"text": "We present first such attempt to use these attention models with recurrent neural networks for joint extraction of entity mentions and relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-57",
"text": "**MODEL**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-58",
"text": "Our model comprises of a multi-layer bidirectional recurrent network which learns a representation for each token in the sequence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-59",
"text": "We use the hidden representation from the top layer for joint entity and relation extraction."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-60",
"text": "For each token in the sequence, we output an entity tag and a relation tag."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-61",
"text": "The entity tag corresponds to the entity type, whereas the relation tag is a tuple of pointers to related entities and their respective relation types."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-62",
"text": "Figure 1 shows the annotation for an example sentence from the dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-63",
"text": "We transform the relation tags from entity level to token level."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-64",
"text": "For example, we separately model the relation \"ORG-AFF\" for each token in the entity \"ITV News\"."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-65",
"text": "Thus, we model the relations between \"ITV\" and \"Martin Geissler\", and \"News\" and \"Martin Geissler\" separately."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-66",
"text": "We employ a pointer-like network on top of the sequence layer in order to find the relation tag for each token as shown in Figure 2 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-67",
"text": "At each time step, the network utilizes the information available about all output tags from the previous time steps in order to output the entity tag and relation tag jointly for the current token."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-69",
"text": "**MULTI-LAYER BI-DIRECTIONAL RECURRENT NETWORK**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-70",
"text": "We use multi-layer bi-directional LSTMs for sequence tagging because LSTMs are more capable of capturing long-term dependencies between tokens, making it ideal for both entity mention and relation extraction."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-71",
"text": "Using LSTMs, we can compute the hidden state \u2212 \u2192 h t in the forward direction and \u2190 \u2212 h t in the backward direction for every token as below:"
},
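{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-71a",
"text": "A sketch of these recurrences, assuming the standard bi-directional LSTM formulation (the original display equation is not recoverable here): $\\overrightarrow{h}_t = \\overrightarrow{\\mathrm{LSTM}}(x_t, \\overrightarrow{h}_{t-1})$ and $\\overleftarrow{h}_t = \\overleftarrow{\\mathrm{LSTM}}(x_t, \\overleftarrow{h}_{t+1})$, where $x_t$ denotes the input embedding of token $t$ (an assumed symbol)."
},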
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-72",
"text": "For every token t in the subsequent layer l, we combine the representations"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-73",
"text": "and \u2190 \u2212 h l\u22121 t from previous layer l-1 and feed it as an input."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-74",
"text": "In this paper, we only use the hidden state from the last layer L for output layer and compute the top hidden layer representation as below:"
},
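{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-74a",
"text": "A sketch of this combination, assuming the standard form: $z_t = \\tanh(\\overrightarrow{V}\\,\\overrightarrow{h}^L_t + \\overleftarrow{V}\\,\\overleftarrow{h}^L_t)$, where $z_t$ denotes the top hidden layer representation used later for decoding."
},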
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-75",
"text": "\u2190 \u2212 V are weight matrices for combining hidden representations from the two directions."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-77",
"text": "**ENTITY DETECTION**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-78",
"text": "We formulate entity detection as a sequence labeling task using BILOU scheme similar to Li and Ji (2014) and Miwa and Bansal (2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-79",
"text": "We assign each token in the entity with the tag B appended with the entity type if it is the beginning of the entity, I for inside of an entity, L for the end of the entity or U if there is only one token in the entity."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-80",
"text": "Figure 1 shows an example of the entity tag sequence assigned to the sentence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-81",
"text": "For each token in the sequence, we perform a softmax over all candidate tags to output the most likely tag:"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-82",
"text": "Our network structure as shown in Figure 2 also contains connections from the output y t\u22121 of the previous time step to the current top hidden layer."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-83",
"text": "Thus our outputs are not conditionally independent from each other."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-84",
"text": "In order to add connections from y t\u22121 , we transform this output k into a label embedding b k t\u22121 1 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-85",
"text": "We represent each label type Figure 2: Our network structure based on bi-directional LSTMs for joint entity and relation extraction."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-86",
"text": "This snapshot shows the network when encoding the relation tag for the word \"Safwan\" in the sentence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-87",
"text": "The dotted lines in the figure show that top hidden layer and label embeddings for tokens is copied into relation layer."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-88",
"text": "The pointers at attention layer indicate the probability distribution over tokens, the length of the pointers is used to denote the probability value."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-89",
"text": "k with a dense representation b k ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-90",
"text": "We compute the output layer representations as:"
},
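{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-90a",
"text": "A plausible sketch of this computation, assuming a standard output layer with hypothetical weight matrices $W$ and $U$: $y_t = \\mathrm{softmax}(W z_t + U b^{k}_{t-1})$, i.e., the top hidden representation and the previous label embedding jointly determine the tag distribution."
},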
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-91",
"text": "We decode the output sequence from left to right in a greedy manner."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-93",
"text": "**ATTENTION MODEL**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-94",
"text": "We use attention model for relation extraction."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-95",
"text": "Attention models, over an encoder sequence of representations z, can compute a soft probability distribution p over these learned representations, where d i is the i th token in decoder sequence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-96",
"text": "These probabilities are an indication of the importance of different tokens in the encoder sequence:"
},
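{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-96a",
"text": "A sketch in the standard additive-attention form of Vinyals et al. (2015), with assumed weight matrices $W_1$ and $W_2$: $u^i_t = v^{\\top} \\tanh(W_1 z_i + W_2 d_t)$ and $p^i_t = \\mathrm{softmax}(u^i_t)$."
},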
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-97",
"text": "v is a weight matrix for attention which transforms the hidden representations into attention scores."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-98",
"text": "We use pointer networks (Vinyals et al., 2015) in our approach, which are a variation of these attention models."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-99",
"text": "Pointer networks interpret these p i t as the probability distribution over the input encoding sequence and use u i t as pointers to the input elements."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-100",
"text": "We can use these pointers to encode relation between the current token and the previous predicted tokens, making it fit for relation extraction as explained in Section 3.4."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-102",
"text": "**RELATION DETECTION**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-103",
"text": "We formulate relation extraction also as a sequence labeling task."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-104",
"text": "For each token, we want to find the tokens in the past that the current token is related to along with its relation type."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-105",
"text": "In Figure 1 , \"Safwan\" is related to the tokens \"Martin\" as well as \"Geissler\" by the relation type \"PHYS\"."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-106",
"text": "For simplicity, let us assume that there is only one previous token the current token is related to when training, i.e., \"Safwan\" is related to \"Geissler\" via PHYS relation."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-107",
"text": "We can extend our approach to output multiple relations as explained in Section 4."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-108",
"text": "We use pointer networks as described in Sec-tion 3.3."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-109",
"text": "At each time step, we stack the top hidden layer representations from the previous time steps z \u2264t 2 and its corresponding label embeddings b \u2264t ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-110",
"text": "We only stack the top hidden layer representations for the tokens which were predicted as non-O's for previous time steps as shown in Figure 2 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-111",
"text": "Our decoding representation at time t is the concatenation of z t and b t ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-112",
"text": "The attention probabilities can now be computed as below:"
},
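{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-112a",
"text": "A sketch adapting the pointer-network scoring to the stacked decoding representations (an assumed form, with hypothetical weight matrices $W_1$ and $W_2$): $u^i_t = v^{\\top} \\tanh(W_1 [z_i; b_i] + W_2 [z_t; b_t])$ and $p^{\\le t}_t = \\mathrm{softmax}(u^{\\le t}_t)$."
},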
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-113",
"text": "Thus, p t \u2264t corresponds to the probability of each token, in the sequence so far, being related to the current token at time step t. For the case of NONE relations, the token at t is related to itself."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-114",
"text": "We also want to find the type of the relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-115",
"text": "In order to achieve this, we add an extra dimension to v corresponding to the size of relation types R space."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-116",
"text": "Thus, u i t is no longer a score but a R dimensional vector."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-117",
"text": "We then take softmax over this vector of size O(|z \u2264t |\u00d7R) to find the most likely tuple of pointer to the related entity and its relation type."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-119",
"text": "**BI-DIRECTIONAL ENCODING**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-120",
"text": "Bi-directional LSTMs have been found to be able to capture context better than plain left-to-right LSTMs, based on their performance on various NLP tasks (Irsoy and Cardie, 2014) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-121",
"text": "Also, found that their performance on machine translation task improved on reversing the input sentences during training."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-122",
"text": "Inspired by these developments, we experiment with bi-directional encoding at the output layer."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-123",
"text": "We add another top hidden layer on Bi-LSTM in Figure 2 which encodes the output sequence from rightto-left."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-124",
"text": "The two encoding share the same multilayer bi-directional LSTM except for the top hidden layer."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-125",
"text": "Thus, we have two output layers in our network which output the entity tags and relation tags separately."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-126",
"text": "At inference time, we employ heuristics to combine the output from the two directions."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-128",
"text": "**TRAINING**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-129",
"text": "We train our network by maximizing the logprobability of the correct entity E and relation R tag sequences jointly given the sentence S as below:"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-130",
"text": "log p(E, R|S, \u03b8) = 1 |S| i\u2208|S| log p(e i , r i |e Miwa and Bansal (2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-139",
"text": "Multiple Relations Our approach to relation extraction is different from Miwa and Bansal (2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-140",
"text": "Miwa and Bansal (2016) present each pair of entities to their model for relation classification."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-141",
"text": "In our approach, we use pointer networks to identify the related entities."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-142",
"text": "Thus, for our approach described so far if we only compute the argmax on our objective then we limit our model to output only one relation label per token."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-143",
"text": "However, from our analysis of the dataset, an entity may be related to more than one entity in the sentence."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-144",
"text": "Hence, we modify our objective to include multiple relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-145",
"text": "In Figure 2 , token \"Safwan\" is related to both tokens \"Martin\" and \"Geissler\" of the entity \"Martin Geissler\", hence we assign probability of 0.5 to both these tokens."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-146",
"text": "This can be easily expanded to include tokens from other related entities, such that we assign equal probability 1 N to all tokens 3 depending on the number N of these related tokens."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-147",
"text": "The log-probability for the entity part remain the same as in our objective discussed in Section 4, however we modify the relation log-probability as below:"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-148",
"text": "|j:r i,j >0| r i,j log p(r i,j |e \u2264i , r Miwa and Bansal (2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-163",
"text": "Also, there are relation types namely Physical (PHYS), Person-Social (PER-SOC), Organization-Affiliation (ORG-AFF), Agent-Artifact (ART), GPE-Affiliation (GPE-AFF)."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-164",
"text": "ACE05 has a total of 6 relation types including PART-WHOLE."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-165",
"text": "We use the same data splits as Li and Ji (2014) and Miwa and Bansal (2016) such that there are 351 documents for training, 80 for development and the remaining 80 documents for the test set."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-166",
"text": "ACE04 has 7 relation types with an additional Discourse (DISC) type and split ORG-AFF relation type into ORG-AFF and OTHER-AFF."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-167",
"text": "We perform 5-fold cross validation similar to Chan and Roth (2011) for fair comparison with the state-of-theart."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-169",
"text": "**EVALUATION METRICS**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-170",
"text": "In order to compare our system with the previous systems, we report micro F1-scores, Precision and Recall on both entities and relations similar to Li and Ji (2014) and Miwa and Bansal (2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-171",
"text": "An entity is considered correct if we can identify its head and the entity type correctly."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-172",
"text": "A relation is considered correct if we can identify the head of the argument entities and also the relation type."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-173",
"text": "We also report a combined score when both argument entities and relations are correct."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-175",
"text": "**BASELINES AND PREVIOUS MODELS**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-176",
"text": "We compare our approach with two previous approaches."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-177",
"text": "The model proposed by Li and Ji (2014) is a feature-based structured perceptron model with efficient beam-search."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-178",
"text": "They employ a segment-based decoder instead of token-based decoding."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-179",
"text": "Their model outperformed previous stateof-the-art pipelined models."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-180",
"text": "Miwa and Sasaki (2014) (SPTree) recently proposed a LSTM-based model with a sequence layer for entity identification, and a tree-based dependency layer which identifies relations between pairs of candidate entities using the shortest dependency path between them."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-181",
"text": "We also employed our previous approach (Katiyar and Cardie, 2016) for extraction of opinion entities and relations to this task."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-182",
"text": "We found that the performance was not competitive with the two approaches mentioned above, performing upto 10 points lower on relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-183",
"text": "Hence, we do not include the results in Table 1 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-184",
"text": "Also, Li and Ji (2014) showed that the joint model performs better than the pipelined approaches."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-185",
"text": "Thus, we do not include any pipeline baselines."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-186",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-187",
"text": "**HYPERPARAMETERS AND TRAINING DETAILS**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-188",
"text": "We train our model using Adadelta (Zeiler, 2012) with gradient clipping."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-189",
"text": "We regularize our network using dropout (Srivastava et al., 2014) with the drop-out rate tuned using development set."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-190",
"text": "We initialized our word embeddings Table 1 : Performance on ACE05 test dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-191",
"text": "The dashed (\"-\") performance numbers were missing in the original paper (Miwa and Bansal, 2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-192",
"text": "1 We ran the system made publicly available by Miwa and Bansal (2016) , on ACE05 dataset for filling in the missing values and comparing our system with theirs at fine-grained level."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-193",
"text": "Table 2 : Performance of different encoding methods on ACE05 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-194",
"text": "with 300-dimensional word2vec (Mikolov et al., 2013) word embeddings trained on Google News dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-195",
"text": "We have 3 hidden layers in our network and the dimensionality of the hidden units is 100."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-196",
"text": "All the weights in the network are initialized from small random uniform noise."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-197",
"text": "We tune our hyperparameters based on ACE05 development set and use them for training on ACE04 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-198",
"text": "Table 1 compares the performance of our system with respect to the baselines on ACE05 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-199",
"text": "We find that our joint model significantly outperforms the joint structured perceptron model (Li and Ji, 2014) on both entities and relations, despite the unavailability of features such as dependency trees, POS tags, etc."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-200",
"text": "However, if we compare our model to the SPTree models, then we find that their model has better recall on both entities and relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-201",
"text": "In Section 7, we perform error analysis to understand the difference in the performance of the two models in detail."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-202",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-203",
"text": "**RESULTS**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-204",
"text": "We also compare the performance of various encoding schemes in Table 2 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-205",
"text": "We compare the benefits of introducing multiple relations in our objective and bi-directional encoding compared to leftto-right encoding."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-206",
"text": "Multiple Relations We find that modifying our objective to include multiple relations improves the recall of our system on relations, leading to slight improvement on the overall performance on relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-207",
"text": "However, careful tuning of the threshold may further improve precision."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-208",
"text": "Bi-directional Encoding By adding bidirectional encoding to our system, we find that we can significantly improve the performance of our system compared to left-to-right encoding."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-209",
"text": "It also improves precision compared to left-toright decoding combined with multiple relations objective."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-210",
"text": "We find that for some relations it is easier to detect them with respect to one of the entities in the entity pair."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-211",
"text": "PHYS relation is easier identified with respect to GPE entity than PER entity."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-212",
"text": "Thus, our bi-directional encoding of relations allows us to encode these relations with respect to both entities in the relation."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-213",
"text": "Table 3 shows the performance of our model on ACE04 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-214",
"text": "We believe that tuning the hyperparameters of our model can further improve the results on this dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-215",
"text": "As also pointed out by Li and Ji (2014) that ACE05 has better annotation quality, we focused on ACE05 dataset for this work."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-217",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-218",
"text": "In this section, we perform a fine-grained comparison of our model with respect to the SPTree (Miwa and Bansal, 2016) model."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-219",
"text": "We compare the performance of the two models with respect to entities, relation types and the distance between the relation arguments and provide examples from the test set in Table 6 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-220",
"text": "Table 3 : Performance on ACE04 test dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-221",
"text": "The dashed (\"-\") performance numbers were missing in the original paper (Miwa and Bansal, 2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-222",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-223",
"text": "**ENTITIES**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-224",
"text": "We find that our model has lower recall on entity extraction than SPTree as shown in Table 1 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-225",
"text": "Miwa and Bansal (2016) , in one of the ablation tests on ACE05 development set, show that their model can gain upto 2% improvement in recall by entity pretraining."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-226",
"text": "Since we propose a jointmodel, we cannot directly apply their pretraining trick on entities separately."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-227",
"text": "We leave it for future work."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-228",
"text": "Li and Ji (2014) mentioned in their analysis of the dataset that there were many \"UNK\" tokens in the test set which were never seen during training."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-229",
"text": "We verified the same and we hypothesize that for this reason the performance on the entities depends largely on the pretrained word embeddings being used."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-230",
"text": "We found considerable improvements on entity recall when using pretrained word embeddings, if available, for these \"UNK\" tokens."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-231",
"text": "Miwa and Bansal (2016) also use additional features such as POS tags in addition to pretrained word embeddings at the input layer."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-232",
"text": "Table 5 : Performance based on the distance between entity arguments in relations for ACE05 test dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-233",
"text": "in Table 4 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-234",
"text": "Interestingly, we find that the performance of the two models is varied over different relation types."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-235",
"text": "The dependency tree-based model significantly outperforms our joint-model on PHYS and PART-WHOLE relations, whereas our model is significantly better than tree-based model on ART relation."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-236",
"text": "We show an example sentence (S1) in Table 6 , where SPTree model identifies the entities in ART relation correctly but fails to identify ART relation."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-237",
"text": "We compare the performance with respect to PHYS relation in Section 7.3."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-238",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-239",
"text": "**DISTANCE-BASED ANALYSIS**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-240",
"text": "We also compare the performance of the two models on relations based on the distance between the entities in a relation in Table 5 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-241",
"text": "We find that the performance of both the models is very low for distance greater than 7."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-242",
"text": "SPTree model can identify 36 relations out of 131 such relations correctly, while our model can only identify 20 relations in this category."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-243",
"text": "We manually compare the output of the two systems on these cases on several examples to understand the gain of using dependency tree on longer distances."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-244",
"text": "Interestingly, the majority of these relations belong to PHYS type, thus resulting in lower performance on PHYS as discussed in Section 7.2."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-245",
"text": "We found that there were a few instances of co-reference errors as shown in S2 in Table 6 ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-246",
"text": "Our model identifies a PHYS relation between \"here\" and \"baghdad\", whereas the gold annotation has PHYS relation between \"location\" and \"baghdad\"."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-247",
"text": "We think that Table 6 : Examples from the dataset with label annotations from SPTree and our model for comparison."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-248",
"text": "The first row for each example is the gold standard."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-249",
"text": "incorporating these co-reference information during both training and evaluation will further improve the performance of both systems."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-250",
"text": "Another source of error that we found was the inability of our system to extract entities (lower recall) as in S3."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-251",
"text": "Our model could not identify the FAC entity \"residence\"."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-252",
"text": "Hence, we think an improvement on entity performance via methods like pretraining might be helpful in identifying more relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-253",
"text": "For distance less than 7, we find that our model has better recall but lower precision, as expected."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-254",
"text": "----------------------------------"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-255",
"text": "**CONCLUSION**"
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-256",
"text": "In this paper, we propose a novel attention-based LSTM model for joint extraction of entity mentions and relations."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-257",
"text": "Experimentally, we found that our model significantly outperforms feature-rich structured perceptron joint model by Li and Ji (2014) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-258",
"text": "We also compare our model to an endto-end LSTM model by Miwa and Bansal (2016) which comprises of a sequence layer for entity extraction and a tree-based dependency layer for relation classification."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-259",
"text": "We find that our model, without access to dependency trees, POS tags, etc performs within 1% on entities and 2% on relations on ACE05 dataset."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-260",
"text": "We also find that our model performs significantly better than their treebased model on the ART relation, while their treebased model performs better on PHYS and PART-WHOLE relations; the two models perform comparably on all other relation types."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-261",
"text": "In future, we plan to explore pretraining methods for our model which were shown to improve recall on entity and relation performance by Miwa and Bansal (2016) ."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-262",
"text": "We introduce bi-directional output encoding as well as an objective to learn multiple relations in this paper."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-263",
"text": "However, this presents the challenge of combining predictions from the two directions."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-264",
"text": "We use heuristics in this paper to combine the predictions."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-265",
"text": "We think that using probabilistic methods to combine model predictions from both directions may further improve the performance."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-266",
"text": "We also plan to use Sparsemax (Martins and Astudillo, 2016) instead of Softmax for multiple relations, as the former is more suitable for multi-label classification for sparse labels."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-267",
"text": "It would also be interesting to see the effect of reranking (Collins and Koo, 2005 ) on our joint model."
},
{
"sent_id": "b56e408c53636ac5fbf5149226319f-C001-268",
"text": "We also plan to extend the identification of entities to full entity mention span instead of only the head phrase as in Lu and Roth (2015) ."
}
],
"y": {
"@SIM@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-5"
],
[
"b56e408c53636ac5fbf5149226319f-C001-31"
],
[
"b56e408c53636ac5fbf5149226319f-C001-78"
],
[
"b56e408c53636ac5fbf5149226319f-C001-138"
],
[
"b56e408c53636ac5fbf5149226319f-C001-162"
],
[
"b56e408c53636ac5fbf5149226319f-C001-170"
],
[
"b56e408c53636ac5fbf5149226319f-C001-258",
"b56e408c53636ac5fbf5149226319f-C001-259"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-5",
"b56e408c53636ac5fbf5149226319f-C001-31",
"b56e408c53636ac5fbf5149226319f-C001-78",
"b56e408c53636ac5fbf5149226319f-C001-138",
"b56e408c53636ac5fbf5149226319f-C001-162",
"b56e408c53636ac5fbf5149226319f-C001-170",
"b56e408c53636ac5fbf5149226319f-C001-258"
]
},
"@USE@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-5"
],
[
"b56e408c53636ac5fbf5149226319f-C001-165"
],
[
"b56e408c53636ac5fbf5149226319f-C001-192"
],
[
"b56e408c53636ac5fbf5149226319f-C001-218"
],
[
"b56e408c53636ac5fbf5149226319f-C001-258"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-5",
"b56e408c53636ac5fbf5149226319f-C001-165",
"b56e408c53636ac5fbf5149226319f-C001-192",
"b56e408c53636ac5fbf5149226319f-C001-218",
"b56e408c53636ac5fbf5149226319f-C001-258"
]
},
"@BACK@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-17"
],
[
"b56e408c53636ac5fbf5149226319f-C001-44"
],
[
"b56e408c53636ac5fbf5149226319f-C001-225"
],
[
"b56e408c53636ac5fbf5149226319f-C001-231"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-17",
"b56e408c53636ac5fbf5149226319f-C001-44",
"b56e408c53636ac5fbf5149226319f-C001-225",
"b56e408c53636ac5fbf5149226319f-C001-231"
]
},
"@MOT@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-44",
"b56e408c53636ac5fbf5149226319f-C001-45",
"b56e408c53636ac5fbf5149226319f-C001-46"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-44"
]
},
"@DIF@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-139",
"b56e408c53636ac5fbf5149226319f-C001-140",
"b56e408c53636ac5fbf5149226319f-C001-141"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-139",
"b56e408c53636ac5fbf5149226319f-C001-140"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-191"
],
[
"b56e408c53636ac5fbf5149226319f-C001-221"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-191",
"b56e408c53636ac5fbf5149226319f-C001-221"
]
},
"@FUT@": {
"gold_contexts": [
[
"b56e408c53636ac5fbf5149226319f-C001-225",
"b56e408c53636ac5fbf5149226319f-C001-226",
"b56e408c53636ac5fbf5149226319f-C001-227"
],
[
"b56e408c53636ac5fbf5149226319f-C001-261"
]
],
"cite_sentences": [
"b56e408c53636ac5fbf5149226319f-C001-225",
"b56e408c53636ac5fbf5149226319f-C001-261"
]
}
}
},
"ABC_4a28a289ffc730fea4114f6c71bd06_3": {
"x": [
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-2",
"text": "People can learn a new concept and use it compositionally, understanding how to \"blicket twice\" after learning how to \"blicket.\" In contrast, powerful sequence-tosequence (seq2seq) neural networks fail such tests of compositionality, especially when composing new concepts together with existing concepts."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-3",
"text": "In this paper, I show that neural networks can be trained to generalize compositionally through meta seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-4",
"text": "In this approach, models train on a series of seq2seq problems to acquire the compositional skills needed to solve new seq2seq problems."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-5",
"text": "Meta se2seq learning solves several of the SCAN tests for compositional learning and can learn to apply rules to variables."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-8",
"text": "People can learn new words and use them immediately in a rich variety of ways, thanks to their skills in compositional learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-9",
"text": "Once a person learns the meaning of the verb \"to Facebook\", she or he can understand how to \"Facebook slowly,\" \"Facebook eagerly,\" or \"Facebook while walking.\" These abilities are due to systematic compositionality, or the algebraic capacity to understand and produce novel utterances by combining familiar primitives [5, 26] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-10",
"text": "The \"Facebook slowly\" example depends on knowledge of English, but people generalize compositionally in other domains too, such as learning novel commands and meanings in artificial languages [17] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-11",
"text": "A key challenge for cognitive science and artificial intelligence is to understand the computational underpinnings of human compositional learning and to build machines with similar capabilities."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-12",
"text": "Neural networks have long been criticized for lacking compositionality, leading critics to argue they are inappropriate for modeling language and thought [8, 23, 24] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-13",
"text": "Nonetheless neural architectures have continued to advance and make important contributions in natural language processing (NLP) [19] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-14",
"text": "Recent work has revisited these classic critiques through studies of modern neural architectures [10, 15, 3, 20, 22, 2, 6] , with a focus on the sequence-to-sequence (seq2seq) models used successfully in machine translation and other NLP tasks [32, 4, 36] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-15",
"text": "These studies show that powerful seq2seq approaches still have substantial difficulties with compositional generalization, especially when combining a new concept (\"to Facebook\") with previous concepts (\"slowly\" or \"eagerly\") [15, 3, 20] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-16",
"text": "New benchmarks have been proposed to encourage progress [10, 15, 2] , including the SCAN dataset for compositional learning [15] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-17",
"text": "SCAN involves learning to follow instructions such as \"walk twice and look right\" by performing a sequence of appropriate output actions; in this case, the correct response is to \"WALK WALK RTURN LOOK."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-18",
"text": "\" A range of SCAN examples are shown in Table 1 ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-19",
"text": "Seq2seq models are trained on thousands of instructions built compositionally from primitives (\"look\", \"walk\", \"run\", \"jump\", etc.), modifiers (\"twice\", \"around right,\" etc.) and conjunctions (\"and\" and \"after\")."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-20",
"text": "After training, the aim is to execute, zero-shot, novel instructions such as \"walk around right after look twice.\" Previous studies show that seq2seq recurrent neural networks (RNN) generalize well when the training and test sets are similar, but fail catastrophically when generalization requires systematic compositionality [15, 3, 20] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-21",
"text": "For instance, models often fail to understand how to \"jump twice\" after learning how to \"run twice,\" \"walk twice,\" and how to \"jump.\" Developing neural architectures with these compositional abilities remains an open problem."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-22",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-23",
"text": "**MODEL**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-24",
"text": "The meta sequence-to-sequence approach learns how to learn sequence-to-sequence (seq2seq) problems -it uses a series of training seq2seq problems to develop the needed compositional skills for solving new seq2seq problems."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-25",
"text": "An overview of the meta seq2seq learner is illustrated in Figure 1 ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-26",
"text": "In this figure, the network is processing a query instruction \"jump twice\" in the context of a support set that shows how to \"run twice,\" \"walk twice\", \"look twice,\" and \"jump.\" In broad strokes, the architecture is a standard seq2seq model [21] translating a query input into a query output (Figure 1) ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-27",
"text": "A recurrent neural network (RNN) encoder (f ie ; red RNN in bottom right of Figure 1 ) and a RNN decoder (f od ; green RNN in top right of Figure 1 ) work together to interpret the query sequence as an output sequence, with the encoder passing an embedding at each timestep (Q) to a Luong attention decoder [21] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-28",
"text": "The architecture differs The meta sequence-to-sequence learner."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-29",
"text": "The backbone is a sequence-to-sequence (seq2seq) network augmented with a context C produced by an external memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-30",
"text": "The seq2seq model uses an RNN encoder (fie; bottom right) to read a query and then pass stepwise messages Q to an attention-based RNN decoder (f od ; top right)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-31",
"text": "Distinctive to meta seq2seq learning, the messages Q are transformed into C based on context from the support set (left)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-32",
"text": "The transformation operates through a key-value memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-33",
"text": "Support item inputs are encoded and used a keys K while outputs are encoded and used as value V ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-34",
"text": "The query is stepwise compared to the keys, retrieving weighted sums M of the most similar values."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-35",
"text": "This is mapped to C which is decoded as the final output sequence."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-36",
"text": "Color coding indicates shared RNN modules."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-37",
"text": "from standard seq2seq modeling through its use of the support set, external memory, and training procedure."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-38",
"text": "As the messages pass from the query encoder to the query decoder, they are infused with stepwise context C provided by an external memory that stores the support items."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-39",
"text": "The inner-working of the architecture are described in detail below."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-40",
"text": "Input encoder."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-41",
"text": "The input encoder f ie (\u00b7, \u00b7) is shown as red in Figure 1 and is used for the query input instruction (e.g., \"jump twice') and each of the of n s support items and their input instructions (\"run twice\", \"walk twice\", \"jump\", etc.)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-42",
"text": "The encoder first embeds the sequence of symbols (e.g., words) using an embedding layer to get a sequence of input embeddings w t \u2208 R m ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-43",
"text": "The RNN processes each w t to produce the RNN embedding h t \u2208 R m such that"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-44",
"text": "For the query sequence, the embedding h t at each step t = 1, . . . , T passes through both the external memory and the output decoder."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-45",
"text": "For each support sequence, only the last step embedding is needed, and thus each support instruction is expressed as a single vector K i \u2208 R m for i = 1, . . . , n s ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-46",
"text": "These RNN embeddings K i become the keys in the external key-value memory ( Figure 1 )."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-47",
"text": "All of the experiments in this paper use bidirectional long short-term memory (biLSTM) encoders [13] although other choices are possible."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-48",
"text": "Output encoder."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-49",
"text": "The output encoder f oe (\u00b7, \u00b7) is shown in blue in Figure 1 and used for each of the of n s support items and their output sequences (e.g., \"RUN RUN\", \"WALK WALK\", \"JUMP JUMP\", etc.)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-50",
"text": "First, the encoder embeds the sequence of output symbols (e.g., actions) using an embedding layer."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-51",
"text": "Second, a single embedding for the entire sequence is computed using the same process as f ie (\u00b7, \u00b7) (Equation 1)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-52",
"text": "Only the final RNN state is captured for each support item i and stored as the value vector V i \u2208 R m for i = 1, . . . , n s in the key-value memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-53",
"text": "A biLSTM encoder is also used."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-54",
"text": "External memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-55",
"text": "The architecture uses a soft key-value memory that operates similarly to memory networks [31] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-56",
"text": "The precise formulation used is described in [33] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-57",
"text": "The key-value memory uses the attention function"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-58",
"text": "with matrices Q, K, and V for the queries, keys, and values respectively, and the matrix A as the attention weights, A = softmax(QK T / \u221a m)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-59",
"text": "Each query instruction spawns T embeddings from the RNN encoder, one for each query symbol, which populate the rows of the query matrix Q \u2208 R T,m ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-60",
"text": "The encoded support items form the rows of K \u2208 R ns,m and the rows of V \u2208 R ns,m for their input and output sequences, respectively."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-61",
"text": "Attention weights A \u2208 R T,ns"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-62",
"text": "indicates which memory cells are active for each query step."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-63",
"text": "The output of the memory is a matrix M = AV where each row is a weighted combination of the value vectors, indicating the memory output for each of the T query input steps, M \u2208 R T,m ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-86",
"text": "My implementation processes each episode as a batch and takes just one gradient step per episode."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-64",
"text": "Finally, a stepwise context is computed by combining the query input embeddings h t and the stepwise memory outputs M t \u2208 R m with a concatenation layer C t = tanh(W c1 [h t ; M t ]) producing a stepwise context matrix C \u2208 R T,m ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-65",
"text": "For additional representational power, the key-value memory could replace the simple attention module with a multi-head attention module, or even a transformer-style multi-layer multi-head attention module [33] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-66",
"text": "This additional power was not needed for the tasks tackled in this paper, but it is compatible with the meta seq2seq approach."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-67",
"text": "Output decoder."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-68",
"text": "The output decoder is shown in green in Figure 1 and translates the stepwise context C into a sequence of output symbols."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-69",
"text": "The decoder embeds the previous output symbol to get a vector o j\u22121 \u2208 R m which is fed to the RNN (LSTM) along with the previous hidden state g j\u22121 \u2208 R m to get the next hidden state"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-70",
"text": "The initial hidden state g 0 is seeded with the context from the last step C T \u2208 R m ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-71",
"text": "Luong-style attention [21] is used to compute a decoder context u j \u2208 R m such that u j = Attention(g j , C, C)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-72",
"text": "This context is passed through another concatenation layer g j = tanh(W c2 [g j ; u j ]) which is then mapped to a softmax output layer to produce an output symbol."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-73",
"text": "This process repeats until all of the output symbols are produced and the RNN terminates the response by producing an end-of-sequence symbol."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-74",
"text": "Meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-75",
"text": "Meta-training optimizes the network across a set of related training episodes."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-76",
"text": "Each episode is a novel seq2seq problem that consists of a set of n s support items pairs and a set of n q query items (see Figure 2 for an example)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-77",
"text": "Each seq2seq item has a sequence of input symbols and a sequence of output symbols."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-78",
"text": "The support items are embedded and read into the key-value memory as described above."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-79",
"text": "The vocabulary of the entire model is the union of the vocabulary from each of the training episodes."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-80",
"text": "The loss function is the negative log-likelihood of the predicted output sequences for the queries."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-81",
"text": "If reasonable initial training progress can be made without the external memory, it is important to encourage the network to use its memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-82",
"text": "One method passes the support items through the network as additional \"query items\" when computing the training loss, such that the overall loss is based on both the query and support items, i.e. using an auxiliary \"support loss."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-83",
"text": "\" The support output sequences have been observed, embedded, and stored in the key-value memory, thus it is not noteworthy that the network can learn to reconstruct these output sequences."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-84",
"text": "Nevertheless, the support loss can help train the memory if the queries alone would lead the network to under-utilize the memory or get stuck in a local optimum."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-85",
"text": "Other than the support loss, the network is never trained directly on the support items; it is only trained to make generalizations to query items."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-87",
"text": "For improved sample efficiency and training efficiency, the optimizer can take multiple gradient steps per episode or repeatedly cycle through the training episodes, although this was not explored here."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-88",
"text": "The meta seq2seq learner is implemented in PyTorch."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-90",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-91",
"text": "4.1 Architecture and training parameters I use a common architecture and training procedure for all experiments in this paper."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-92",
"text": "The meta seq2seq architecture builds upon the seq2seq architecture from [15] that performed best across a range of SCAN evaluations."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-93",
"text": "The input and output sequence encoders are two-layer biLSTMs with m = 200 hidden units per layer and produce m dimensional embeddings."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-94",
"text": "The output decoder is a two-layer LSTM also with m = 200 hidden units per layer."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-95",
"text": "Dropout is applied with probability 0.5 to each symbol embedding and to each LSTM."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-96",
"text": "A greedy decoder is used since it is effective on SCAN's deterministic outputs [15] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-97",
"text": "In each experiment, the network is meta-trained for 10,000 episodes with the ADAM optimizer [14] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-98",
"text": "Halfway through training, the learning rate is reduced from 0.001 to 0.0001."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-99",
"text": "Gradients with a l 2 -norm larger than 50 are clipped."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-100",
"text": "On the SCAN tasks, my meta seq2seq implementation takes less than an hour to train on a single NVIDIA Titan X GPU."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-101",
"text": "For comparison, my seq2seq implementation takes less than 30 minutes."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-102",
"text": "All models were trained five times with different random initializations and random meta-training episodes."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-104",
"text": "**EXPERIMENT: MUTUAL EXCLUSIVITY**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-105",
"text": "This experiment examines the compositional skills of meta seq2seq learning through a synthetic task."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-106",
"text": "As shown in Figure 2 , each episode introduces a new mapping from non-sense words (\"dax\", \"wif\", etc.) to non-sense meanings (\"red circle\", \"green circle\", etc.), as demonstrated by the support set."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-107",
"text": "To answer the queries, a model must demonstrate two abilities inspired by human generalization patterns [17] : 1) it must learn to use isolated symbol mappings to translate concatenated symbol sequences, and 2) it must learn to reason by mutual exclusivity (ME) to resolve unseen mappings."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-108",
"text": "Children use ME to help learn the meaning of new words [25] , making ME an important area of study in cognitive development."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-109",
"text": "Using ME, children assume that if an object already has one label, then it does not need another."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-110",
"text": "When provided with a familiar object (e.g., a cup) and an unfamiliar object (e.g., a cherry pitter) and asked to \"Show me the dax,\" children tend to pick the unfamiliar object rather than the familiar one."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-111",
"text": "Adults also use ME to help resolve ambiguity."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-112",
"text": "When presented with episodes like Figure 2 in a laboratory setting, participants use ME to resolve unseen mappings and translate queries of concatenated sequences in a symbol-by-symbol manner."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-113",
"text": "Most people generalize in this way spontaneously, without any instructions or feedback about how to respond to compositional queries [17] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-114",
"text": "An untrained meta seq2seq learner would not be expected to generalize spontaneously -human participants come to the task with a starting point that is richer in every way -but computational models should nonetheless be capable of these inferences if trained to make them."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-115",
"text": "This is a challenge for neural networks because the mappings change every episode, and standard architectures do not reason using ME -in fact, they tend to map novel inputs to the most familiar outputs [9] , which is the opposite of reasoning by ME."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-116",
"text": "Experimental setup."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-117",
"text": "The domain consists of four possible pseudowords (input symbols) and four possible meanings (output symbols)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-118",
"text": "During meta-training, each episode is generated by sampling a random mapping from input symbols to output symbols (19 possibilities used for training)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-119",
"text": "Three mappings are presented as support items and one is withheld."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-120",
"text": "The queries consist of arbitrary concatenations of the pseudowords ranging in length from 2 to 6, which can be translated symbol-by-symbol to produce the proper output responses (20 queries per episode)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-121",
"text": "The fourth input symbol -not shown in the support set -is also be queried."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-122",
"text": "The model must learn how to use ME to map this unseen symbol to an unseen meaning rather than a seen meaning (Figure 2 )."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-123",
"text": "During testing, the model is evaluated on five word-to-meaning mappings that were not seen during meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-124",
"text": "Results."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-125",
"text": "Meta seq2seq successfully learns to concatenate and reason about novel mappings using ME, achieving 100% accuracy on the task (SD = 0%)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-126",
"text": "Based on the isolated mappings stored in memory, the network learns to translate sequences of those items."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-127",
"text": "Moreover, it can acquire and use new mappings at test time, utilizing only its external memory and the activation dynamics."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-128",
"text": "By learning to use ME, the network shows it can reason about the absence of symbols in the memory rather than simply their presence."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-129",
"text": "The attention weights and use of memory is visualized and presented in the appendix (Figure A.1 )."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-131",
"text": "**EXPERIMENT: ADDING A NEW PRIMITIVE THROUGH PERMUTATION META-TRAINING**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-132",
"text": "This experiment applies meta seq2seq learning to the SCAN task of adding a new primitive [15] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-133",
"text": "Models are trained to generalize compositionally by decomposing the original SCAN seq2seq task into a series of related seq2seq sub-tasks."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-134",
"text": "The goal is to learn a new primitive instruction and use it compositionally, operationalized in SCAN as the \"add jump\" split [15] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-135",
"text": "Models learn a new primitive \"jump\" and aim to use it compositionally in other instructions, resembling the \"to Facebook\" example introduced earlier in this paper."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-136",
"text": "First, the original seq2seq problem from [15] is described."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-137",
"text": "Second, the adapted problem for training meta seq2seq learners is described."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-138",
"text": "Seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-139",
"text": "Standard seq2seq models applied to SCAN have both a training and a test phase."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-140",
"text": "During training, seq2seq models are exposed to the \"jump\" instruction in a single context demonstrating how to jump in isolation."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-141",
"text": "Also during training, the models are exposed to all primitive and composed instructions for the other actions (e.g., \"walk\", \"walk twice\", \"look around right and walk twice\", etc.) along with the correct output sequences, which is about 13,000 unique instructions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-142",
"text": "Following [15] , the critical \"jump\" demonstration is overrepresented in training to ensure it is learned."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-143",
"text": "During test, models are evaluated on all of the composed instructions that use the \"jump\" primitive, examining the ability to integrate new primitives and use them productively."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-144",
"text": "For instance, models are evaluated on instructions such as \"jump twice\", \"jump around right and walk twice\", \"walk left thrice and jump right thrice,\" along with about 7,000 other instructions using jump."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-145",
"text": "Meta seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-146",
"text": "Meta seq2seq models applied to SCAN have both a meta-training and a test phase."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-147",
"text": "During meta-training, the models observe episodes that are variants of the original seq2seq problem, each of which requires rapid learning of new meanings for the primitives."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-148",
"text": "Specifically, each meta-training episode provides a different random assignment of the primitive instructions ('jump','run', 'walk', 'look') to their meanings ('JUMP','RUN','WALK','LOOK'), with the restriction that the proper (original) permutation not be observed during meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-149",
"text": "Withholding the original permutation, there are 23 possible permutations for meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-150",
"text": "Each episode presents 20 support and 20 query instructions, with instructions sampled from the full SCAN set."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-151",
"text": "The models predict the response to the query instructions, using the support instructions and their outputs as context."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-152",
"text": "Through meta-training, the models are familiarized with all of the possible SCAN training and test instructions, but no episode maps all of its instructions to their original (target) outputs sequences."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-153",
"text": "In fact, models have no signal to learn which primitives in general correspond to which actions, since the assignments are sampled anew for each episode."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-154",
"text": "During test, models are evaluated on rapid learning of new meanings."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-155",
"text": "Just four support items are observed and loaded into memory, consisting of the isolated primitives ('jump','run', 'walk', 'look') paired with their original meanings ('JUMP','RUN','WALK','LOOK')."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-156",
"text": "Notably, memory use at test time (with only four primitive items loaded in memory) diverges substantially from memory use during meta-training (with 20 complex instructions loaded in memory)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-157",
"text": "To evaluate test accuracy, models make predictions on each of the original SCAN test instructions [28] 78.4% --consisting of all composed instructions using \"jump.\" An output sequence is considered correct only if it perfectly matches the whole target sequence."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-158",
"text": "Alternative models."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-159",
"text": "The meta seq2seq learner is compared with an analogous \"standard seq2seq\" learner [21] , which uses the same architecture with the external memory removed."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-160",
"text": "The standard seq2seq learner is trained on the original SCAN problem with a fixed meaning for each primitive."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-161",
"text": "Each meta seq2seq \"episode\" can be interpreted as a standard seq2seq \"batch,\" and a batch size of 40 is chosen to equate the total number of presentations between approaches."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-162",
"text": "All other architectural and training parameters are shared between meta seq2seq learning and seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-163",
"text": "The meta seq2seq learner is also compared with two additional lesioned variants that examine the importance of different architectural components."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-164",
"text": "First, the meta seq2seq learner is trained \"without support loss\" (Section 3 meta-training), which guides the architecture about how to best use its memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-165",
"text": "Second, the meta seq2seq learner is trained \"without decoder attention\" (Section 3 output decoder)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-166",
"text": "This leads to substantial differences in the architecture operation; rather than producing a sequence of context embeddings C 1 , . . . , C T for each step of the T steps of a query sequence, only the last step context C T is computed and passed as a message to the decoder."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-167",
"text": "Results."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-168",
"text": "The results are summarized in Table 2 ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-169",
"text": "On the \"add jump\" test set [15] , standard seq2seq modeling completely fails to generalize compositionally, reaching an average performance of only 0.03% correct (SD = 0.02)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-170",
"text": "It fails even while achieving near perfect performance on the training set (>99% on average)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-171",
"text": "This replicates the results from [15] which trained many seq2seq models, finding the best network performed at only 1.2% accuracy."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-172",
"text": "Again, standard seq2seq models do not show the necessary systematic compositionality."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-173",
"text": "The meta seq2seq learner succeeds at learning compositional skills, achieving an average performance of 99.95% correct (SD = 0.08)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-174",
"text": "At test, the support set contains only the four primitives and their mappings, demonstrating that meta seq2seq learning can handle test episodes that are qualitatively different from those seen during training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-175",
"text": "Moreover, the network learns how to store and retrieve variables from memory with arbitrary assignments, as long as the network is familiarized with the possible input and output symbols during meta-training (but not necessarily how they correspond)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-176",
"text": "A visualization of how meta seq2seq uses attention on SCAN is shown in the appendix (Figure A.2) ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-177",
"text": "The meta seq2seq learner also outperforms syntactic attention which achieves 78.4% and varies widely in performance across runs (SD = 27.4) [28] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-178",
"text": "The lesion analyses demonstrate the importance of various components."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-179",
"text": "The meta seq2seq learner fails to solve the task without the guidance of the support loss, achieving only 5.43% correct (SD = 7.6)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-180",
"text": "These runs typically learn the consistent, static meanings such as \"twice\", \"thrice\", \"around right\" and \"after\", but fail to use its memory properly to learn the dynamic primitives."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-181",
"text": "The meta seq2seq learner also fails when the decoder attention is removed (10.32% correct; SD = 6.4), suggesting that a single m dimensional embedding is not sufficient to relate a query to the support items."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-183",
"text": "**EXPERIMENT: ADDING A NEW PRIMITIVE THROUGH AUGMENTATION META-TRAINING**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-184",
"text": "Experiment 4.3 demonstrates that the meta seq2seq approach can learn how to learn the meaning of a primitive and use it compositionally."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-185",
"text": "However, only a small set of four input primitives and four meanings was considered; it is unclear whether meta seq2seq learning works in more complex compositional domains."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-186",
"text": "In this experiment, meta seq2seq is evaluated on a much larger domain produced by augmenting the meta-training with 20 additional input and action primitives."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-187",
"text": "This more challenging task requires that the networks handle a much larger set of possible meanings."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-188",
"text": "The architecture and training procedures are identical to those used in Experiment 4.3 except where noted."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-189",
"text": "Seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-190",
"text": "To equate learning environment across approaches, standard seq2seq models use a training phase that is substantially expanded from that in Experiment 4.3."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-191",
"text": "During training, the input primitives include the original four ('jump','run', 'walk', 'look') as well as 20 new symbols ('Primitive1,' . . . , 'Primitive20')."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-192",
"text": "The output meanings include the original four ('JUMP','RUN','WALK','LOOK') as well as 20 new actions ('Action1,' . . . , 'Action20')."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-193",
"text": "In the seq2seq training (but notably, not in meta seq2seq training), 'Primitive1' always corresponds to 'Action1,' 'Primitive2' corresponds to 'Action2,' and so on."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-194",
"text": "A training batch uses the original SCAN templates with primitives sampled from the augmented set rather than the original set; for instance, a training instruction may be \"look around right and Primitive20 twice.\" During training the \"jump\" primitive is only presented in isolation, and it is included in every batch to ensure the network learns it properly."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-195",
"text": "Compared to Experiment 4.3, the augmented SCAN domain provides substantially more evidence for compositionality and productivity."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-196",
"text": "Meta seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-197",
"text": "Meta seq2seq models are trained similarly to Experiment 4.3 with an augmented primitive set."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-198",
"text": "During meta-training, episodes are generated by randomly sampling a set of four primitive instructions (from the set of 24) and their corresponding meanings (from the set of 24)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-199",
"text": "For instance, an example training episode could use the four instruction primitives 'Primitive16', 'run', 'Primitive2', and 'Primitive12' mapped respectively to actions 'Action3', 'Action20', 'JUMP', and 'Action11'."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-200",
"text": "Although Experiment 4.3 has only 23 possible assignments, this experiment has orders-of-magnitude more possible assignments than training episodes, ensuring meta-training only provides a very small subset."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-201",
"text": "Moreover, the models are evaluated using a stricter criterion for generalization: the primitive \"jump\" is never assigned to the proper action \"JUMP\" during meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-202",
"text": "The test phase is analogous to the previous experiment."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-203",
"text": "Models are evaluated by loading all of the isolated primitives ('jump','run', 'walk', 'look') paired with their original meanings ('JUMP','RUN','WALK','LOOK') into memory as support items."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-204",
"text": "No other items are included in memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-205",
"text": "To evaluate test accuracy, models make predictions on the original SCAN test instructions consisting of all composed instructions using \"jump.\""
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-206",
"text": "Results."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-207",
"text": "The results are summarized in Table 2 ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-208",
"text": "The meta seq2seq learner succeeds as picking up the meaning of \"jump\" and using it correctly, achieving 98.71% correct (SD = 1.49) on the test instructions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-209",
"text": "The slight decline in performance compared to Experiment 4.3 is not statistically significant with five runs."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-210",
"text": "The standard seq2seq learner takes advantage of the augmented training to generalize better than in standard SCAN training (Experiment 4.3 and [15] ), achieving 12.26% accuracy (SD = 8.33) on the test instructions (with >99% accuracy during training)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-211",
"text": "The augmented task provides 23 fully compositional primitives during training, compared to the three in the original task."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-212",
"text": "Despite this salient compositionality, the basic seq2seq model is still largely unable to make use of it."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-213",
"text": "The lesion analyses show that the support loss is not critical in this setting, and the meta seq2seq learner achieves 99.48% correct without it (SD = 0.37)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-214",
"text": "In contrast to Experiment 4.3, using many primitives more strongly guides the network to use the memory, since the network cannot substantially reduce the training loss without it."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-215",
"text": "The decoder attention remains critical in this setting, and the network attains merely 9.29% correct without it (SD = 13.07)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-216",
"text": "These results demonstrate that only the full meta seq2seq learner is a satisfactory solution to both the learning problems in this experiment and the previous experiment (Table 2 )."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-218",
"text": "**EXPERIMENT: COMBINING FAMILIAR CONCEPTS THROUGH META-TRAINING**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-219",
"text": "The previous experiments show that the meta seq2seq approach can learn how to learn a new primitive."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-220",
"text": "The next experiment examines whether the approach can learn how to combine familiar concepts in new ways, based on the SCAN primitive \"around right\" split [20] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-221",
"text": "Seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-222",
"text": "Seq2seq training holds out all instances of \"around right\" while training on all of the other SCAN instructions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-223",
"text": "Using the symmetry between \"left\" and \"right,\" the network must extrapolate to \"jump around right\" from training examples like \"jump around left,\" \"jump left,\" and \"jump right.\" During test, the models are evaluated on all uses of \"around right.\""
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-224",
"text": "Meta seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-225",
"text": "Meta-training proceeds similarly to Experiment 4.4 with the goal of learning to infer the meaning of \"around right\" from \"around\" and \"right\" through augmentation."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-226",
"text": "Instead of just \"left\" and \"right\", the possibilities also include \"Direction1\" and \"Direction2\" (or since the labels are arbitrary, \"forward\" and \"backward\")."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-227",
"text": "Meta-training episodes are generated by randomly sampling two directions to be used in the instructions (from \"left\", \"right\", \"forward\", \"backward\") and their meanings (from \"LTURN,\" \"RTURN,\" \"FORWARD\",\"BACKWARD\"), permuted to have no systematic correspondence."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-228",
"text": "The primitive \"right\" is never assigned to the proper meaning during meta-training, and thus \"around right\" is never mapped to its correct meaning either."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-229",
"text": "As in the previous SCAN experiments, there are 20 support instructions and 20 query instructions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-230",
"text": "During test, models must infer the proper meaning of \"around right\" and use it compositionally to interpret all of it uses in the original SCAN instructions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-231",
"text": "The test support set is simply just \"turn left\" and \"turn right\" mapped to their proper meanings."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-232",
"text": "Results."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-233",
"text": "Meta seq2seq learning is nearly perfect at inferring the meaning of \"around right\" from its components (99.96% correct; SD = 0.08; Table 3 ), while standard seq2seq fails catastrophically (0.0% correct)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-234",
"text": "Syntactic attention also struggles (28.9%; SD = 34.8) [28] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-235",
"text": "In additional informal experiments, I experienced difficulty training the meta seq2seq learner with 20 additional directions instead of just two in the augmentation set."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-236",
"text": "In this setting, it has trouble learning to use variables and achieves only 36.33% correct (SD = 7.30)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-237",
"text": "Compared to the \"add jump\" split which was successful with 20 additional primitives, the \"around right\" split provides fewer meaning combinations per episode (two rather than four) and learning the directions is more nuanced than learning the actions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-238",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-239",
"text": "**EXPERIMENT: GENERALIZING TO LONGER INSTRUCTIONS THROUGH META-TRAINING**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-240",
"text": "The final experiment examines whether the meta seq2seq approach can learn to generalize to longer sequences, even when the test sequences are longer than any experienced during meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-241",
"text": "This experiment uses the SCAN \"length\" split [15] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-242",
"text": "Seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-243",
"text": "The SCAN instructions are divided into training and test sets based on the number of required output actions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-244",
"text": "Standard seq2seq models are trained on all instructions that require 22 or fewer actions (about 17,000 instructions) and evaluated on all instructions that require longer action sequences (about 4,000 instructions ranging in length from [24] [25] [26] [27] [28] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-245",
"text": "During test, the network must execute instructions such as \"jump around right twice and look opposite right thrice\" that require 25 actions."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-246",
"text": "Both sub-instructions \"jump around right twice\" and \"look opposite right thrice\" are presented during training, but the model was never before asked to produce the conjunction or any output sequence of that length."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-247",
"text": "Meta seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-248",
"text": "Meta-training optimizes the network to extrapolate from shorter items in the support set to longer items in the query set."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-249",
"text": "During test, the model is examined on even longer queries than seeing during meta-training (drawn from the SCAN \"length\" test set)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-250",
"text": "To produce this training specification, the original \"length\" training set is sub-divided into the support pool (all instructions with less than 12 output actions) and a query pool (all instructions with 12 to 22 output actions)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-251",
"text": "During meta-training, the network learns to respond to 20 longer instructions in the query pool given the shorter instructions in the support pool."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-252",
"text": "To encourage the network to use its external memory (rather than learned weights) when answering queries, each episode applies primitive augmentation as in Experiment 4.4."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-253",
"text": "To further amplify the memory, each episode also provides 100 support items."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-254",
"text": "During test, the models load 100 support items from the original \"length\" split training set (lengths 1 to 22 output actions) and responds to queries from the original test set (lengths 24-28)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-255",
"text": "Results."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-256",
"text": "None of the models perform well on longer sequences (Table 3 )."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-257",
"text": "The meta seq2seq learner achieves 16.64% accuracy (SD = 2.10) while the baseline seq2seq learner achieves 7.71% (SD = 1.90)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-258",
"text": "Syntactic attention [28] performs similarly to meta seq2seq at 15.2% (SD = 0.7)."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-259",
"text": "Although the meta seq2seq learner has compositional capabilities, it lacks the truly systematic compositionality needed to properly produce longer output sequences."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-260",
"text": "----------------------------------"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-261",
"text": "**DISCUSSION**"
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-262",
"text": "People are skilled compositional learners while standard neural networks are not."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-263",
"text": "After learning how to \"dax,\" people understand how to \"dax twice,\" \"dax slowly,\" or even \"dax like there is no tomorrow.\" These abilities are central to language and thought yet they are conspicuously lacking in modern neural networks [15, 3, 20, 22, 2] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-264",
"text": "In this paper, I introduced a meta sequence-to-sequence (meta seq2seq) approach for learning to generalize compositionally, exploiting the algebraic structure of a domain to help understand novel utterances."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-265",
"text": "In contrast to standard seq2seq, meta seq2seq learners can abstract away the surface patterns and operate closer to rule space."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-266",
"text": "Rather than attempting to solve \"jump around right twice and walk thrice\" by comparing surface level patterns with training items, meta seq2seq learns to treat the instruction as a template \"x around right twice and y thrice\", where x and y are variables that can be filled arbitrarily."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-267",
"text": "This approach is able to solve SCAN tasks for compositional learning that have eluded standard NLP approaches, with the exception of generalizing to longer sequences [15] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-268",
"text": "In this way, meta seq2seq learners are several steps closer to capturing the compositional abilities studied in synthetic learning tasks [17] and motivated in the \"to dax\" or \"to Facebook\" thought experiments."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-269",
"text": "Meta seq2seq learning has implications for understanding how people generalize compositionally."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-270",
"text": "Similarly to meta-training, people learn in dynamic rather than static environments, tackling a series of changing learning problems rather than iterating repeatedly through a static dataset."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-271",
"text": "There is natural pressure to generalize systematically after a single experience with a new verb like \"to Facebook,\" and thus people are incentivized to generalize compositionally in ways that may resemble the meta-training loss introduced here."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-272",
"text": "Meta learning is a powerful new toolbox for studying learning-to-learn and other elusive cognitive abilities [16, 35] , although more work is needed to understand its implications for cognitive science."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-273",
"text": "The models studied here can learn variables that assign novel meanings to words at test time, using only the network dynamics and the external memory."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-274",
"text": "Although powerful, this is a limited concept of \"variable\" since it requires familiarity with all of the possible input and output assignments during meta-training."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-275",
"text": "This limitation is shared by nearly all existing neural architectures [31, 11, 29] and shows that the meta seq2seq framework falls short of addressing Marcus's challenge of extrapolating outside the training space [23, 24, 22] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-276",
"text": "In future work, I intend to explore adding more symbolic machinery to the architecture [27] with the goal of handling genuinely new symbols."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-277",
"text": "Hybrid models could also address the challenge of generalizing to longer output sequences, a problem that continues to vex neural networks [15, 3, 28] including meta seq2seq learning."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-278",
"text": "The meta seq2seq approach could be applied to a wide range of tasks including low resource machine translation [12] or to graph traversal problems [11] ."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-279",
"text": "For traditional seq2seq tasks like machine translation, standard seq2seq training could be augmented with hybrid training that alternates between standard training and meta-training to encourage compositional generalization."
},
{
"sent_id": "4a28a289ffc730fea4114f6c71bd06-C001-280",
"text": "I am excited about the potential of the meta seq2seq approach both for solving practical problems and for illuminating the foundations of human compositional learning."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-14"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-20"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-169"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-263"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-277"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-14",
"4a28a289ffc730fea4114f6c71bd06-C001-20",
"4a28a289ffc730fea4114f6c71bd06-C001-169",
"4a28a289ffc730fea4114f6c71bd06-C001-263",
"4a28a289ffc730fea4114f6c71bd06-C001-277"
]
},
"@EXT@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-92"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-92"
]
},
"@USE@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-96"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-132"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-134",
"4a28a289ffc730fea4114f6c71bd06-C001-135",
"4a28a289ffc730fea4114f6c71bd06-C001-136"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-142"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-241"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-96",
"4a28a289ffc730fea4114f6c71bd06-C001-132",
"4a28a289ffc730fea4114f6c71bd06-C001-134",
"4a28a289ffc730fea4114f6c71bd06-C001-136",
"4a28a289ffc730fea4114f6c71bd06-C001-142",
"4a28a289ffc730fea4114f6c71bd06-C001-241"
]
},
"@SIM@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-170",
"4a28a289ffc730fea4114f6c71bd06-C001-171"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-171"
]
},
"@DIF@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-210"
],
[
"4a28a289ffc730fea4114f6c71bd06-C001-263",
"4a28a289ffc730fea4114f6c71bd06-C001-264",
"4a28a289ffc730fea4114f6c71bd06-C001-265"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-210",
"4a28a289ffc730fea4114f6c71bd06-C001-263"
]
},
"@MOT@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-263",
"4a28a289ffc730fea4114f6c71bd06-C001-264"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-263"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-267"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-267"
]
},
"@FUT@": {
"gold_contexts": [
[
"4a28a289ffc730fea4114f6c71bd06-C001-276",
"4a28a289ffc730fea4114f6c71bd06-C001-277"
]
],
"cite_sentences": [
"4a28a289ffc730fea4114f6c71bd06-C001-277"
]
}
}
},
"ABC_e4452ce844b74c35f257c916aae120_3": {
"x": [
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-2",
"text": "In this paper, we study AMR-to-text generation, framing it as a translation task and comparing two different MT approaches (Phrasebased and Neural MT)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-3",
"text": "We systematically study the effects of 3 AMR preprocessing steps (Delexicalisation, Compression, and Linearisation) applied before the MT phase."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-4",
"text": "Our results show that preprocessing indeed helps, although the benefits differ for the two MT models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-5",
"text": "The implementations of the models are publicly available 1 ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-8",
"text": "Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-9",
"text": "While there is broad consensus among NLG scholars on the output of NLG systems (i.e., text), there is far less agreement on what the input should be; see Gatt and Krahmer (2017) for a recent review."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-10",
"text": "Over the years, NLG systems have taken a wide range of inputs, including for example images (Xu et al., 2015) , numeric data (Gkatzia et al., 2014) and semantic representations (Theune et al., 2001 )."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-11",
"text": "This study focuses on generating natural language based on Abstract Meaning Representations (AMRs) (Banarescu et al., 2013) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-12",
"text": "AMRs encode the meaning of a sentence as a rooted, directed and acyclic graph, where nodes represent concepts, and labeled directed edges represent relations among these concepts."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-13",
"text": "The formalism strongly relies on the PropBank notation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-14",
"text": "Figure 1 shows an example."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-15",
"text": "(1: https://github.com/ThiagoCF05/LinearAMR) AMRs have increased in popularity in recent years, partly because they are relatively easy to produce, to read and to process automatically."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-16",
"text": "In addition, they can be systematically translated into firstorder logic, allowing for a well-specified modeltheoretic interpretation (Bos, 2016) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-17",
"text": "Most earlier studies on AMRs have focused on text understanding, i.e. processing texts in order to produce AMRs (Flanigan et al., 2014; Artzi et al., 2015) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-18",
"text": "However, recently the reverse process, i.e. the generation of texts from AMRs, has started to receive scholarly attention (Flanigan et al., 2016; Song et al., 2016; Pourdamghani et al., 2016; Song et al., 2017; Konstas et al., 2017) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-19",
"text": "We assume that in practical applications, conceptualisation models or dialogue managers (models which decide \"what to say\") output AMRs."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-20",
"text": "In this paper we study different ways in which these AMRs can be converted into natural language (deciding \"how to say it\")."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-21",
"text": "We approach this as a translation problem-automatically translating from AMRs into natural language-and the key contribution of this paper is that we systematically compare different preprocessing strategies for two different MT systems: Phrase-based MT (PBMT) and Neural MT (NMT)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-22",
"text": "We look at potential benefits of three preprocessing steps on AMRs before feeding them into an MT system: delexicalisation, compression, and linearisation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-23",
"text": "Delexicalisation decreases the sparsity of an AMR by removing constant values, compression removes nodes and edges which are less likely to be aligned to any word on the textual side and linearisation 'flattens' the AMR in a specific order."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-24",
"text": "Combining all possibilities gives rise to 2^3 = 8 AMR preprocessing strategies, which we evaluate for two different MT systems: PBMT and NMT. (Figure 1: Example of an AMR.)"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-25",
"text": "Following earlier work in AMR-to-text generation and the MT literature, we evaluate the system outputs in terms of fluency, adequacy and post-editing effort, using BLEU (Papineni et al., 2002) , METEOR (Lavie and Agarwal, 2007) and TER (Snover et al., 2006) scores, respectively."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-26",
"text": "We show that preprocessing helps, although the extent of the benefits differs for the two MT systems."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-28",
"text": "**RELATED STUDIES**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-29",
"text": "To the best of our knowledge, Flanigan et al. (2016) was the first study that introduced a model for natural language generation from AMRs."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-30",
"text": "The model consists of two steps."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-31",
"text": "First, the AMR-graph is converted into a spanning tree, and then, in a second step, this tree is converted into a sentence using a tree transducer."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-32",
"text": "In Song et al. (2016) , the generation of a sentence from an AMR is addressed as an asymmetric generalised traveling salesman problem (AGTSP)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-33",
"text": "For sentences shorter than 30 words, the model does not beat the system described by Flanigan et al. (2016) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-34",
"text": "However, Song et al. (2017) treat the AMR-to-text task using a Synchronous Node Replacement Grammar (SNRG) and outperform Flanigan et al. (2016) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-35",
"text": "Although AMRs do not contain articles and do not represent inflectional morphology for tense and number (Banarescu et al., 2013) , the formalism is relatively close to the (English) language."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-36",
"text": "Motivated by this similarity, Pourdamghani et al. (2016) proposed an AMR-to-text method that organises some of these concepts and edges in a flat representation, commonly known as Linearisation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-37",
"text": "Once the linearisation is complete, Pourdamghani et al. (2016) map the flat AMR into an English sentence using a Phrase-Based Machine Translation (PBMT) system."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-38",
"text": "This method yields better results than Flanigan et al. (2016) on development and test set from the LDC2014T12 corpus."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-39",
"text": "Pourdamghani et al. (2016) train their system using a set of AMR-sentence pairs obtained by the aligner described in Pourdamghani et al. (2014) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-40",
"text": "In order to decrease the sparsity of the AMR formalism caused by the ratio of broad vocabulary and relatively small amount of data, this aligner drops a considerable amount of the AMR structure, such as role edges :ARG0, :ARG1, :mod, etc."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-41",
"text": "However, inspection of the gold-standard alignments provided in the LDC2016E25 corpus revealed that this rule-based compression can be harmful for the generation of sentences, since such role edges can actually be aligned to function words in English sentences."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-42",
"text": "So having these roles available arguably could improve AMR-to-text translation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-43",
"text": "This indicates that a better comparison of the effects of different preprocessing steps is called for, which we do in this study."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-44",
"text": "In addition, Pourdamghani et al. (2016) use PBMT, which is devised for translation but also utilised in other NLP tasks, e.g. text simplification (Wubben et al., 2012; \u0160tajner et al., 2015) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-45",
"text": "However, these systems have the disadvantage of having many different feature functions, and finding optimal settings for all of them increases the complexity of the problem from an engineering point of view."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-46",
"text": "An alternative MT model has been proposed: Neural Machine Translation (NMT)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-47",
"text": "NMT models frame translation as a sequence-to-sequence problem (Bahdanau et al., 2015) , and have shown strong results when translating between many different language pairs (Bojar et al., 2015) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-48",
"text": "Recently, Konstas et al. (2017) introduced sequence-to-sequence models for parsing (text-to-AMR) and generation (AMR-to-text)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-49",
"text": "They use a semi-supervised training procedure, incorporating 20M English sentences which do not have a gold-standard AMR, thus overcoming the limited amount of data available."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-50",
"text": "They report state-of-the-art results for the task, which suggests that NMT is a promising alternative for AMR-to-text."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-52",
"text": "**MODELS**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-53",
"text": "We describe our AMR-to-text generation models, which rely on 3 preprocessing steps (delexicalisation, compression, and/or linearisation) followed by machine translation and realisation steps."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-55",
"text": "**DELEXICALISATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-56",
"text": "Inspection of the LDC2016E25 corpus reveals that, on average, 22% of the structure of an AMR consists of constant values, such as names, quantities, and dates."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-57",
"text": "This information increases the sparsity of the data, and makes it arguably more difficult to map an AMR into a textual format."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-58",
"text": "To address this, Pourdamghani et al. (2016) look for special realisation components for names, dates and numbers in the development and test sets and add them to the training set."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-59",
"text": "On the other hand, similar to Konstas et al. (2017) , we delexicalise these constants, replacing the original information with tags (e.g., name1 , quant1 )."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-60",
"text": "A list of tag-value pairs is kept, in order to identify the positions and reinsert the original information into the sentence after the translation step is completed."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-61",
"text": "Figure 2 shows a delexicalised AMR."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-63",
"text": "**COMPRESSION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-64",
"text": "Given the alignment between an AMR and a sentence, the nodes and edges in the AMR can either be aligned to words in the sentence or not."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-65",
"text": "So before the linearisation step, we would like to know which elements of an AMR should actually be part of the 'flattened' representation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-66",
"text": "Following the aligner of Pourdamghani et al. (2014) , Pourdamghani et al. (2016) clean an AMR by removing some nodes and edges independent of the context."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-67",
"text": "Instead, we are using alignments that may relate a given node or edge to an English word according to the context."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-68",
"text": "In Figure 1 for instance, the first edge :ARG1 is aligned to the preposition to from the sentence, whereas the second edge with a similar value is not aligned to any word in the sentence."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-69",
"text": "Therefore, we need to train a classifier to decide which parts of an AMR should be in the flattened representation according to the context."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-70",
"text": "To solve the problem, we train a Conditional Random Field (CRF) which determines whether a node or an edge of an AMR should be included in the flattened representation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-71",
"text": "The classification process is sequential over a flattened representation of an AMR obtained by depth-first search through the graph."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-72",
"text": "Each element is represented by its name and its parent's name."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-73",
"text": "We use CRFSuite (Okazaki, 2007) to implement our model."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-75",
"text": "**LINEARISATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-76",
"text": "After Compression, we flatten the AMR to serve as input to the translation step, similarly as proposed in Pourdamghani et al. (2016) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-77",
"text": "We perform a depth-first search through the AMR, printing the elements according to their visiting order."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-78",
"text": "In a second step, also following Pourdamghani et al. (2016), we implemented a version of the 2-Step Classifier from Lerner and Petrov (2013) to preorder the elements from an AMR according to the target side."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-80",
"text": "**2-STEP CLASSIFIER**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-81",
"text": "We implement the preordering method proposed by Lerner and Petrov (2013) in the following way."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-82",
"text": "We define the order among a head node and its subtrees in two steps."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-83",
"text": "In the first, we use a trained maximum entropy classifier to predict for each subtree whether it should occur before or after the head node."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-84",
"text": "As features, we represent the head node by its frameset, whereas the subtree is represented by its head node frameset and parent edge."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-85",
"text": "Once we divide the subtrees into those which should occur before and after the head node, we use a maximum entropy classifier specific to the size of each subtree group to predict their order."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-86",
"text": "For instance, for a group of 2 subtrees, a maximum entropy classifier specific for groups of 2 subtrees would be used to predict the permutation order of them (0-1 or 1-0)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-87",
"text": "As features, the head node is also represented by its PropBank frameset, whereas the subtrees of the groups are represented by their parent edges, their head node framesets and by which side of the head node they are (before or after)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-88",
"text": "We train classifiers for groups of sizes between 2 and 4 subtrees."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-89",
"text": "For bigger groups, we use the depth-first search order."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-91",
"text": "**TRANSLATION MODELS**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-92",
"text": "To map a flat AMR representation into an English sentence, we use phrase-based (Koehn et al., 2003) and neural machine translation (Bahdanau et al., 2015) models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-94",
"text": "**PHRASE-BASED MACHINE TRANSLATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-95",
"text": "These models use Bayes' rule to formalise the problem of translating a text from a source language f into a target language e. In our case, we want to translate a flat amr into an English sentence e, as Equation 1 shows."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-96",
"text": "The a priori function P (e) usually is represented by a language model trained on the target language."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-97",
"text": "The a posteriori probability is calculated by the log-linear model described in Equation 2."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-98",
"text": "Each h j (amr, e) is an arbitrary feature function over AMR-sentence pairs."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-99",
"text": "To calculate it, the flat amr is segmented into a sequence of I phrases amr_1, ..., amr_I, such that each phrase amr_i is translated into a target phrase e_i, as described by Equation 3."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-100",
"text": "As feature functions, we used direct and inverse phrase translation probabilities and lexical weighting; word, unknown word and phrase penalties."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-101",
"text": "We also used models to reorder a flat amr according to the target sentence e at decoding time."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-102",
"text": "They work at the word level (Koehn et al., 2003) , at the level of adjacent phrases (Koehn et al., 2005) and beyond adjacent phrases (hierarchical level) (Galley and Manning, 2008) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-103",
"text": "Phrase- and hierarchical-level models are also known as lexicalised reordering models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-104",
"text": "As in Koehn et al. (2003) , given s_i, the start position of the source phrase amr_i translated into the English phrase e_i, and f_{i-1}, the end position of the source phrase amr_{i-1} translated into the English phrase e_{i-1}, a distortion model \u03b1^{|s_i - f_{i-1} - 1|} is defined as a distance-based reordering model."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-105",
"text": "\u03b1 is chosen by tuning the model."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-212",
"text": "Delexicalisation seems to improve results, corroborating the findings from Konstas et al. (2017) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-213",
"text": "While Delexicalisation is harmful and Compression is beneficial for PBMT, we see the opposite in NMT models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-106",
"text": "Lexicalised models are more complex than distance-based ones, but usually help the system to obtain better results (Koehn et al., 2005; Galley and Manning, 2008) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-107",
"text": "Given a possible set of target phrases e = (e_1, ..., e_n) based on a source amr, and a set of alignments a = (a_1, ..., a_n) that maps a source phrase amr_{a_i} into a target phrase e_i, a lexicalised model aims to predict a set of orientations o = (o_1, ..., o_n), as Equation 4 shows."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-108",
"text": "In the hierarchical model, we distinguish the discontinuous operation by direction: discontinuous right (a_i - a_{i-1} < 1) and discontinuous left (a_i - a_{i-1} > 1)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-109",
"text": "These models are important for our task, since the preordering method used in the Linearisation step may be insufficient to match the target sentence order."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-111",
"text": "**NEURAL MACHINE TRANSLATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-112",
"text": "Following the attention-based Neural Machine Translation (NMT) model introduced by Bahdanau et al. (2015) , given a flat amr = (amr_1, amr_2, ..., amr_N) and its English sentence translation e = (e_1, e_2, ..., e_M), a single neural network is trained to translate amr into e by directly learning to model p(e | amr)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-113",
"text": "The network consists of one encoder, one decoder, and one attention mechanism."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-114",
"text": "The encoder is a bi-directional RNN with gated recurrent units (GRU) (Cho et al., 2014) , where one forward RNN \u2192\u03a6_enc reads the amr from left to right and generates a sequence of forward annotation vectors (\u2192h_1, ..., \u2192h_N)"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-115",
"text": ", and a backward RNN \u2190\u03a6_enc reads the amr from right to left and generates a sequence of backward annotation vectors (\u2190h_1, ..., \u2190h_N)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-116",
"text": "The final annotation vector h_i is the concatenation of the forward and backward vectors, h_i = [\u2192h_i; \u2190h_i]."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-117",
"text": "The decoder is a neural LM conditioned on the previously emitted words and the source sentence via an attention mechanism over the annotation vectors C. A multilayer perceptron is used to initialise the decoder's hidden state s_0, where the input to this network is the concatenation of the last forward and backward vectors, [\u2192h_N; \u2190h_1]."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-118",
"text": "At each time step t of the decoder, we compute a time-dependent context vector c_t based on the annotation vectors C, the decoder's previous hidden state s_{t-1} and the target English word \u1ebd_{t-1} emitted by the decoder in the previous time step."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-119",
"text": "A single-layer feed-forward network computes an expected alignment a_{t,i} between each source annotation vector h_i and the target word to be emitted at the current time step t, as in (6):"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-120",
"text": "In Equation (7), these expected alignments are normalised and converted into probabilities:"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-121",
"text": "where \u03b1_{t,i} are the model's attention weights, which are in turn used in computing the time-dependent context vector c_t = \u2211_{i=1}^{N} \u03b1_{t,i} h_i."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-122",
"text": "Finally, the context vector c_t is used in computing the decoder's hidden state s_t for the current time step t, as shown in Equation (8):"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-123",
"text": "where s_{t-1} is the decoder's previous hidden state, W_e[\u1ebd_{t-1}] is the embedding of the word emitted in the previous time step, and c_t is the updated time-dependent context vector."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-147",
"text": "For the Linearisation step, we flatten the AMR structure based on a depth-first search (-Preorder) or preorder it with our 2-step classifier (+Preorder)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-124",
"text": "Given a hidden state s_t, the probabilities for the next target word are computed using one projection layer followed by a softmax, as illustrated in Equation (9), where the matrices L_o, L_s, L_w and L_c are transformation matrices and c_t is the time-dependent context vector."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-126",
"text": "**REALISATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-127",
"text": "Since we delexicalise names, dates, quantities and values from AMRs, we need to textually realise this information once we obtain the results from the translation step."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-128",
"text": "As we kept all the original information and their relation with the tags, we just need to replace one for the other."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-129",
"text": "We implement some rules to adapt our generated texts to those seen in the training set."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-130",
"text": "Unlike the AMRs, we represent months nominally rather than numerically - month 5 becomes May, for example."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-131",
"text": "Values and quantities bigger than a thousand are also partly realised nominally."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-132",
"text": "The value 8500000000 would be realised as 8.5 billion for instance."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-133",
"text": "On the other hand, names are realised as they are."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-134",
"text": "(9)"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-136",
"text": "**EVALUATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-138",
"text": "**DATA**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-139",
"text": "We used the corpus LDC2016E25 provided by the SemEval 2017 Task 9 in our evaluation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-140",
"text": "This corpus consists of aligned AMR-sentence pairs, mostly newswire."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-141",
"text": "We considered the train/dev/test sets splitting proposed in the original setting, totaling 36,521, 1,368 and 1,371 AMR-sentence pairs, respectively."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-142",
"text": "Compression and Linearisation methods, as well as Phrase-based Machine Translation models were trained over the gold-standard alignments between AMRs and sentences on the training set of the corpus."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-144",
"text": "**EVALUATED MODELS**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-145",
"text": "We test models with and without the Delexicalisation/Realisation (-Delex and +Delex) and Compression (-Compress and +Compress) steps."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-146",
"text": "In models without the Compression step, we include all the elements from an AMR in the flattened representation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-148",
"text": "Finally, we translate a flattened AMR into text using a Phrase-based (PBMT) and a Neural Machine Translation model (NMT)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-149",
"text": "In total, we evaluated 16 models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-151",
"text": "**PHRASE-BASED MACHINE TRANSLATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-152",
"text": "We used a standard PBMT system built using Moses toolkit (Koehn et al., 2007) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-153",
"text": "At training time, we extract and score phrases up to 9 tokens in length."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-154",
"text": "All the feature functions were trained using the gold-standard alignments from the training set and their weights were tuned on the development data using k-batch MIRA with k = 60 (Cherry and Foster, 2012) with BLEU as the evaluation metric."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-155",
"text": "A distortion limit of 6 was used for the reordering models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-156",
"text": "Lexicalised reordering models were bidirectional."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-157",
"text": "At decoding time, we use a stack size of 1000."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-158",
"text": "Our language model P (e) is a 5-gram LM trained on the Gigaword Third Edition corpus using KenLM (Heafield et al., 2013) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-159",
"text": "For the models with the Delexicalisation step, we trained the language model with a delexicalised version of Gigaword by parsing the corpus using the Stanford Named Entity Recognition tool (Finkel et al., 2005) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-160",
"text": "All the entities labeled as LOCATION, PERSON, ORGANISATION or MISC were replaced by the tag nameX ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-161",
"text": "Entities labeled as NUMBER or MONEY were replaced by the tag quantX ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-162",
"text": "Finally, entities labeled as PERCENT or ORDINAL were replaced by valueX ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-163",
"text": "In the tags, X is replaced by the ordinal position of the entity in the sentence."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-165",
"text": "**NEURAL MACHINE TRANSLATION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-166",
"text": "The encoder is a bidirectional RNN with GRU, each with a 1024D hidden unit."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-167",
"text": "Source and target word embeddings are 620D each and are both trained jointly with the model."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-168",
"text": "All non-recurrent matrices are initialised by sampling from a Gaussian (\u00b5 = 0, \u03c3 = 0.01), recurrent matrices are random orthogonal and bias vectors are all initialised to zero."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-169",
"text": "The decoder RNN also uses GRU and is a neural LM conditioned on its previous emissions and the source sentence by means of the source attention mechanism."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-211",
"text": "Neural MT The first impression from our NMT experiments is that using Compression consistently deteriorates translations according to all metrics evaluated."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-170",
"text": "We apply dropout with a probability of 0.3 in both source and target word embeddings, in the encoder and decoder RNNs inputs and recurrent connections, and before the readout operation in the decoder RNN."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-171",
"text": "We follow Gal and Ghahramani (2016) and apply dropout to the encoder and decoder RNNs using the same mask in all time steps."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-172",
"text": "Models are trained using stochastic gradient descent with Adadelta (Zeiler, 2012) and minibatches of size 40."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-173",
"text": "We apply early stopping for model selection based on BLEU scores, so that if a model does not improve on the validation set for more than 20 epochs, training is halted."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-175",
"text": "**MODELS FOR COMPARISON**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-176",
"text": "We compare BLEU scores for some of the AMRto-text systems described in the literature (Flanigan et al., 2016; Song et al., 2016; Pourdamghani et al., 2016; Song et al., 2017; Konstas et al., 2017) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-177",
"text": "Since the models of Flanigan et al. (2016) and Pourdamghani et al. (2016) are publicly available, we also use them with the same training data as our models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-178",
"text": "For Flanigan et al. (2016) , we specifically use the version available on GitHub 2 ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-179",
"text": "For Pourdamghani et al. (2016) , we use the version available at the first author's website 3 ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-180",
"text": "The rules used for the preordering model and the feature functions from the PBMT system are trained using alignments over AMR-sentence pairs from the training set obtained with the aligner described by Pourdamghani et al. (2014) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-181",
"text": "Unlike Pourdamghani et al. (2016), we do not use lexicalised reordering models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-182",
"text": "Moreover, we tune the weights of the feature functions with MERT (Och, 2003) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-183",
"text": "Both models make use of a 5-gram language model trained on Gigaword Third Edition corpus with KenLM."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-185",
"text": "**METRICS**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-186",
"text": "To evaluate fluency, adequacy and post-editing effort of the models, we use BLEU (Papineni et al., 2002) , METEOR (Lavie and Agarwal, 2007) and TER (Snover et al., 2006) , respectively."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-187",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-188",
"text": "**RESULTS**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-189",
"text": "Table 1 depicts the scores of the different models by the size of the data they were trained on."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-190",
"text": "For illustration, we also depict the BLEU scores of all the AMR-to-text systems described in the literature."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-191",
"text": "The models of Flanigan et al. (2016) and Pourdamghani et al. (2016) were officially trained with 10,313 AMR-sentence pairs from the LDC2014T12 corpus, and, in our study, with 36,521 AMR-sentence pairs from LDC2016E25 (as were our models)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-192",
"text": "The ones of Song et al. (2016) and Song et al. (2017) were trained with 16,833 pairs from the LDC2015E86 corpus."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-193",
"text": "Konstas et al. (2017) , which presents the highest quantitative result in the task so far, also used the LDC2015E86 corpus plus 20 million English sentences from the Gigaword corpus with a semi-supervised approach."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-194",
"text": "We report results both when their model was trained only with AMR-sentence pairs from the corpus, and when augmented with the additional 20 million sentences."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-195",
"text": "Among the PBMT models, the Delexicalisation step (+Delex) does not seem to play a role in obtaining better sentences from AMRs."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-196",
"text": "All the models with the preordering method in the Linearisation step outperform Song et al. (2017) and yield results competitive with Pourdamghani et al. (2016)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-197",
"text": "In our NMT models, apparently the Compression step is harmful to the task, whereas Delexicalisation and preordering in Linearisation lead to better results."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-198",
"text": "However, none of the NMT models outperforms either the PBMT models or the baselines."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-199",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-200",
"text": "**DISCUSSION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-201",
"text": "In this paper, we studied models for AMR-to-text generation using machine translation."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-202",
"text": "We systematically analysed the effects of 3 processing strategies on AMRs before feeding them either to a Phrase-based or a Neural MT system."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-203",
"text": "The evaluation was performed on the LDC2016E25 corpus, provided by SemEval 2017 Task 9."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-204",
"text": "All the models had the fluency, adequacy and post-editing effort of their produced sentences measured by BLEU, METEOR and TER, respectively."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-205",
"text": "In general, we found that processing AMRs helps, although the effects differ for the different systems."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-206",
"text": "Phrase-based MT Delexicalisation (+Delex) does not seem to play a role in obtaining better sentences from AMRs using PBMT."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-207",
"text": "Our best model (PBMT-Delex+Compress+Preorder) presents competitive results to Pourdamghani et al. (2016) with the advantage that no technique is necessary to overcome data sparsity."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-208",
"text": "Compressing an AMR graph with a classifier shows improvements over a comparable model without compression, but not as strong as preordering the elements in the Linearisation step."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-209",
"text": "In fact, preordering seems to be the most important preprocessing step across all three evaluation metrics."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-210",
"text": "We note that the preordering success was expected, based on previous results (Pourdamghani et al., 2016) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-214",
"text": "Besides the differences between these two MT architectures, applying preordering in the Linearisation step improves results in both cases."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-215",
"text": "This seems to contradict the finding in Konstas et al. (2017) regarding neural models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-216",
"text": "We conjecture that the additional training data used by Konstas et al. (2017) may have decreased the gap between using and not using preordering (see also below)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-217",
"text": "More research is necessary to settle this point."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-218",
"text": "PBMT vs. NMT PBMT models generate much better sentences from AMRs than NMT models in terms of fluency, adequacy and post-editing effort."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-219",
"text": "We believe that the lower performance of NMT models is due to the small size of the training set (36,521 AMR-sentence pairs)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-220",
"text": "Neural models are known to perform well when trained on much larger data sets, e.g. in the order of millions of entries, as exemplified by Konstas et al. (2017) ."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-221",
"text": "PBMT models trained on small data sets clearly outperform NMT ones, e.g. Konstas et al. (2017) reported 22.0 BLEU, whereas Pourdamghani et al. (2016)'s best model achieved 26.9 BLEU, and our best model performs comparably (26.8 BLEU)."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-222",
"text": "Model comparison While the best PBMT models are comparable to the state-of-the-art AMR-to-text systems, the current best results are reported by Konstas et al. (2017), showing the potential of applying deep learning to large amounts of training data with a 33.8 BLEU score."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-223",
"text": "However, this result crucially relies on the existence of a very large dataset."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-224",
"text": "Interestingly, when applied in a situation with limited amounts of data, Konstas et al. (2017) report substantially lower performance scores."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-225",
"text": "In such situations, our PBMT models, like Pourdamghani et al. (2016), appear to be a good alternative."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-226",
"text": "----------------------------------"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-227",
"text": "**CONCLUSION**"
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-228",
"text": "In this work, we systematically studied different MT models to translate AMRs into natural language."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-229",
"text": "We observed that the Delexicalisation, Compression, and Linearisation steps have different impacts on AMR-to-text generation depending on the MT architecture used."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-230",
"text": "We observed that delexicalising AMRs yields the best results in NMT models, in contrast to PBMT models."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-231",
"text": "On the other hand, for both PBMT and NMT models, preordering the AMR in the Linearisation step yields better results."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-232",
"text": "Among our models, PBMT generally outperforms NMT."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-233",
"text": "Finally, the literature suggests that the improvements obtained by having more data are larger than those obtained with improved preprocessing strategies."
},
{
"sent_id": "e4452ce844b74c35f257c916aae120-C001-234",
"text": "Nonetheless, combining the right preprocessing strategy with large volumes of training data should lead to further improvements."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"e4452ce844b74c35f257c916aae120-C001-18"
],
[
"e4452ce844b74c35f257c916aae120-C001-36",
"e4452ce844b74c35f257c916aae120-C001-37"
],
[
"e4452ce844b74c35f257c916aae120-C001-44"
],
[
"e4452ce844b74c35f257c916aae120-C001-58"
],
[
"e4452ce844b74c35f257c916aae120-C001-66"
]
],
"cite_sentences": [
"e4452ce844b74c35f257c916aae120-C001-18",
"e4452ce844b74c35f257c916aae120-C001-36",
"e4452ce844b74c35f257c916aae120-C001-37",
"e4452ce844b74c35f257c916aae120-C001-44",
"e4452ce844b74c35f257c916aae120-C001-58",
"e4452ce844b74c35f257c916aae120-C001-66"
]
},
"@MOT@": {
"gold_contexts": [
[
"e4452ce844b74c35f257c916aae120-C001-18",
"e4452ce844b74c35f257c916aae120-C001-19",
"e4452ce844b74c35f257c916aae120-C001-20"
]
],
"cite_sentences": [
"e4452ce844b74c35f257c916aae120-C001-18"
]
},
"@DIF@": {
"gold_contexts": [
[
"e4452ce844b74c35f257c916aae120-C001-66",
"e4452ce844b74c35f257c916aae120-C001-67"
],
[
"e4452ce844b74c35f257c916aae120-C001-181",
"e4452ce844b74c35f257c916aae120-C001-182"
],
[
"e4452ce844b74c35f257c916aae120-C001-191"
]
],
"cite_sentences": [
"e4452ce844b74c35f257c916aae120-C001-66",
"e4452ce844b74c35f257c916aae120-C001-181",
"e4452ce844b74c35f257c916aae120-C001-191"
]
},
"@SIM@": {
"gold_contexts": [
[
"e4452ce844b74c35f257c916aae120-C001-76"
],
[
"e4452ce844b74c35f257c916aae120-C001-196"
],
[
"e4452ce844b74c35f257c916aae120-C001-207"
],
[
"e4452ce844b74c35f257c916aae120-C001-221"
],
[
"e4452ce844b74c35f257c916aae120-C001-225"
]
],
"cite_sentences": [
"e4452ce844b74c35f257c916aae120-C001-76",
"e4452ce844b74c35f257c916aae120-C001-196",
"e4452ce844b74c35f257c916aae120-C001-207",
"e4452ce844b74c35f257c916aae120-C001-221",
"e4452ce844b74c35f257c916aae120-C001-225"
]
},
"@USE@": {
"gold_contexts": [
[
"e4452ce844b74c35f257c916aae120-C001-78"
],
[
"e4452ce844b74c35f257c916aae120-C001-176"
],
[
"e4452ce844b74c35f257c916aae120-C001-210"
]
],
"cite_sentences": [
"e4452ce844b74c35f257c916aae120-C001-78",
"e4452ce844b74c35f257c916aae120-C001-176",
"e4452ce844b74c35f257c916aae120-C001-210"
]
}
}
},
"ABC_0e5c3df8309dbaf93d10c94fb292fc_3": {
"x": [
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-129",
"text": "**P(I)**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-2",
"text": "A large part of human communication involves referring to entities in the world and often these entities are objects that are visually present for the interlocutors."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-3",
"text": "A system that aims to resolve such references needs to tackle a complex task: objects and their visual features need to be determined, the referring expressions must be recognised, and extra-linguistic information such as eye gaze or pointing gestures need to be incorporated."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-4",
"text": "Systems that can make use of such information sources exist, but have so far only been tested under very constrained settings, such as WOz interactions."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-5",
"text": "In this paper, we apply to a more complex domain a reference resolution model that works incrementally (i.e., word by word), grounds words with visually present properties of objects (such as shape and size), and can incorporate extra-linguistic information."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-6",
"text": "We find that the model works well compared to previous work on the same data, despite using fewer features."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-7",
"text": "We conclude that the model shows potential for use in a realtime interactive dialogue system."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-10",
"text": "Referring to entities in the world via definite descriptions makes up a large part of human communication (Poesio and Vieira, 1997) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-11",
"text": "In task-oriented situations, these references are often to entities that are visible in the shared environment."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-12",
"text": "This kind of reference has attracted attention in recent computational research, but the kinds of interactions studied are often fairly restricted: controlled lab situations (Tanenhaus and Spivey-Knowlton, 1995) or simulated human/computer interactions (Chai et al., 2014)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-13",
"text": "In such task-oriented, co-located settings, interlocutors can make use of extra-linguistic cues such as gaze or pointing gestures."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-14",
"text": "Furthermore, listeners resolve references as they unfold, often identifying the referred entity before the end of the reference (Tanenhaus and Spivey-Knowlton, 1995; Spivey et al., 2002); however, research in reference resolution has mostly focused on full, completed referring expressions."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-15",
"text": "In this paper we make a first move towards addressing somewhat more complex domains."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-16",
"text": "We apply a model of reference resolution, which has been tested in a simpler setup, on more natural data coming from a corpus of human/human interactions."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-17",
"text": "The model is incremental in that it does not wait until the end of an utterance to process, rather it updates its interpretation at each word increment."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-18",
"text": "The model can also incorporate other modalities, such as gaze or pointing cues (deixis) incrementally."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-19",
"text": "We also model the saliency of the context, and show that the model can easily take such contextual information into account."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-20",
"text": "The model improves over previous work on reference resolution applied to the same data (Iida et al., 2010; Iida et al., 2011) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-21",
"text": "The paper is structured as follows: in the following section we discuss related work on incremental resolution of referring expressions."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-22",
"text": "We explain the model that we use in Section 3 and the data we apply it to in Section 4."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-23",
"text": "We then describe the experiments and the results and provide a discussion."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-24",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-25",
"text": "**RELATED WORK**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-26",
"text": "Reference resolution (RR), which is the task of resolving referring expressions (REs) to what they are intended to refer to, has been well-studied in various fields such as psychology (Isaacs and Clark, 1987; Tanenhaus and Spivey-Knowlton, 1995), linguistics (Pineda and Garza, 2000), as well as human/human (Iida et al., 2010) and human/machine interaction (Prasov and Chai, 2010; Siebert and Schlangen, 2008)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-27",
"text": "In recent years, multi-modal corpora have emerged which provide RR with important contextual information: collecting dialogue between two humans (Spanger et al., 2012), between a human and a (simulated) dialogue system (Liu et al., 2013), with gaze, information about the shared environment, and in some cases deixis."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-28",
"text": "It has been shown that incorporating gaze improves RR in a situated setting because speakers need to look at and distinguish from distractors the objects they are describing: this has been shown in a static scene on a computer screen (Prasov and Chai, 2008) , in human-human interactive puzzle tasks (Iida et al., 2010; Iida et al., 2011) , in web browsing (Hakkani-t\u00fcr et al., 2014) , and in a moving car where speakers look at objects in their vicinity (Misu et al., 2014) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-29",
"text": "Incorporating pointing (deictic) gestures is also potentially useful in situated RR; as for example Matuszek et al. (2014) have shown in work on resolving objects processed by computer vision techniques."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-30",
"text": "Chen and Eugenio (2012) looked into reference in multi-modal settings, with focus on co-referential pronouns and pointing gestures."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-31",
"text": "However, these approaches were applied in settings in which communication between the two interlocutors was constrained, or the developed systems did not process incrementally."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-32",
"text": "Kehler (2000) presented an approach that focused more on interaction in a map task, though the model was not incremental, nor did grounding occur between language and world, as we do here."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-33",
"text": "Incremental RR has also been studied in a number of papers, including a framework for fast incremental interpretation (Schuler et al., 2009), a Bayesian filtering model approach that was sensitive to disfluencies, a model that used Markov Logic Networks to resolve objects on a screen, a model of RR and incremental feedback (Traum et al., 2012), and an approach that used a semantic representation to refer to objects (Peldszus et al., 2012; Kennington et al., 2014)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-34",
"text": "However, the approaches reported there did not incorporate multi-modal information, were too slow to work in real-time, were evaluated on constrained data, or only focused on a specific type of RR, ignoring pronouns or deixis."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-35",
"text": "In this paper, we opted to use the model presented in , the simple incremental update model (SIUM)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-36",
"text": "It has been tested extensively against data from a puzzle-playing human/computer interaction domain (the PENTO data); it can incorporate multi-modal information, works in real-time, and can resolve definite, exophoric, and deictic references in a single framework, all of which makes it a potential candidate for working in an interactive, multi-modal dialogue system."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-37",
"text": "The model is similar to the one proposed in Funakoshi et al. (2012) , which could resolve descriptions, anaphora, and deixis in a unified manner, but that model does not work incrementally."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-38",
"text": "The main contributions of this paper are the more thorough exposition of the model (in Section 3) and its application and evaluation on much less constrained, more interactive (and hence realistic) data than what it has previously been tested on (Section 4)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-39",
"text": "Moreover, the data set used here is also from a typologically very different language (Japanese) than what the model has been previously tested on (German), and so the robustness of the model against these differences is also investigated."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-40",
"text": "We will now describe the model, and that will be followed by a description of the corpus we used."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-42",
"text": "**THE SIMPLE INCREMENTAL UPDATE MODEL**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-43",
"text": "Following and Kennington et al. (2014) , we model the task at hand as one of recovering I, the intention of the speaker making the RE, where I ranges over the possible alternatives (the objects in the domain)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-44",
"text": "This recovery proceeds incrementally (word by word), for RE of arbitrary length."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-45",
"text": "That is, if U denotes the current word, we are interested in P (I|U ), the current hypothesis about the intended referent, given the observed word."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-46",
"text": "We assume the presence of an unobserved, latent variable R (which models properties of the candidate objects, such as colour or shape; explained further below), and so the computation formally is:"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-47",
"text": "Which, after making some independence assumptions, can be factored into P(I|U) \u221d \u03a3_R P(U|R) P(R|I) P(I)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-48",
"text": "This is an update model in the usual sense that the posterior P (I|U ) at one step becomes the prior P (I) at the next."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-49",
"text": "P (R|I) provides the link between the intentions (that is, objects) and the properties (e.g., the colour and shape of each object), and P (U |R) the link between properties and (observed) words."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-50",
"text": "Being incremental, this model is computed at each word."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-51",
"text": "As properties play an important role in this model, they will now be explained."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-53",
"text": "**PROPERTIES**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-54",
"text": "The variable R models visual or abstract properties of entities (such as real-world objects or linguistic entities) and their selection for verbalisation in the referring expression."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-55",
"text": "The simple assumption made by the model is that only such properties can be selected for verbalisation which the candidate object actually has."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-56",
"text": "Hence, the starting point for the model is a representation of the world and the current dialogue context in terms of the properties of the objects."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-57",
"text": "For this paper, this means properties belonging to objects in the shared work space."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-58",
"text": "We will explain the properties we used in our implementation of this model (henceforth SIUM, i.e., simple incremental update model), the motivation for using them, and give an example of applying the model in Section 5."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-60",
"text": "**THE REX DATA**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-61",
"text": "The corpora presented in Iida et al. (2011) and Spanger et al. (2012) are a collection of human/human interaction data where the participants collaboratively solved Tangram puzzles."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-62",
"text": "In this setting, anaphoric references (i.e., pronoun references to entities in an earlier utterance, e.g., \"move it to the left\") and exophoric references via definite descriptions (i.e., references to real-world objects, e.g., \"that one\" or \"the big triangle\") are common (note that both refer in different ways to objects that are physically present)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-63",
"text": "The corpus also records an added modality: the gaze of the puzzle solver (SV) who gives the instructions and that of the operator (OP), who moves the tangram pieces."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-64",
"text": "The mouse pointer controlled by the OP could also be considered a modality, used as a kind of pointing gesture that both participants can observe."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-65",
"text": "The goal of the task was to arrange puzzle pieces on a board into a specified shape (example in Figure 1 ), which was only known to SV and hidden from OP."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-66",
"text": "The language of the dialogues was Japanese."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-108",
"text": "For vertical placement, top, center and bottom properties were given to objects in the respective vertical segments."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-67",
"text": "Figure 1: Example Tangram Board; the goal shape is the swan in the top left, the shared work area is the large board on the right, the mouse cursor and OP gaze (blue dot) are on object 5, the SV gaze is the red dot (gaze points were not seen by the participants)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-68",
"text": "This environment provided frequent use of REs that aimed to distinguish puzzle pieces (and piece groups) from each other."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-69",
"text": "The following are some example REs from the REX corpus:"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-70",
"text": "(1) a. chicchai sankakkei\n    b. small triangle\n(2) a. sono ichiban migi ni shippo ni natte iru sankakkei\n    b. that most right tail becoming triangle\n       'that right-most triangle that is the tail'"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-71",
"text": "Example (1) is a typical example of an RE as found in the corpus."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-72",
"text": "Note that this at the same time constitutes the whole utterance, which hence can be classified as a non-sentential utterance (Schlangen, 2004)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-73",
"text": "Its transliteration consists of 8 Japanese characters, which could be tokenized into two words."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-74",
"text": "The more difficult RE shown in Example (2) requires the model to learn how spatial placements map to certain descriptions."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-75",
"text": "Moreover, Japanese is a head-final language, in which landmark pieces used for comparison are uttered before the referent."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-76",
"text": "Also, because this was a highly interactive setting, many exophoric pronouns were used, e.g., sore and sono, both meaning that."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-77",
"text": "2 Pronoun references like this made up around 32% of the utterances."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-78",
"text": "Corpus annotations included (for both participants) transcriptions of utterances, the object being looked at any given time, the object being pointed at or manipulated by the mouse, segmentation of the REs and the corresponding referred object or objects."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-79",
"text": "The spatial layout of the board was recorded each time an object was manipulated."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-80",
"text": "Further details of the corpus can be found in Iida et al. (2011)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-81",
"text": "In order to directly compare our work with previous work, in our evaluations below we consider the same annotated REs."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-82",
"text": "Iida et al. (2011) applied a support vector machine-based ranking algorithm (Joachims, 2002) to the task of resolving REs in this corpus."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-83",
"text": "They used a total of 36 binary features in the SVM classifier, which predicted the referred object."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-84",
"text": "They further used a separate model for pronoun utterances and non-pronoun utterances, allowing the classifier to learn patterns without confusing utterance types."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-85",
"text": "More details on the results of these models are given below."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-86",
"text": "The SIU-model has previously been applied to two datasets from the Pentomino domain, where the speaker's goal was to identify one out of a set of tetris-like puzzle pieces (consisting of five instead of four blocks)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-87",
"text": "However, in these datasets, the references were \"one-shot\" and not embedded in longer dialogues, as is the case in the REX corpus."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-88",
"text": "The differences between the two tasks are summarised in"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-90",
"text": "**EXPERIMENT**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-91",
"text": "Procedure The procedure for this experiment is as follows."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-92",
"text": "In order to compare our results directly with those of Iida et al. (2011) , we provide our model with the same training and evaluation data, in a 10-fold cross-validation of the 1192 REs from 27 dialogues (the T2009-11 corpus in )."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-93",
"text": "For development, we used a separate part of the REX corpus (N2009-11) that was structured similarly to the one used in our evaluation."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-94",
"text": "Task The task is RR."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-95",
"text": "At each increment, SIUM returns a distribution over all objects; the probability for each object represents the strength of the belief that it is the referred one."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-96",
"text": "The argmax of the distribution is chosen as the hypothesised referred object."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-97",
"text": "P(R|I) models the likelihood of selecting a property of a candidate object for verbalisation; this likelihood is assumed to be uniform over all the properties that the candidate object has."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-98",
"text": "3 We derive these properties from a representation of the scene, similar to how Iida et al. (2011) computed features to present to their classifier: namely Ling (linguistic features), TaskSp (task-specific features), and Gaze (from SV only)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-99",
"text": "Some features were binary, others such as shape and size had more values."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-100",
"text": "Table 2 shows all the properties that were used here."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-101",
"text": "Each will now be explained."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-102",
"text": "Ling Each object had a shape, size, and relative position to the other pieces."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-103",
"text": "We determined the shape and size properties by hand; these remained static throughout each dialogue."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-104",
"text": "The position properties were derived from the corpus logs."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-105",
"text": "For each object, its centroid was computed."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-106",
"text": "The vertical and horizontal range over all of the objects was then calculated and split into three even sections in each dimension (see Figure 2)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-107",
"text": "An object with a centroid in the left-most section of the horizontal range received a left property, similarly middle and right properties were calculated for corresponding objects."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-109",
"text": "Figure 2 shows an example segmentation."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-110",
"text": "Each object had a vertical and a horizontal property at all times; however, moving an object could result in a change of one of these spatial properties as the dialogue progressed."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-111",
"text": "As an example, compare Figure 1 , which is a snapshot of the interaction towards the beginning, and Figure 2 , which shows a later stage of the game board; spatial layout changes throughout the dialogue."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-112",
"text": "These properties differ somewhat from the features for the Ling model presented in Iida et al. (2011) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-113",
"text": "Three features that we did use as properties had to do with reference recency: the most recently referred object received a referred X property if it was referred to in the past 5, 10, or 20 seconds."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-114",
"text": "TaskSp Iida et al. (2011) used 14 task-specific features, three of which they found to be the most informative in their model."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-115",
"text": "Here, we will only use the two most informative features as properties (the third one, whether or not an object was being manipulated at the beginning of the RE, did not improve results in a held-out test): the object that was most recently moved received the most recent move property and objects that have the mouse cursor over them received the mouse pointed property (see Figure 2 ; object 4 would receive both of these properties, but only for the duration that the mouse was actually over it)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-116",
"text": "Each of these properties can be extracted directly from the corpus annotations."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-117",
"text": "Gaze Similar to Iida et al. (2011) , we consider gaze during a window of 1500ms before the onset of the RE."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-118",
"text": "The object that was gazed at the longest during that time received a longest gazed at property, the object which was fixated upon most recently during that interval before the RE onset received a recent fixation property, and the object which had the most fixations received the most gazed at property."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-119",
"text": "During an RE, an object received the gazed at in utt property if it was gazed at during the RE up until that point."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-120",
"text": "These properties can be extracted directly from the corpus annotations."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-121",
"text": "Other gaze features are not accessible to an incremental model such as this, as features extracted from gaze activity over the whole RE can only be computed once it is complete."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-122",
"text": "Our Gaze properties are made up of these 4 properties, as opposed to the 14 features in Iida et al. (2011) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-124",
"text": "**P(U|R)**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-125",
"text": "is the model that connects the property selected for verbalisation with a way of verbalising it (a value for U )."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-126",
"text": "Instead of directly learning this model from data, which would suffer from data sparseness, we trained a naive Bayes model for P(R|U) (as, according to Bayes' rule, P(U|R) is equal to P(R|U)P(U)/P(R), which, plugged into formula (2), cancels out the 1/P(U); further assuming that P(R) is uniform, we can directly replace P(U|R) with P(R|U) here)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-127",
"text": "On the language side (the variable U in the model), we used n-grams over Japanese characters (we attempted tokenisation of the REs into words, but found that using characters worked just as well in the held-out set)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-130",
"text": "The prior P (I) is the posterior of the previously computed increment."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-131",
"text": "In the first increment, it can simply be set to a uniform distribution."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-132",
"text": "Here, we apply a more informative prior based on saliency."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-133",
"text": "We learn a context model which is queried when the first word begins, taking information about the context immediately before the beginning of the RE into account, producing a distribution over objects, which becomes P (I) of the first increment in the RE."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-134",
"text": "The context model itself is a simple application of the SIUM, where instead of being a word, U is a token that represents saliency."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-135",
"text": "The context model thus learns which properties are important in the pre-RE context and provides an up-to-date distribution over the objects as an RE begins."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-137",
"text": "**EXAMPLE**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-138",
"text": "Figure 3: Example scene with two triangles and one square, 1 is being looked at by the SV, 3 was recently moved and the mouse pointer is still over it."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-139",
"text": "We will now give a simple example of how the model is applied to the REX data using a subset of the above properties for the RE small triangle."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-140",
"text": "Table 3 shows a simple normalised co-occurrence count of how many times properties were observed as belonging to a referred object (the basis for P (U |R))."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-141",
"text": "Figure 3 shows the current toy scene, and Table 4 shows the properties that each object in the scene has during the utterance."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-142",
"text": "Table 5 shows the full application of the model, summing over the properties for the product P(U|R)P(R|I) and multiplying by the prior P(I), the posterior of the previous step."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-143",
"text": "Included in this example is how the initial prior is computed from the context model."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-144",
"text": "Table 3: Application of P(U|R) for some values of U and R; we assume that this model is learned from data (rows are excerpted from a larger distribution over all the words in the vocabulary). Table 4: P(R|I) for our example domain."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-145",
"text": "The probability mass is distributed over the number of properties that a candidate object actually has."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-146",
"text": "Before the RE even begins, the saliency prior already ranks 3 as the most likely referent; it was the most salient in that it was the most recently moved object and the mouse pointer was still over it."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-147",
"text": "However, initial prior information alone is not enough to resolve the intended object; for that the RE is needed."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-148",
"text": "After the word small is uttered, 1 is the most likely referred object."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-149",
"text": "After triangle, 1 remains the highest in the distribution."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-150",
"text": "With the RE alone, in this case there would have been enough information to infer that 1 was the referred object, but adding the prior information provided additional evidence."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-151",
"text": "Table 5: Application of the model to the RE small triangle, where 1 is the referred object. Evaluation Metrics We report results of our evaluation in referential accuracy on utterances that were annotated as referring to a single object (resolution of references to groups of objects is left for future work)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-152",
"text": "Going beyond Iida et al. (2011), our model computes a resolution hypothesis incrementally; for this aspect of the system we followed previously used evaluation metrics: first correct (how deep into the RE does the model predict the referent for the first time?), first final (how deep into the RE does the model predict the correct referent and keep that decision until the end?), and edit overhead (how often did the model unnecessarily change its prediction? the only necessary change happens when it first makes a correct prediction)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-153",
"text": "We compare non-incremental results to three evaluations performed in Iida et al. (2011) , namely when Ling is used alone, Ling+TaskSP used together, and Ling+TaskSp+Gaze."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-154",
"text": "Furthermore, they show results of models where a separate part handled REs that used pronouns, as well as a part that handled the non-pronoun REs, and a combined model that handled both types of expressions."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-156",
"text": "**RESULTS**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-157",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-158",
"text": "**REFERENCE RESOLUTION**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-159",
"text": "Results of our evaluation are shown in Figure 4 ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-160",
"text": "The SIUM model performs better than the combined approach of Iida et al. (2011), and performs better than their separated model when not including gaze (there is a significant difference between SIUM and the separated models for Ling+TaskSp, though SIUM only got one more correct than the separated model)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-161",
"text": "This is a welcome result, as it shows that our very simple incremental model that uses a basic classifier is comparable to a non-incremental approach that uses a more complicated classifier."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-162",
"text": "It further shows that the SIUM model is robust to using TaskSp and Gaze features as properties, as long as those features are available immediately before the RE begins, or during the RE."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-163",
"text": "The best-performing approach is the Iida2011-separated model with gaze."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-164",
"text": "This is the case for several reasons: first, their models use features that are not available to our incremental model (e.g., their model uses 14 gaze features, some of which were based on the entire RE, ours only uses 4 properties)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-165",
"text": "Second, and more importantly, separating the models means less feature confusion: in Section 5.2 of Iida et al. (2011), the authors give a comparison of the most informative features for each model; task and gaze features were prominent for the pronoun model, whereas gaze and language features were prominent for the non-pronoun model."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-166",
"text": "We also tested SIUM under separated conditions to better compare with the approaches presented here."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-167",
"text": "The separated models, however, did not improve."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-168",
"text": "This, we assume, is because the model grounds language with properties (see Discussion below)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-169",
"text": "An interactive dialogue system might not have the luxury of choosing between two models at runtime."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-170",
"text": "We assume that a model that can sufficiently handle both types of utterances is to be preferred to one that doesn't. [Table 6, with REs binned by length in characters (1-5, 6-8, 9-14): first correct (% into RE): 35.47, 22.34, 14.8; first final (% into RE): 69.0, 49.85, 48.0; edit overhead (all lengths): 0.88%; never correct (all lengths): 5.5%]"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-171",
"text": "Table 6 shows how our model fares using the incremental metrics described earlier."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-172",
"text": "(As this was not done in Iida et al. (2011), direct comparison is not possible.) For the evaluation, REs are binned into short, normal, and long (1-5, 6-8, and 9-14 characters, respectively, based on the average length of REs in this corpus), to make relative statements (\"% into the utterance\") comparable."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-174",
"text": "**INCREMENTAL BEHAVIOUR**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-175",
"text": "Ideally, a system would make the first correct decision as early as possible without changing that decision."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-176",
"text": "The results in the table show a respectable incremental model; on average it picks the right object early and settles on a final decision before the end of the RE, with low edit overhead, meaning it rarely makes unnecessary changes to its prediction."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-177",
"text": "In some cases, SIUM never guessed the correct object, labeled never correct in the table."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-178",
"text": "These incremental results are consistent with previous work for the SIUM; overall, the model is stable across the RE."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-179",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-180",
"text": "**DISCUSSION**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-181",
"text": "Despite its simplicity, SIUM differs from previous work in an important way that allows it to improve over it."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-182",
"text": "It learns to connect object properties selected for verbalisation to ways of verbalising them, and forms a stochastic expectation about which properties might be selected for verbalisation (namely, those that are present)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-183",
"text": "This represents a type of grounding (Harnad, 1990; Roy, 2005) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-184",
"text": "4 In terms of the SIUM formalism, the link between object and words is mediated by the properties the object has and by a stochastic process of associating words with properties."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-185",
"text": "Figure 6 visualises this: each word has a stochastic connection to each property, and objects have sets of properties."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-186",
"text": "The property names are arbitrary as long as they are consistent."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-187",
"text": "In contrast, previous work in RR (Iida et al., 2011; Chai et al., 2014 ) used a hand-coded concept-labeled semantic representation and checked if aspects of the RE match that of a particular object."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-188",
"text": "If so, a binary compatibility feature was set."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-189",
"text": "Figure 5 shows this; words can only link to objects via hand-crafted rules (e.g., the word or FOL predicate and property string must match)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-190",
"text": "Because of the way SIUM uses properties, it can also handle (exophoric) pronoun resolution, deixis (the mouse pointer), and definite descriptions in a single framework."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-191",
"text": "This is a nice feature of the model: adding additional modalities does not require model reformulation."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-192",
"text": "Incorporating saliency information via a context model is also a nice feature of the model."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-193",
"text": "In this paper, we computed the initial P (I) using a context model instantiated by SIUM."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-194",
"text": "By considering only this saliency information, the context model can predict the referred object in 41% of the cases."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-195",
"text": "It also learned which properties were important for saliency (that is, these are the properties that the model would most likely select): recently fixated, most gazed at, longest gazed at, and prev ref, as might be expected."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-196",
"text": "In less than 2% of the cases, the context model referred to the correct object, but was wrongly \"overruled\" when processing the corresponding RE."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-197",
"text": "There were shortcomings, however."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-198",
"text": "In previous work, it was shown that SIUM performed well when REs contained pronouns (see , experiment 2)."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-199",
"text": "However, in the current work we observed that REs with pronouns were more difficult for the model to resolve than the model presented in Iida et al. (2011) ."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-200",
"text": "We surmise that SIUM had a difficult time grounding certain properties, as the Japanese pronoun sore can be used anaphorically or demonstratively in this kind of context (i.e., sometimes sore refers to previously-manipulated objects, or objects that are newly identified with a mouse pointer over them); the model presented in Iida et al. (2011) made more use of contextual information when pronouns were used, particularly in the combined model which incorporated gaze information, as shown above."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-202",
"text": "**CONCLUSION**"
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-203",
"text": "The SIUM is a model of RR that grounds language with the world, works incrementally, can incorporate modalities such as gaze and deixis, and can resolve multiple kinds of RRs in a single framework."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-204",
"text": "This paper represents the natural next step of evaluating SIUM in a setting that was less constrained and more interactive, with the added knowledge that it can work in more than one language."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-205",
"text": "There is more to be tested for SIUM."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-206",
"text": "A common form of RR happens collaboratively over multiple utterances (Clark and Wilkes-Gibbs, 1986; Heeman and Hirst, 1995); however, SIUM has only been tested on isolated REs."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-207",
"text": "Though SIUM required fewer features (realised as properties) than previous work, those properties still need to be computed."
},
{
"sent_id": "0e5c3df8309dbaf93d10c94fb292fc-C001-208",
"text": "We leave for future work investigation of a version of the model that can ground language with raw(er) information from the world (e.g., vision information), eliminating the need to determine properties."
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-20"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-112"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-114",
"0e5c3df8309dbaf93d10c94fb292fc-C001-115"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-122"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-160"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-187"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-199"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-200"
]
],
"cite_sentences": [
"0e5c3df8309dbaf93d10c94fb292fc-C001-20",
"0e5c3df8309dbaf93d10c94fb292fc-C001-112",
"0e5c3df8309dbaf93d10c94fb292fc-C001-114",
"0e5c3df8309dbaf93d10c94fb292fc-C001-122",
"0e5c3df8309dbaf93d10c94fb292fc-C001-160",
"0e5c3df8309dbaf93d10c94fb292fc-C001-187",
"0e5c3df8309dbaf93d10c94fb292fc-C001-199",
"0e5c3df8309dbaf93d10c94fb292fc-C001-200"
]
},
"@BACK@": {
"gold_contexts": [
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-28"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-61"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-82"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-114"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-165"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-187"
]
],
"cite_sentences": [
"0e5c3df8309dbaf93d10c94fb292fc-C001-28",
"0e5c3df8309dbaf93d10c94fb292fc-C001-61",
"0e5c3df8309dbaf93d10c94fb292fc-C001-82",
"0e5c3df8309dbaf93d10c94fb292fc-C001-114",
"0e5c3df8309dbaf93d10c94fb292fc-C001-165",
"0e5c3df8309dbaf93d10c94fb292fc-C001-187"
]
},
"@USE@": {
"gold_contexts": [
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-80"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-92"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-153"
]
],
"cite_sentences": [
"0e5c3df8309dbaf93d10c94fb292fc-C001-80",
"0e5c3df8309dbaf93d10c94fb292fc-C001-92",
"0e5c3df8309dbaf93d10c94fb292fc-C001-153"
]
},
"@SIM@": {
"gold_contexts": [
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-98"
],
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-117"
]
],
"cite_sentences": [
"0e5c3df8309dbaf93d10c94fb292fc-C001-98",
"0e5c3df8309dbaf93d10c94fb292fc-C001-117"
]
},
"@EXT@": {
"gold_contexts": [
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-152"
]
],
"cite_sentences": [
"0e5c3df8309dbaf93d10c94fb292fc-C001-152"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"0e5c3df8309dbaf93d10c94fb292fc-C001-172"
]
],
"cite_sentences": [
"0e5c3df8309dbaf93d10c94fb292fc-C001-172"
]
}
}
},
"ABC_6c4264bedb6683e909c1e530f22262_3": {
"x": [
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-6",
"text": "The same model achieved improvements of up to 5.3 and 6.5 p.p."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-2",
"text": "Mention detection is an important aspect of the annotation task and interpretation process for applications such as coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-3",
"text": "In this work, we propose and compare three neural network-based approaches to mention detection."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-4",
"text": "The first approach is based on the mention detection part of a state-of-the-art coreference resolution system; the second uses ELMo embeddings together with a bidirectional LSTM and a biaffine classifier; the third approach uses the recently introduced BERT model."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-5",
"text": "Our best model (using a biaffine classifier) achieved gains of up to 1.8 percentage points on mention recall when compared with a strong baseline in a HIGH RECALL setting."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-7",
"text": "when compared with the best-reported mention detection F1 on the CONLL and CRAC data sets respectively in a HIGH F1 setting."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-8",
"text": "We further evaluated our models on coreference resolution by using mentions predicted by our best model in the start-of-the-art coreference systems."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-9",
"text": "The enhanced model achieved absolute improvements of up to 1.7 and 0.7 p.p."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-10",
"text": "when compared with the best pipeline system and the state-of-the-art end-to-end system respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-13",
"text": "Mention detection (MD) is the task of identifying mentions of entities in text."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-14",
"text": "It is an important preprocessing step for downstream applications such as coreference resolution ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-15",
"text": "As such, the quality of mention detection affects very deeply both the quality of an annotation and the performance of a model for such applications (Chamberlain et al., 2016; Poesio et al., 2019) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-16",
"text": "Comparing to the simplified version that focuses on classifying named entity mentions for named entity recognition (NER), the full MD task for coreference resolution is more complex in two respects: firstly, it identifies more mention types, such as nominal mentions and pronouns; secondly, the mentions can be nested, so the task cannot be treated as a simple sequence labelling task, as is the norm in NER systems."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-17",
"text": "The most recent neural network approaches such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) , have achieved substantial improvements in the NER benchmark CONLL 2003 data set."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-18",
"text": "However, most of the MD system used by the state-of-the-art coreference systems do not take advantage of these advances and still heavily rely on parse trees (Bj\u00f6rkelund and Kuhn, 2014; Wiseman et al., 2015; Wiseman et al., 2016; Clark and Manning, 2016a; Clark and Manning, 2016b) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-19",
"text": "They either use all the NPs as candidate mentions (Bj\u00f6rkelund and Kuhn, 2014; Wiseman et al., 2015; Wiseman et al., 2016) or use the rule-based mention detector from the Stanford deterministic system (Lee et al., 2013) to extract mentions from NPs, named entity mentions and pronouns (Clark and Manning, 2015; Clark and Manning, 2016b) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-20",
"text": "There are only very few studies that attempt to apply neural network approaches to the MD task."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-21",
"text": "Lee et al. (2017; Lee et al. (2018) first introduced a neural mention detector as a part of their end-to-end coreference system; however, the system does not output intermediate mentions, hence the mention detector cannot be used by other coreference systems directly."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-22",
"text": "To the best of our knowledge, Poesio et al. (2018) introduced the only standalone neural mention detector."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-23",
"text": "By using a modified version of the NER system of Lample et al. (2016) , they showed substantial performance gains at mention detection on the benchmark CONLL 2012 data set and on the CRAC 2018 data set when compared with the Stanford deterministic system (Lee et al., 2013) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-24",
"text": "To build a high accuracy standalone MD system is not only important for the downstream applications, but also beneficial for annotation tasks that require mentions (Chamberlain et al., 2016; Poesio et al., 2019) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-25",
"text": "In this paper, we compare three neural architectures for MD."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-26",
"text": "The first system is a slightly modified version of the mention detection part of the Lee et al. (2018) system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-27",
"text": "The second system employs a bi-directional LSTM on the sentence level and uses biaffine attention (Dozat and Manning, 2017) over the LSTM outputs to predict the mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-28",
"text": "The third system takes the outputs from BERT (Devlin et al., 2019) and feeds them into a feed-forward neural network to classify candidates into mentions and non mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-29",
"text": "We evaluate these three models on both the CONLL and the CRAC data sets, with the following results."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-30",
"text": "Firstly, we show that better mention performance of up to 1.5 percentage points 1 can be achieved by training the mention detector alone."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-31",
"text": "Secondly, our best system achieves improvements of 5.3 and 6.5 percentage points when compared with Poesio et al. (2018) 's neural MD system on CONLL and CRAC respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-32",
"text": "Thirdly, by using better mentions from our mention detector, we can improve the end-to-end Lee et al. (2018) system and the Clark and Manning (2016a) pipeline system by up to 0.7% and 1.7% respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-34",
"text": "**RELATED WORK**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-35",
"text": "Mention detection."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-36",
"text": "Despite neural networks having shown high performance in many natural language processing tasks, the rule-based mention detector of the Stanford deterministic system (Lee et al., 2013) remains frequently used in top performing coreference systems (Clark and Manning, 2015; Clark and Manning, 2016a; Clark and Man-ning, 2016b) , including the best pipeline system itself based on neural networks (Clark and Manning, 2016a) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-37",
"text": "This mention detector uses a set of predefined heuristic rules to select mentions from NPs, pronouns and named entity mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-38",
"text": "Many other coreference systems simply use all the NPs as the candidate mentions (Bj\u00f6rkelund and Kuhn, 2014; Wiseman et al., 2015; Wiseman et al., 2016) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-39",
"text": "Lee et al. (2017) first introduced a neural network based end-to-end coreference system in which the neural mention detection part is not separated."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-40",
"text": "This move proved very effective; however, as a result the mention detection part of their system needs to be trained jointly with the coreference resolution part, hence can not be used separately."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-41",
"text": "The system has been later extended by Zhang et al. (2018) and Lee et al. (2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-42",
"text": "Zhang et al. (2018) added biaffine attention to the coreference part of the Lee et al. (2017) system, improving the system by 0.6%."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-43",
"text": "Biaffine attention is also used in one of our approaches (BIAFFINE MD), but in a totally different manner, i.e. we use biaffine attention for mention detection while in Zhang et al. (2018) biaffine attention was used for computing mention-pair scores."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-44",
"text": "The system is the current state-of-the-art coreference system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-45",
"text": "In this new system, the Lee et al. (2017) model is substantially improved through the use of ELMo embeddings (Peters et al., 2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-46",
"text": "Other machine learning based mention detectors include Uryupina and Moschitti (2013) and Poesio et al. (2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-47",
"text": "The Uryupina and Moschitti (2013) system takes all the NPs as candidates and trains a SVM-based binary classifier to select mentions from all the NPs."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-48",
"text": "Poesio et al. (2018) briefly discuss a neural mention detector that they modified from the NER system of Lample et al. (2016) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-49",
"text": "The system uses a bidirectional LSTM followed by a FFNN to select mentions from spans up to a maximum width."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-50",
"text": "The system achieved substantial gains on mention F1 when compared with the (Lee et al., 2013) on CONLL and CRAC data sets."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-51",
"text": "Named entity recognition."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-52",
"text": "A subtask of mention detection that focuses only on detecting named entity mentions is studied more frequently."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-53",
"text": "However, most of the proposed approaches treat the NER task as a sequence labelling task which can not be directly applied to the MD task for coreference, as the later usually allow nested mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-54",
"text": "The first neural network based NER model was introduced by Collobert et al. (2011) , who used a CNN to encode the tokens and apply a CRF layer on top."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-55",
"text": "After that, many other network architectures for NER MD have also been proposed, such as LSTM-CRF (Lample et al., 2016; Chiu and Nichols, 2016) , LSTM-CRF + ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-56",
"text": ", s i , e i are the start and the end indices of N i where 1 \u2264 i \u2264 I. The task for an MD system is to assign all the spans (N ) a score (r m ) so that spans can be classified into two classes (mention or non mention), hence is a binary classification problem."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-58",
"text": "**SYSTEM ARCHITECTURE**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-59",
"text": "In this paper, we introduce three MD systems that use the latest neural network architectures 2 ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-60",
"text": "The first approach uses the mention detection part from the start-of-the-art coreference resolution system , which we refer to as LEE MD."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-61",
"text": "We remove the coreference part of the system and change the loss function to sigmoid cross entropy, that is commonly used for binary classification problems."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-62",
"text": "The second approach (BIAFFINE MD) uses a bi-directional LSTM to encode the sentences of the document, followed by a biaffine classifier (Dozat and Manning, 2017) to score the candidates."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-63",
"text": "The third approach (BERT MD) uses BERT (Devlin et al., 2019) to encode the document in the sentence level; in addition, a feed-forward neural network (FFNN) to score the candidate mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-64",
"text": "The three architectures are summarized in Figure 1 and discussed in detail below."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-65",
"text": "All three architectures are available in two output modes: HIGH F1 and HIGH RECALL."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-66",
"text": "The HIGH F1 mode is meant for applications that require highest accuracy, such as preprocessing for annotation."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-67",
"text": "The HIGH RECALL mode, on the other hand, predicts as many mentions as possible, which is more appropriate for preprocessing for a coreference system since mentions can be further filtered by the system during coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-68",
"text": "In HIGH F1 mode we output mentions whose probability p m (i) is larger then a threshold \u03b2 such as 0.5."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-69",
"text": "In HIGH RECALL mode we output mentions based on a fixed mention/word ratio \u03bb; this is the same method used by Lee et al. (2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-71",
"text": "**LEE MD**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-72",
"text": "Our first system is based on the mention detection part of the Lee et al. (2018) system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-73",
"text": "The system represents a candidate span with the outputs of a bi-directional LSTM."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-74",
"text": "The sentences of a document are encoded bidirectional via the LSTMs to obtain forward/backward representations for each token in the sentence."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-75",
"text": "The bi-directional LSTM takes as input the concatenated embeddings ((x t ) T t=1 ) of both word and character levels."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-76",
"text": "For word embeddings, GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) embeddings are used."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-77",
"text": "Character embeddings are learned from convolution neural networks (CNN) during training."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-78",
"text": "The code is available at https://github.com/ juntaoy/dali-md (2018) coreference system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-79",
"text": "(b) Our second approach that uses biaffine classifier (Dozat and Manning, 2017) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-80",
"text": "(c) Our third approach that uses BERT (Devlin et al., 2019) to encode the document."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-81",
"text": "where \u03c6(i) is the span width feature embeddings."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-82",
"text": "To make the task computationally tractable, the model only considers the spans up to a maximum length of l, i.e. e i \u2212 s i < l, (s i , e i ) \u2208 N ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-83",
"text": "The span representations are passed to a FFNN to obtain the raw candidate scores (r m )."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-84",
"text": "The raw scores are then used to create the probabilities (p m ) by applying a sigmoid function to the r m :"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-85",
"text": "For the HIGH RECALL mode, the top ranked \u03bbT spans are selected from lT candidate spans (\u03bb < l) by ranking the spans in a descending order by their probability (p m )."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-86",
"text": "For the HIGH F1 mode, the spans that have a probability (p m ) larger than the threshold \u03b2 are returned."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-88",
"text": "**BIAFFINE MD**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-89",
"text": "In our second model, the same bi-directional LSTM is used to encode the tokens of a document in the sentence level."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-90",
"text": "However, instead of using the concatenations of multiple word/character embeddings, only ELMo embeddings are used, as we find in preliminary experiments that the additional GloVe embeddings and character-based embeddings do not improve the accuracy."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-91",
"text": "After obtaining the token representations from the bidirectional LSTM, we apply two separate FFNNs to create different representations (h s /h e ) for the start/end of the spans."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-92",
"text": "Using different representations for the start/end of the spans allows the system to learn important information to identify the start/end of the spans separately."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-93",
"text": "This is an advantage when compared to the model directly using the output states of the LSTM, since the tokens that are likely to be the start of the mention and end of the mention are very different."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-94",
"text": "Finally, we employ a biaffine attention (Dozat and Manning, 2017) over the sentence to create a l s \u00d7 l s scoring metric (r m ), where l s is the length of the sentence."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-95",
"text": "More precisely, we compute the raw score for span i (N i ) by:"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-96",
"text": "where s i and e i are the start and end indices of N i , W m is a d \u00d7 d metric and b m is a bias term which has a shape of d \u00d7 1."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-97",
"text": "The computed raw score (r m ) covers all the span combinations in a sentence, to compute the probability scores (p m ) of the spans we further apply a simple constrain (s i \u2264 e i ) such that the system only predict valid mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-98",
"text": "Formally:"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-99",
"text": "The resulted p m are then used to predict mentions by filtering out the spans according to different requirements (HIGH RECALL or HIGH F1)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-101",
"text": "**BERT MD**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-102",
"text": "Our third approach is based on the recently introduced BERT model (Devlin et al., 2019) which encodes sentences by deep bidirectional transformers."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-103",
"text": "Our model uses a pretrained BERT model to encode the documents in the sentence level to create token representations x * t ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-104",
"text": "The pretrained BERT model uses WordPiece embeddings (Wu et al., 2016) , in which tokens are further split into smaller word pieces as the name suggested."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-105",
"text": "For example in sentence: We respect ##fully invite you to watch a special edition of Across China ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-106",
"text": "The token \"respectfully\" is split into two pieces (\"respect\" and \"fully\")."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-107",
"text": "In the case that tokens have multiple representations (word pieces), we use the first representation of the token."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-108",
"text": "An indicator list is created during the data preparation step to link the tokens to the correct word pieces."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-109",
"text": "After obtaining the actual word representations, the model then creates candidate spans by considering spans up to a maximum span length (l)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-110",
"text": "The spans are represented by the concatenated representations of the start/end tokens of the spans."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-111",
"text": "This is followed by a FFNN and a sigmoid function to assign each span a probability score:"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-112",
"text": "We use the same methods we used for our first approach (LEE MD) to select mentions based on different settings (HIGH RECALL or HIGH F1) respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-114",
"text": "**LEARNING**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-115",
"text": "The learning objective of our mention detectors is to learn to distinguish mentions form non-mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-116",
"text": "Hence it is a binary classification problem, we optimise our models on the simple but effective cross entropy:"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-117",
"text": "where y i is the gold label (y i \u2208 {0, 1}) of i th spans."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-119",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-120",
"text": "We ran two series of experiments."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-121",
"text": "The first series of experiments focuses only on the mention detection task, and we evaluate the performance of the proposed mention detectors in isolation."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-122",
"text": "The second series of experiments focuses on the effects of our model on the downstream applications: i.e., we integrate the mentions extracted from our best system into state-of-the-art coreference systems (both end-to-end and the pipeline system)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-123",
"text": "The rest of this section introduces our experimental settings in detail."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-125",
"text": "**DATA SET**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-126",
"text": "We evaluate our models on two different corpora, the CONLL 2012 English corpora (Pradhan et al., 2012) and the CRAC 2018 corpora (Poesio et al., 2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-127",
"text": "The CONLL data set is the standard reference corpora for coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-128",
"text": "The English subset consists of 2802, 342, and 348 documents for the train, development and test sets respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-129",
"text": "The CONLL data set is not however ideal for mention detection, since not all mentions are annotated, but only mentions involved in coreference chains of length > 1."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-130",
"text": "This has a negative impact on learning since singleton mentions will always receive negative labels."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-131",
"text": "The CRAC corpus uses data from the ARRAU corpus (Uryupina et al., 2019 corpus is more appropriate for studying mention detection as all mentions are annotated."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-132",
"text": "As done in the CRAC shared task, we used the RST portion of the corpora, consisting of news texts (1/3 of the PENN Treebank)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-133",
"text": "Since none of the state-of-the-art coreference systems predict singleton mentions, a version of the CRAC dataset with singleton mentions excluded was created for the coreference task evaluation."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-135",
"text": "**EVALUATION METRIC**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-136",
"text": "For our experiments on the mention detection, we report recall, precision and F1 scores for mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-137",
"text": "For our evaluation that involves the coreference system, we use the official CONLL 2012 scoring script to score our predictions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-138",
"text": "Following standard practice, we report recall, precision, and F1 scores for MUC, B 3 and CEAF \u03c64 and the average F1 score of those three metrics."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-140",
"text": "**BASELINE SYSTEM**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-141",
"text": "For the mention detection evaluation we use the Lee et al. (2018) system as baseline."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-142",
"text": "The baseline is trained end-toend on the coreference task and we use as baseline the mentions predicted by the system before carrying out coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-143",
"text": "For the coreference evaluation we use the state-of-the-art Lee et al. (2018) system as our baseline for the end-to-end system, and the Clark and Manning (2016a) system as our baseline for the pipeline system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-144",
"text": "During the evaluation, we slightly modified the Lee et al. (2018) system to allow the system to take the mentions predicted by our model instead of its internal mention detector."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-145",
"text": "Other than that we keep the system unchanged."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-147",
"text": "**HYPERPARAMETERS**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-148",
"text": "For our first model (LEE MD) we use the default settings of Lee et al. (2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-149",
"text": "For word embeddings the system uses 300-dimensional GloVe embeddings (Pennington et al., 2014) and 1024-dimensional ELMo embeddings (Peters et al., 2018)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-150",
"text": "The character-based embeddings are produced by a convolution neural network (CNN) which has a window sizes of 3, 4, and 5 characters (each has 50 filters)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-151",
"text": "The characters embeddings (8-dimensional) are randomly initialised and learned during the training."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-152",
"text": "The maximum span width is set to 30 tokens."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-153",
"text": "For our BIAFFINE MD model, we use the same LSTM settings and the hidden size of the FFNN as our first approach."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-154",
"text": "For word embeddings, we only use the ELMo embeddings (Peters et al., 2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-155",
"text": "For our third model (BERT MD), we fine-tune on the pretrained BERT BERT that consists of 12 layers of transformers."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-156",
"text": "The transformers use 768-dimensional hidden states and 12 self-attention heads."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-157",
"text": "The WordPiece embeddings (Wu et al., 2016 ) have a vocabulary of 30,000 tokens."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-158",
"text": "We use the same maximum span width as in our first approach (30 tokens)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-159",
"text": "The detailed neural network settings can be found in Table 1 ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-161",
"text": "**RESULTS AND DISCUSSIONS**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-162",
"text": "In this section, we first evaluate the proposed models in isolation on the mention detection task."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-163",
"text": "After that, we integrate the mentions predicted by our system into coreference resolution systems to evaluate the effects of our MD systems on the downstream applications."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-165",
"text": "**MENTION DETECTION TASK**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-166",
"text": "Evaluation on the CONLL data set."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-167",
"text": "For mention detection on the CONLL data set, we first take the best model from Lee et al. (2018) and use its default mention/token ratio (\u03bb = 0.4) to output predicted mentions before coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-168",
"text": "We use this as our baseline for the HIGH RE-CALL setting."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-169",
"text": "We then evaluate all three proposed models with the same \u03bb as that of the baseline."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-170",
"text": "As a result, the number of mentions predicted by different systems is the same, which means mention precision will be similar."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-171",
"text": "Thus, for the HIGH RECALL setting we compare the systems by mention recall."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-172",
"text": "As we can see from Table 2 , the baseline system already achieved a reasonably good recall of 96.6%. But even when compared with such a strong baseline, by simply separately training the mention detection part of the baseline system, the stand-alone LEE MD achieved an improvement of 0.7 p.p."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-173",
"text": "This indicates that mention detection task does not benefit from joint mention detection and coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-174",
"text": "The BERT MD achieved the same recall as the LEE MD, but BERT MD uses a much deeper network and is more expensive to train."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-175",
"text": "By contrast, the BIAFFINE MD uses the simplest network architecture among the three approaches, yet achieved the best results, outperforming the baseline by 0.9 p.p."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-176",
"text": "(26.5% error reduction)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-177",
"text": "Evaluation on the CRAC data set 3 For the CRAC data set, we train the Lee et al. (2018) system end-to-end on the reduced corpus with singleton mentions removed and extract mentions from the system by set \u03bb = 0.4."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-178",
"text": "We then train our models with the same \u03bb but on the full corpus, since our mention detectors naturally support both mention 3 As the Lee et al. (2018) system does not predict singleton mentions, the results on CRAC data set in Table 2 are evaluated without singleton mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-179",
"text": "While the results reported in Table 3 are evaluated with singleton mentions included."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-180",
"text": "88.0 89.7 89.1 Table 3 : Comparison between our BIAFFINE MD and the top performing systems on the mention detection task using the CONLL and CRAC data sets."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-181",
"text": "types (singleton and non-singleton mentions)."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-182",
"text": "Again, the baseline system has a decent recall of 95.4%."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-183",
"text": "Benefiting from the singletons, our LEE MD and BIAFFINE MD models achieved larger improvements when compared with the gains achieved on the CONLL data set."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-184",
"text": "The largest improvement (1.8 p.p.) is achieved by our BIAFFINE MD model with an error reduction rate of 39.1%."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-185",
"text": "BERT MD achieved a relatively smaller gain (0.8 p.p.) when compared with the other models; this might as a result of the difference in corpus size between CRAC and CONLL data set."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-186",
"text": "(The CRAC corpus is smaller than the CONLL data set.) Comparison with the State-of-the-art."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-187",
"text": "We compare our best system BIAFFINE MD with the rule-based mention detector of the Stanford deterministic system (Lee et al., 2013) and the neural mention detector of Poesio et al. (2018) ."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-188",
"text": "For HIGH F1 setting we use the common threshold (\u03b2 = 0.5) for binary classification problems without tuning."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-189",
"text": "For evaluation on CONLL we create in addition a variant of the HIGH RECALL setting (BALANCE) by setting \u03bb = 0.2; this is because we noticed that the score differences between the HIGH RECALL and HIGH F1 settings are relatively large (see Table 3 )."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-190",
"text": "The score differences between our two settings on CRAC data set are smaller; this might because the CRAC data set annotated both singleton and non-singleton mentions, hence the models are trained in a more balanced way."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-191",
"text": "Overall, when compared with the best-reported system (Poesio et al., 2018) Table 4 : Comparison between the baselines and the models enhanced by our BIAFFINE MD on the coreference resolution task."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-192",
"text": "tings outperforms their system by large margin of 5.3% and 6.5% on CONLL and CRAC data sets respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-193",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-194",
"text": "**COREFERENCE RESOLUTION TASK**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-195",
"text": "We then integrate the mentions predicted by our best system into the coreference resolution system to evaluate the effects of our better mention detectors on the downstream application."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-196",
"text": "Evaluation with the end-to-end system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-197",
"text": "We first evaluate our BIAFFINE MD in combination with the end-to-end Lee et al. (2018) system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-198",
"text": "We slightly modified the system to feed the system mentions predicted by our mention detector."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-199",
"text": "As a result, the original mention selection function is switched off, we keep all the other settings (include the mention scoring function) unchanged."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-200",
"text": "We then train the modified system to obtain a new model."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-201",
"text": "As illustrated in Table 4 , the model trained using mentions supplied by our BIAFFINE MD achieved a F1 score slightly lower than the original end-to-end system, nevertheless our mention detector has a better performance."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-202",
"text": "We think the performance drop might be the result of two factors."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-203",
"text": "First, by replacing the original mention selection function, the system actually becomes a pipeline system, thus cannot benefit from joint learning."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-204",
"text": "Second, the performance difference between our mention detector and the original mention selection function might not be large enough to deliver improvements on the final coreference results."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-205",
"text": "To test our hypotheses, we evaluated our BIAFFINE MD with two additional experiments."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-206",
"text": "In the first experiment, we enabled the original mention selection function and fed the system slightly more mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-207",
"text": "More precisely, we configured our BIAFFINE MD to output 0.5 mention per token instead of 0.4 i.e. \u03bb = 0.5."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-208",
"text": "As a result, the coreference system has the freedom to select its own mentions from a candidate pool supplied by our BI-AFFINE MD."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-209",
"text": "After training the system with the new setting, we get an average F1 of 72.6% (see table 4), which narrows the performance gap between the end-to-end system and the model trained without the joint learning."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-210",
"text": "This confirms our first hypothesis that by downgrading the system to a pipeline setting does harm the overall performance of the coreference resolution."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-211",
"text": "For our second experiment, we used the Lee et al. (2017) instead."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-212",
"text": "The Lee et al. (2018) system is an extended version of the Lee et al. (2017) system, hence they share most of the network architecture."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-213",
"text": "The Lee et al. (2017) has a lower performance on mention detection (93.5% recall when \u03bb = 0.4), which creates a large (4%) difference when compared with the recall of our BIAFFINE MD."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-214",
"text": "We train the system without the joint learning, and the newly trained model achieved an average F1 of 67.7% and this is 0.5 better than the original end-to-end Lee et al. (2017) system (see table 4 )."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-215",
"text": "This confirms our second hypothesis that a larger gain on mention recall is needed in order to show improvement on the overall system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-216",
"text": "We further evaluated the Lee et al. (2018) system on the CRAC data set."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-217",
"text": "We first train the original Lee et al. (2018) on the reduced version (with singletons removed) of the CRAC data set to create a baseline."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-218",
"text": "As we can see from Table 4, the baseline system has an average F1 score of 68.4%."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-219",
"text": "We then evaluate the system with mentions predicted by our BIAFFINE MD, we experiment with both joint learning disabled and enabled."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-220",
"text": "As shown in Table 4 , the model without joint learning achieved an overall score 0.1% lower than the baseline, but the new model has clearly a better recall on all three metrics when compared with the baseline."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-221",
"text": "The model trained with joint learning enabled achieved an average F1 of 69.1% which is 0.7% better than the baseline."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-222",
"text": "Evaluation on the pipeline system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-223",
"text": "We then evaluated our best model (BIAFFINE MD) with a pipeline system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-224",
"text": "We use the best-reported pipeline system by Clark and Manning (2016a) as our baseline."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-225",
"text": "The original system used the rule-based mention detector from the Stanford deterministic coreference system (Lee et al., 2013 ) (a performance comparison between the Lee et al. (2013) EMD and our BIAFFINE MD can be found in Table 3 )."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-226",
"text": "We modified the preprocessing pipeline of the system to use mentions predicted by our BIAFFINE MD."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-227",
"text": "We ran the system with both mentions from the HIGH RECALL and BALANCE settings, as both settings have reasonable good mention recall which is required to train a coreference system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-228",
"text": "After training the system with mentions from our BIAFFINE MD, the newly obtained models achieved large improvements of 0.8% and 1.7% for HIGH RECALL and BALANCE settings respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-229",
"text": "This suggests that the Clark and Manning (2016a) system works better on a smaller number of high-quality mentions than a larger number but lower quality mentions."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-230",
"text": "We also noticed that the speed of the Clark and Manning (2016a) system is sensitive to the size of the predicted mentions, both training and testing finished much faster when tested on the BALANCE setting."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-231",
"text": "We did not test the Clark and Manning (2016a) system on the CRAC data set, as a lot of effects are needed to fulfil the requirements of the preprocessing pipeline, e.g. predicted parse trees, named entity tags."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-232",
"text": "Overall our BIAFFINE MD showed its merit on enhancing the pipeline system."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-233",
"text": "----------------------------------"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-234",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-235",
"text": "In this work, we compare three neural network based approaches for mention detection."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-236",
"text": "The first model is a modified version of the mention detection part of the state-ofthe-art coreference resolution system )."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-237",
"text": "The second model used ELMo embeddings together with a bidirectional LSTM, and with a biaffine classifier on top."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-238",
"text": "The third model adapted the BERT model that based on the deep transformers and followed by a FFNN."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-239",
"text": "We assessed the performance of our models in both mention detection and coreference tasks."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-240",
"text": "In the evaluation of mention detection, our proposed models reduced up to 26% and 39% of the recall error when compared with the strong baseline on CONLL and CRAC data sets in a HIGH RECALL setting."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-241",
"text": "The same model (BIAFFINE MD) outperforms the best performing system on the CONLL and CRAC by large 5-6% in a HIGH F1 setting."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-242",
"text": "In term of the evaluation on coreference resolution task, by integrating our mention detector with the state-of-the-art coreference systems, we improved the end-to-end and pipeline systems by up to 0.7% and 1.7% respectively."
},
{
"sent_id": "6c4264bedb6683e909c1e530f22262-C001-243",
"text": "Overall, we introduced three neural mention detectors and showed that the improvements achieved on the mention detection task can be transferred to the downstream coreference resolution task."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"6c4264bedb6683e909c1e530f22262-C001-21"
],
[
"6c4264bedb6683e909c1e530f22262-C001-41"
],
[
"6c4264bedb6683e909c1e530f22262-C001-212"
]
],
"cite_sentences": [
"6c4264bedb6683e909c1e530f22262-C001-21",
"6c4264bedb6683e909c1e530f22262-C001-41",
"6c4264bedb6683e909c1e530f22262-C001-212"
]
},
"@USE@": {
"gold_contexts": [
[
"6c4264bedb6683e909c1e530f22262-C001-25",
"6c4264bedb6683e909c1e530f22262-C001-26"
],
[
"6c4264bedb6683e909c1e530f22262-C001-69"
],
[
"6c4264bedb6683e909c1e530f22262-C001-72"
],
[
"6c4264bedb6683e909c1e530f22262-C001-141"
],
[
"6c4264bedb6683e909c1e530f22262-C001-143"
],
[
"6c4264bedb6683e909c1e530f22262-C001-148"
],
[
"6c4264bedb6683e909c1e530f22262-C001-167"
],
[
"6c4264bedb6683e909c1e530f22262-C001-177"
],
[
"6c4264bedb6683e909c1e530f22262-C001-197"
],
[
"6c4264bedb6683e909c1e530f22262-C001-211",
"6c4264bedb6683e909c1e530f22262-C001-212"
],
[
"6c4264bedb6683e909c1e530f22262-C001-216",
"6c4264bedb6683e909c1e530f22262-C001-217"
]
],
"cite_sentences": [
"6c4264bedb6683e909c1e530f22262-C001-26",
"6c4264bedb6683e909c1e530f22262-C001-69",
"6c4264bedb6683e909c1e530f22262-C001-72",
"6c4264bedb6683e909c1e530f22262-C001-141",
"6c4264bedb6683e909c1e530f22262-C001-143",
"6c4264bedb6683e909c1e530f22262-C001-148",
"6c4264bedb6683e909c1e530f22262-C001-167",
"6c4264bedb6683e909c1e530f22262-C001-177",
"6c4264bedb6683e909c1e530f22262-C001-197",
"6c4264bedb6683e909c1e530f22262-C001-212",
"6c4264bedb6683e909c1e530f22262-C001-216",
"6c4264bedb6683e909c1e530f22262-C001-217"
]
},
"@EXT@": {
"gold_contexts": [
[
"6c4264bedb6683e909c1e530f22262-C001-32"
],
[
"6c4264bedb6683e909c1e530f22262-C001-144"
]
],
"cite_sentences": [
"6c4264bedb6683e909c1e530f22262-C001-32",
"6c4264bedb6683e909c1e530f22262-C001-144"
]
},
"@DIF@": {
"gold_contexts": [
[
"6c4264bedb6683e909c1e530f22262-C001-178"
],
[
"6c4264bedb6683e909c1e530f22262-C001-197",
"6c4264bedb6683e909c1e530f22262-C001-198",
"6c4264bedb6683e909c1e530f22262-C001-199"
]
],
"cite_sentences": [
"6c4264bedb6683e909c1e530f22262-C001-178",
"6c4264bedb6683e909c1e530f22262-C001-197"
]
}
}
},
"ABC_af39041414dec545df878404328aab_3": {
"x": [
{
"sent_id": "af39041414dec545df878404328aab-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-2",
"text": "Until recently, the application of discriminative training to log linear-based statistical machine translation has been limited to tuning the weights of a limited number of features or training features with a limited number of parameters."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-3",
"text": "In this paper, we propose to scale up discriminative training of (He and Deng, 2012) to train features with 150 million parameters, which is one order of magnitude higher than previously published effort, and to apply discriminative training to redistribute probability mass that is lost due to model pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-4",
"text": "The experimental results confirm the effectiveness of our proposals on NIST MT06 set over a strong baseline."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-7",
"text": "State-of-the-art statistical machine translation systems based on a log-linear framework are parameterized by {\u03bb, \u03a6}, where the feature weights \u03bb are discriminatively trained (Och and Ney, 2002; Chiang et al., 2008b; Simianer et al., 2012) by directly optimizing them against a translation-oriented metric such as BLEU."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-8",
"text": "The feature parameters \u03a6 can be roughly divided into two categories: dense feature that measures the plausibility of each translation rule from a particular aspect, e.g., the rule translation probabilities p(f |e) and p(e|f ); and sparse feature that fires when certain phenomena is observed, e.g., when a frequent word pair co-occured in a rule."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-9",
"text": "In contrast to \u03bb, feature parameters in \u03a6 are usually modeled by generative models for dense features, or by indicator functions for sparse ones."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-10",
"text": "It is therefore desirable to train the dense features for each rule in a discriminative fashion to maximize some translation criterion."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-11",
"text": "The maximum expected BLEU training of (He and Deng, 2012 ) is a recent effort towards this direction, and in this paper, we extend their work to a scaled-up task of discriminative training of the features of a strong hierarchical phrase-based model and confirm its effectiveness empirically."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-12",
"text": "In this work, we further consider the application of discriminative training to pruned model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-13",
"text": "Various pruning techniques (Johnson et al., 2007; Zens et al., 2012; Eck et al., 2007; Lee et al., 2012; Tomeh et al., 2011) have been proposed recently to filter translation rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-14",
"text": "One common consequence of pruning is that the probability distribution of many surviving rules become deficient, i.e. f p(f |e) < 1."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-15",
"text": "In practice, others have chosen either to leave the pruned rules as it-is, or simply to re-normalize the probability mass by distributing the pruned mass to surviving rules proportionally."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-16",
"text": "We argue that both approaches are suboptimal, and propose a more principled method to re-distribute the probability mass, i.e. using discriminative training with some translation criterion."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-17",
"text": "Our experimental results demonstrate that at various pruning levels, our approach improves performance consistently."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-18",
"text": "Particularly at the level of 50% of rules being pruned, the discriminatively trained models performs better than the unpruned baseline grammar."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-19",
"text": "This shows that discriminative training makes it possible to achieve smaller models that perform comparably or even better than the baseline model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-20",
"text": "Our contributions in this paper are two-folded: First of all, we scale up the maximum expected BLEU training proposed in (He and Deng, 2012) in a number of ways including using 1) a hierarchical phrase-based model, 2) a richer feature set, and 3) a larger training set with a much larger parameter set, resulting in more than 150 million parameters in the model being updated, which is one order magnitude higher than the phrase-based model reported in (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-21",
"text": "We are able to show a reasonable improvement over this strong baseline."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-22",
"text": "Secondly, we combine discriminative training with pruning techniques to reestimate parameters of pruned grammar."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-23",
"text": "Our approach is shown to alleviate the loss due to pruning, and sometimes can even outperform the baseline unpruned grammar."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-24",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-25",
"text": "**DISCRIMINATIVE TRAINING OF \u03a6**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-26",
"text": "Given the entire training data {F n , E n } N n=1 , and current parameterization {\u03bb, \u03a6}, we decode the source side of training data F n to produce hypothesis {\u00ca n } N n=1 ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-27",
"text": "Our goal is to update \u03a6 towards \u03a6 that maximizes the expected BLEU scores of the entire training data given the current \u03bb:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-28",
"text": "where B(\u00ca 1 ...\u00ca N ) is the BLEU score of the concatenated hypothesis of the entire training data, following (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-29",
"text": "Eq. 1 summarizes over all possible combinations of\u00ca 1 ...\u00ca N , which is intractable."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-30",
"text": "Hence we make two simplifying approximations as follows."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-31",
"text": "First, let the k-best hypotheses of the n-th sentence,\u00ca n = \u00ca 1 n , ...,\u00ca K n , approximate all its possible translation."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-32",
"text": "In other words, we assume that K k=1P (\u00ca k n |F n ) = 1, \u2200n."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-33",
"text": "Second, let the sum of sentence-level BLEU approximate the corpus BLEU."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-34",
"text": "We note that corpus BLEU is not strictly decomposable (Chiang et al., 2008a) , however, as the training data's size N gets big as in our case, we expect them to become more positively correlated."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-35",
"text": "Under these assumptions and the fact that each sentence is decoded independently, Eq. 1 can be algebraically simplified into:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-36",
"text": "where"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-37",
"text": "We detail the process in the Appendix."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-38",
"text": "To further simplify the problem and relate it with model pruning, we consider to update a subset of \u03b8 \u2282 \u03a6 while keeping other parameterization of \u03a6 unchanged, where \u03b8 = {\u03b8 ij = p(e j |f i )} denotes our parameter set that satisfies j \u03b8 ij = 1 and \u03b8 ij \u2265 0."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-39",
"text": "In experiments, we also consider {\u03b8 ji = p(f i |e j )}."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-40",
"text": "To alleviate overfitting, we introduce KL-distance based reguralization as in (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-41",
"text": "We thus arrive at the following objective function:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-42",
"text": "where \u03c4 controls the regularization term's contribution, and \u03b8 0 represents a prior parameter set, e.g., from the conventional maximum likelihood training."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-43",
"text": "The optimization algorithm is based on the Extended Baum Welch (EBW) (Gopalakrishnan et al., 1991) as derived by (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-44",
"text": "The final update rule is as follow:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-45",
"text": "where \u03b8 ij is the updated parameter,"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-46",
"text": "and \u03bb is the current feature's weight."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-48",
"text": "**DT IS BENEFICIAL FOR PRUNING**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-49",
"text": "Pruning is often a key part in deploying large-scale SMT systems for many reasons, such as for reducing runtime memory footprint and for efficiency."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-50",
"text": "Many pruning techniques have been proposed to assess translation rules and filter rules out if they are less plausible than others."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-51",
"text": "While different pruning techniques may use different criterion, they all assume that pruning does not affect the feature function values of the surviving rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-52",
"text": "This assumption may be suboptimal for some feature functions that have probabilistic sense since pruning will remove a portion of the probability mass that is previously assigned to the pruned rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-53",
"text": "To be concrete, for the rule translation probabilities \u03b8 ij under consideration, the constraint j \u03b8 ij = 1 will not hold for all source rules i after pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-54",
"text": "Previous works typically left the probability mass as it-is, or simply renormalize the pruned mass, i.e.\u03b8 ij = \u03b8 ij / j \u03b8 ij ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-55",
"text": "We argue that applying the DT techniques to a pruned grammar, as described in Sec. 2, provides a more principled method to redistribute the mass, i.e. by quantizing how each rule contributes to the expected BLEU score in comparison to other competing rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-56",
"text": "To empirically verify this, we consider the significance test based pruning (Johnson et al., 2007) , though our general idea can be appllied to any pruning techniques."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-57",
"text": "For our experiments, we use the significance pruning tool that is available as part of Moses decoder package (Koehn et al., 2007) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-59",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-60",
"text": "Our experiments are designed to serve two goals: 1) to show the performance of discriminative training of feature parameters \u03b8 in a large-scale task; and 2) to show the effectiveness of DT when applied to pruned grammar."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-61",
"text": "Our baseline system is a state-of-the-art hierarchical phrase-based system as described in (Zhou et al., 2008) , trained on six million parallel sentences corpora that are available to the DARPA BOLT Chinese-English task."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-62",
"text": "The training corpora includes a mixed genre of news wire, broadcast news, web-blog and comes from various sources such as LDC, HK Hansard and UN data."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-63",
"text": "In total, there are 50 dense features in our translation system."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-64",
"text": "In addition to the standard features which include the rule translation probabilities, we incorporate features that are found useful for developing a state-of-the-art baseline, e.g. provenancebased lexical features (Chiang et al., 2011) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-65",
"text": "We use a large 6-gram language model, which we train on a 10 billion words monolingual corpus, including the English side of our parallel corpora plus other corpora such as Gigaword (LDC2011T07) and Google News."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-66",
"text": "To prevent possible over-fitting, we only kept the rules that have at most three terminal words (plus up to two nonterminals) on the source side, resulting in a grammar with 167 million rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-67",
"text": "Our discriminative training procedure includes updating both \u03bb and \u03b8, and we follow (He and Deng, 2012) to optimize them in an alternate manner."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-68",
"text": "That is, when we optimize \u03b8 via EBW, we keep \u03bb fixed and when we optimize \u03bb, we keep \u03bb fixed."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-69",
"text": "We use PRO (Hopkins and May, 2011) to tune \u03bb."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-70",
"text": "For discriminative training of \u03b8, we use a subset of 550 thousands of parallel sentences selected from the entire training data, mainly to allow for faster experimental cycle; they mainly come from news and web-blog domains."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-71",
"text": "For each sentence of this subset, we generate 500-best of unique hypotheses using the baseline model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-72",
"text": "The 1-best and the oracle BLEU scores for this subset are 40.19 and 47.06 respectively."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-73",
"text": "Following (He and Deng, 2012) , we focus on discriminative training of p(f |e) and p(e|f ), which in practice affects around 150 million of parameters; hence the title."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-74",
"text": "For the tuning and development sets, we set aside 1275 and 1239 sentences respectively from LDC2010E30 corpus."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-75",
"text": "The tune set is used by PRO for tuning \u03bb while the dev set is used to decide the best DT model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-76",
"text": "As for the blind test set, we report the performance on the NIST MT06 evaluation set, which consists of 1644 sentences from news and web-blog domains."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-77",
"text": "Our baseline system's performance on MT06 is 39.91 which is among the best number ever published so far in the community."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-78",
"text": "Table 1 compares the key components of our baseline system with that of (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-79",
"text": "As shown, we are working with a stronger system than (He and Deng, 2012) , especially in terms of the number of parameters under consideration |\u03b8|."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-80",
"text": "He&Deng (2012)"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-82",
"text": "**DT OF 150 MILLION PARAMETERS**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-83",
"text": "To ensure the correctness of our implementation, we show in Fig 2, the first five EBW updates with \u03c4 = 0.10."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-84",
"text": "As shown, the utility function log(U (\u03b8)) increases monotonically but is countered by the KL term, resulting in a smaller but consistent increase of the objective function O(\u03b8)."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-85",
"text": "This monotonicallyincreasing trend of the objective function confirms the correctness of our implementation since EBW algorithm is a bound-based technique that ensures growth transformations between updates."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-86",
"text": "We then explore the optimal setting for \u03c4 which controls the contribution of the regularization term."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-87",
"text": "Specifically, we perform grid search, exploring values of \u03c4 from 0.1 to 0.75."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-88",
"text": "For each \u03c4 , we run several iterations of discriminative training where each iteration involves one simultaneous update of p(f |e) and p(e|f ) according to Eq. 4, followed by one update of \u03bb via PRO (as in (He and Deng, 2012) )."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-89",
"text": "In total, we run 10 such iterations for each \u03c4 ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-90",
"text": "Across different \u03c4 , we find that the first iteration provides most of the gain while the subsequent iterations provide additional, smaller gain with occassional performance degradation; thus the translation performance is not always monotonically increasing over iteration."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-91",
"text": "We report the best score of each \u03c4 in Fig. 1 and at which iteration that score is produced."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-92",
"text": "As shown in Fig. 1 , all settings of \u03c4 improve over the baseline and \u03c4 = 0.10 gives the highest gain of 0.45 BLEU score."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-93",
"text": "This improvement is in the same ballpark as in (He and Deng, 2012 ) though on a scaledup task."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-94",
"text": "We next decode the MT06 using the best model (i.e. \u03c4 = 0.10 at 6-th iteration) observed on the dev set, and obtained 40.33 BLEU with an improvement of around 0.4 BLEU point."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-95",
"text": "We see this result as confirming the effectiveness of discriminative training but on a larger-scale task, adding to what was reported by (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-97",
"text": "**DT FOR SIGNIFICANCE PRUNING**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-98",
"text": "Next, we show the contribution of discriminative training for model pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-99",
"text": "To do so, we prune the translation grammar so that its size becomes 50%, 25%, 10% of the original grammar."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-100",
"text": "Respectively, we delete rules whose significance value below 15, 50 and 500."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-101",
"text": "Table 2 compares the statistics of the pruned grammars and the unpruned one."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-102",
"text": "In particular, columns 4 and 5 show the total averaged probability mass of the remaining rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-103",
"text": "This statistics provides some indication of how deficient the fea- (O(\u03b8 ) ), the regularization term (KL(\u03b8 )) and the unregularized objective function (log(U (\u03b8 ))) for five EBW updates of updating p(e j |f i ) tures are after pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-104",
"text": "As shown, the total averaged probability mass after pruning is below 100% and even lower for the more aggressive pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-105",
"text": "To show that the deficiency is suboptimal, we considers two baseline systems: models with/without mass renormalization."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-106",
"text": "We tune a new \u03bb for each model and use the new \u03bb to decode the dev and test sets."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-107",
"text": "The results are shown in columns 6 and 9 of Table 2 where we show the results for the unnormalized model in the brackets following the results for the re-normalized model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-108",
"text": "The results show that pruning degrades the performances and that naively re-normalizing the model provides no significant changes in performance."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-109",
"text": "Subsequently, we will focus on the normalized models as the baseline as they represents the starting points of our EBW iteration."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-110",
"text": "Next, we run discriminative training that would reassign the probability mass to the surviving rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-111",
"text": "First, we normalize p(f |e) and p(e|f ), so that they satisfy the sum to one constraint required by the algorithm."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-112",
"text": "Then, we run discriminative training on these pruned grammars using \u03c4 = 0.10 (i.e. the setting that gives the best performance for the unpruned grammar as discussed in Section 4.1)."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-113",
"text": "We report the results in columns 7 and 9 for the dev and test sets respectively, as well as the gain over the baseline system in columns 8 and 10."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-114",
"text": "As shown in Table 2 , DT provides a nice improvement over the baseline model of no mass reassignment."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-115",
"text": "For all pruning levels, DT can compensate the loss associated with pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-116",
"text": "In particular, at 50% level of pruning, there is a loss about 0.4"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-117",
"text": "baseline ( Table 2 : The statistics of grammars pruned at various level (column 1), including the number of unique source and target phrases (columns 2 & 3), total probability mass of the remaining rules for p(f |e) and p(e|f ) (columns 4 & 5), the performance of the pruned model before and after discriminative training as well as the gain on the dev and the test sets (columns 6 to 11)."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-118",
"text": "The iteration at which DT gives the best dev set is indicated by the number enclosed by bracket in column 7."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-119",
"text": "The baseline performance is in italics, followed by a number in the bracket which refers to the performance of using unnormalized model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-120",
"text": "The above-the-baseline performances are in bold."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-121",
"text": "BLEU point after pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-122",
"text": "With the DT on pruned model, all pruning losses are reclaimed and the new pruned model is even better than the unpruned original model."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-123",
"text": "This empirical result shows that leaving probability mass unassigned after pruning is suboptimal and that discriminative training provides a principled way to redistribute the mass."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-125",
"text": "**CONCLUSION**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-126",
"text": "In this paper, we first extend the maximum expected BLEU training of (He and Deng, 2012) to train two features of a state-of-the-art hierarchical phrasebased system, namely: p(f |e) and p(e|f )."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-127",
"text": "Compared to (He and Deng, 2012) , we apply the algorithm to a strong baseline that is trained on a bigger parallel corpora and comes with a richer feature set."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-128",
"text": "The number of parameters under consideration amounts to 150 million."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-129",
"text": "Our experiments show that discriminative training these two features (out of 50) gives around 0.40 BLEU point improvement, which is consistent with the conclusion of (He and Deng, 2012) but in a much larger-scale system."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-130",
"text": "Furthermore, we apply the algorithm to redistribute the probability mass of p(f |e) and p(e|f ) that is commonly lost due to conventional model pruning."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-131",
"text": "Previous techniques either leave the probability mass as it is or distribute it proportionally among the surviving rules."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-132",
"text": "We show that our proposal of using discriminative training to redistribute the mass empirically performs better, demonstrating the effectiveness of our proposal."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-134",
"text": "**APPENDIX**"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-135",
"text": "We describe the process to simplify Eq. 1 to Eq. 2, which is omitted in (He and Deng, 2012) ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-136",
"text": "For conciseness, we drop the conditions and write P (\u00ca i |F i ) as P (\u00ca i )."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-137",
"text": "We write Eq. 1 again below as Eq. 5 ."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-138",
"text": "We first focus on the first sentence E 1 /F 1 and expand the related terms from the equation as follow:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-139",
"text": "Expanding the inner summation, we arrive at:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-140",
"text": "Due to the that K k=1P (\u00ca K n |F n ) = 1, we can equate \u2200\u00ca 2 ...\u00ca N N i=2 P (\u00ca i ) and \u2200\u00ca 1 P (\u00ca 1 ) to 1."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-141",
"text": "Thus, we arrive at:"
},
{
"sent_id": "af39041414dec545df878404328aab-C001-142",
"text": "Notice that the second term has the same form as Eq. 5 except that the starting index starts from the second sentence."
},
{
"sent_id": "af39041414dec545df878404328aab-C001-143",
"text": "The same process can be performed and at the end, thus we can arrive at Eq. 2."
}
],
"y": {
"@EXT@": {
"gold_contexts": [
[
"af39041414dec545df878404328aab-C001-3"
],
[
"af39041414dec545df878404328aab-C001-11"
],
[
"af39041414dec545df878404328aab-C001-20"
],
[
"af39041414dec545df878404328aab-C001-95"
],
[
"af39041414dec545df878404328aab-C001-126"
]
],
"cite_sentences": [
"af39041414dec545df878404328aab-C001-3",
"af39041414dec545df878404328aab-C001-11",
"af39041414dec545df878404328aab-C001-20",
"af39041414dec545df878404328aab-C001-95",
"af39041414dec545df878404328aab-C001-126"
]
},
"@BACK@": {
"gold_contexts": [
[
"af39041414dec545df878404328aab-C001-11"
]
],
"cite_sentences": [
"af39041414dec545df878404328aab-C001-11"
]
},
"@DIF@": {
"gold_contexts": [
[
"af39041414dec545df878404328aab-C001-20"
],
[
"af39041414dec545df878404328aab-C001-79"
],
[
"af39041414dec545df878404328aab-C001-135"
]
],
"cite_sentences": [
"af39041414dec545df878404328aab-C001-20",
"af39041414dec545df878404328aab-C001-79",
"af39041414dec545df878404328aab-C001-135"
]
},
"@USE@": {
"gold_contexts": [
[
"af39041414dec545df878404328aab-C001-28"
],
[
"af39041414dec545df878404328aab-C001-40"
],
[
"af39041414dec545df878404328aab-C001-43"
],
[
"af39041414dec545df878404328aab-C001-73"
],
[
"af39041414dec545df878404328aab-C001-78"
],
[
"af39041414dec545df878404328aab-C001-88"
],
[
"af39041414dec545df878404328aab-C001-127"
]
],
"cite_sentences": [
"af39041414dec545df878404328aab-C001-28",
"af39041414dec545df878404328aab-C001-40",
"af39041414dec545df878404328aab-C001-43",
"af39041414dec545df878404328aab-C001-73",
"af39041414dec545df878404328aab-C001-78",
"af39041414dec545df878404328aab-C001-88",
"af39041414dec545df878404328aab-C001-127"
]
},
"@SIM@": {
"gold_contexts": [
[
"af39041414dec545df878404328aab-C001-93"
],
[
"af39041414dec545df878404328aab-C001-129"
]
],
"cite_sentences": [
"af39041414dec545df878404328aab-C001-93",
"af39041414dec545df878404328aab-C001-129"
]
}
}
},
"ABC_c384f48d5f04ea8d63bbbb94a3b24b_3": {
"x": [
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-2",
"text": "We present a graph-based semi-supervised learning algorithm to address the sentiment analysis task of rating inference."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-3",
"text": "Given a set of documents (e.g., movie reviews) and accompanying ratings (e.g., \"4 stars\"), the task calls for inferring numerical ratings for unlabeled documents based on the perceived sentiment expressed by their text."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-4",
"text": "In particular, we are interested in the situation where labeled data is scarce."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-5",
"text": "We place this task in the semi-supervised setting and demonstrate that considering unlabeled reviews in the learning process can improve ratinginference performance."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-6",
"text": "We do so by creating a graph on both labeled and unlabeled data to encode certain assumptions for this task."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-7",
"text": "We then solve an optimization problem to obtain a smooth rating function over the whole graph."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-8",
"text": "When only limited labeled data is available, this method achieves significantly better predictive accuracy over other methods that ignore the unlabeled examples during training."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-11",
"text": "Sentiment analysis of text documents has received considerable attention recently (Shanahan et al., 2005; Turney, 2002; Dave et al., 2003; Hu and Liu, 2004; Chaovalit and Zhou, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-12",
"text": "Unlike traditional text categorization based on topics, sentiment analysis attempts to identify the subjective sentiment expressed (or implied) in documents, such as consumer product or movie reviews."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-13",
"text": "In particular Pang and Lee proposed the rating-inference problem (2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-14",
"text": "Rating inference is harder than binary positive / negative opinion classification."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-15",
"text": "The goal is to infer a numerical rating from reviews, for example the number of \"stars\" that a critic gave to a movie."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-38",
"text": "y l \u2208 C. The remaining documents are unlabeled."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-39",
"text": "In our experiments, the unlabeled documents are also the test documents, a setting known as transduction."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-40",
"text": "The set of numerical ratings are C = {c 1 , . . . , c C }, with c 1 < . . ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-41",
"text": "< c C \u2208 R. For example, a one-star to four-star movie rating system has C = {0, 1, 2, 3}. We seek a function f : x \u2192 R that gives a continuous rating f (x) to a document x. Classification is done by mapping f (x) to the nearest discrete rating in C. Note this is ordinal classification, which differs from standard multi-class classification in that C is endowed with an order."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-42",
"text": "In the following we use 'review' and 'document,' 'rating' and 'label' interchangeably."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-43",
"text": "We make two assumptions:"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-44",
"text": "1. We are given a similarity measure w ij \u2265 0 between documents x i and x j ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-45",
"text": "w ij should be computable from features, so that we can measure similarities between any documents, including unlabeled ones."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-46",
"text": "A large w ij implies that the two documents tend to express the same sentiment (i.e., rating)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-47",
"text": "We experiment with positive-sentence percentage (PSP) based similarity which is proposed in (Pang and Lee, 2005) , and mutual-information modulated word-vector cosine similarity."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-48",
"text": "Details can be found in section 4."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-49",
"text": "2. Optionally, we are given numerical rating predictions\u0177 l+1 , . . . ,\u0177 n on the unlabeled documents from a separate learner, for instance -insensitive support vector regression (Joachims, 1999; Smola and Sch\u00f6lkopf, 2004) used by (Pang and Lee, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-50",
"text": "This acts as an extra knowledge source for our semisupervised learning framework to improve upon."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-51",
"text": "We note our framework is general and works without the separate learner, too."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-52",
"text": "(For this to work in practice, a reliable similarity measure is required.)"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-53",
"text": "We now describe our graph for the semisupervised rating-inference problem."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-54",
"text": "We do this piece by piece with reference to Figure 1 ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-55",
"text": "Our undirected graph G = (V, E) has 2n nodes V , and weighted edges E among some of the nodes."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-56",
"text": "\u2022 Each document is a node in the graph (open circles, e.g., x i and x j )."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-57",
"text": "The true ratings of these nodes f (x) are unobserved."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-58",
"text": "This is true even for the labeled documents because we allow for noisy labels."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-59",
"text": "Our goal is to infer f (x) for the unlabeled documents."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-60",
"text": "\u2022 Each labeled document (e.g., x j ) is connected to an observed node (dark circle) whose value is the given rating y j ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-61",
"text": "The observed node is a 'dongle' (Zhu et al., 2003) since it connects only to x j ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-62",
"text": "As we point out later, this serves to pull f (x j ) towards y j ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-63",
"text": "The edge weight between a labeled document and its dongle is a large number M ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-64",
"text": "M represents the influence of y j : if M \u2192 \u221e then f (x j ) = y j becomes a hard constraint."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-65",
"text": "\u2022 Similarly each unlabeled document (e.g., x i ) is also connected to an observed dongle node\u0177 i , whose value is the prediction of the separate learner."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-66",
"text": "Therefore we also require that f (x i ) is close to\u0177 i ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-67",
"text": "This is a way to incorporate multiple learners in general."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-68",
"text": "We set the weight between an unlabeled node and its dongle arbitrarily to 1 (the weights are scale-invariant otherwise)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-69",
"text": "As noted earlier, the separate learner is optional: we can remove it and still carry out graph-based semi-supervised learning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-70",
"text": "\u2022 Each unlabeled document x i is connected to kN N L (i), its k nearest labeled documents."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-71",
"text": "Distance is measured by the given similarity measure w. We want f (x i ) to be consistent with its similar labeled documents."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-72",
"text": "The weight between"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-73",
"text": "\u2022 Each unlabeled document is also connected to k N N U (i), its k nearest unlabeled documents (excluding itself)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-74",
"text": "The weight between x i and"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-75",
"text": "We also want f (x i ) to be consistent with its similar unlabeled neighbors."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-76",
"text": "We allow potentially different numbers of neighbors (k and k ), and different weight coefficients (a and b)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-77",
"text": "These parameters are set by cross validation in experiments."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-78",
"text": "The last two kinds of edges are the key to semisupervised learning: They connect unobserved nodes and force ratings to be smooth throughout the graph, as we discuss in the next section."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-80",
"text": "**GRAPH-BASED SEMI-SUPERVISED LEARNING**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-81",
"text": "With the graph defined, there are several algorithms one can use to carry out semi-supervised learning (Zhu et al., 2003; Delalleau et al., 2005; Joachims, 2003; Blum and Chawla, 2001; ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-82",
"text": "The basic idea is the same and is what we use in this paper."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-83",
"text": "That is, our rating function f (x) should be smooth with respect to the graph."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-84",
"text": "f (x) is not smooth if there is an edge with large weight w between nodes x i and x j , and the difference between f (x i ) and f (x j ) is large."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-85",
"text": "The (un)smoothness over the particular edge can be defined as w f ("
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-86",
"text": "Summing over all edges in the graph, we obtain the (un)smoothness L(f ) over the whole graph."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-87",
"text": "We call L(f ) the energy or loss, which should be minimized."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-88",
"text": "Let L = 1 . ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-89",
"text": ". l and U = l + 1 . . . n be labeled and unlabeled review indices, respectively."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-90",
"text": "With the graph in Figure 1 , the loss L(f ) can be written as"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-91",
"text": "A small loss implies that the rating of an unlabeled review is close to its labeled peers as well as its unlabeled peers."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-92",
"text": "This is how unlabeled data can participate in learning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-93",
"text": "The optimization problem is min f L(f )."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-94",
"text": "To understand the role of the parameters, we define \u03b1 = ak + bk and"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-95",
"text": "Thus \u03b2 controls the relative weight between labeled neighbors and unlabeled neighbors; \u03b1 is roughly the relative weight given to semi-supervised (nondongle) edges."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-96",
"text": "We can find the closed-form solution to the optimization problem."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-97",
"text": "Defining an n \u00d7 n matrixW ,"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-98",
"text": "Let W = max(W ,W ) be a symmetrized version of this matrix."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-99",
"text": "Let D be a diagonal degree matrix with"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-100",
"text": "Note that we define a node's degree to be the sum of its edge weights."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-101",
"text": "Let \u2206 = D \u2212 W be the combinatorial Laplacian matrix."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-102",
"text": "Let C be a diagonal dongle weight matrix with"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-103",
"text": "This is a quadratic function in f ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-104",
"text": "Setting the gradient to zero, \u2202L(f )/\u2202f = 0 , we find the minimum loss function"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-105",
"text": "Cy."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-106",
"text": "Because C has strictly positive eigenvalues, the inverse is well defined."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-107",
"text": "All our semi-supervised learning experiments use (7) in what follows."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-108",
"text": "Before moving on to experiments, we note an interesting connection to the supervised learning method in (Pang and Lee, 2005) , which formulates rating inference as a metric labeling problem (Kleinberg and Tardos, 2002) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-109",
"text": "Consider a special case of our loss function (1) when b = 0 and M \u2192 \u221e. It is easy to show for labeled nodes j \u2208 L, the optimal value is the given label: f (x j ) = y j ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-110",
"text": "Then the optimization problem decouples into a set of onedimensional problems, one for each unlabeled node"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-111",
"text": "The above problem is easy to solve."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-112",
"text": "It corresponds exactly to the supervised, non-transductive version of metric labeling, except we use squared difference while (Pang and Lee, 2005) used absolute difference."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-113",
"text": "Indeed in experiments comparing the two (not reported here), their differences are not statistically significant."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-114",
"text": "From this perspective, our semisupervised learning method is an extension with interacting terms among unlabeled data."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-116",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-117",
"text": "We performed experiments using the movie review documents and accompanying 4-class (C = {0, 1, 2, 3}) labels found in the \"scale dataset v1.0\" available at http://www.cs.cornell.edu/people/pabo/ movie-review-data/ and first used in (Pang and Lee, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-118",
"text": "We chose 4-class instead of 3-class labeling because it is harder."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-119",
"text": "The dataset is divided into four author-specific corpora, containing 1770, 902, 1307, and 1027 documents."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-120",
"text": "We ran experiments individually for each author."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-121",
"text": "Each document is represented as a {0, 1} word-presence vector, normalized to sum to 1."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-122",
"text": "We systematically vary labeled set size |L| \u2208 {0.9n, 800, 400, 200, 100, 50, 25, 12, 6} to observe the effect of semi-supervised learning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-123",
"text": "|L| = 0.9n is included to match 10-fold cross validation used by (Pang and Lee, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-124",
"text": "For each |L| we run 20 trials where we randomly split the corpus into labeled and test (unlabeled) sets."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-125",
"text": "We ensure that all four classes are represented in each labeled set."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-126",
"text": "The same random splits are used for all methods, allowing paired t-tests for statistical significance."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-127",
"text": "All reported results are average test set accuracy."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-128",
"text": "We compare our graph-based semi-supervised method with two previously studied methods: regression and metric labeling as in (Pang and Lee, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-130",
"text": "**REGRESSION**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-131",
"text": "We ran linear -insensitive support vector regression using Joachims' SVM light package (1999) with all default parameters."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-132",
"text": "The continuous prediction on a test document is discretized for classification."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-133",
"text": "Regression results are reported under the heading 'reg."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-134",
"text": "' Note this method does not use unlabeled data for training."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-136",
"text": "**METRIC LABELING**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-137",
"text": "We ran Pang and Lee's method based on metric labeling, using SVM regression as the initial label preference function."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-138",
"text": "The method requires an itemsimilarity function, which is equivalent to our similarity measure w ij ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-139",
"text": "Among others, we experimented with PSP-based similarity."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-140",
"text": "For consistency with (Pang and Lee, 2005) , supervised metric labeling results with this measure are reported under 'reg+PSP."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-141",
"text": "' Note this method does not use unlabeled data for training either."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-142",
"text": "PSP i is defined in (Pang and Lee, 2005) as the percentage of positive sentences in review x i ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-143",
"text": "The similarity between reviews x i , x j is the cosine angle Figure 2 : PSP for reviews expressing each fine-grain rating."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-144",
"text": "We identified positive sentences using SVM instead of Na\u00efve Bayes, but the trend is qualitatively the same as in (Pang and Lee, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-145",
"text": "between the vectors (PSP i , 1\u2212PSP i ) and (PSP j , 1\u2212 PSP j )."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-146",
"text": "Positive sentences are identified using a binary classifier trained on a separate \"snippet data set\" located at the same URL as above."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-147",
"text": "The snippet data set contains 10662 short quotations taken from movie reviews appearing on the rottentomatoes.com Web site."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-148",
"text": "Each snippet is labeled positive or negative based on the rating of the originating review."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-149",
"text": "Pang and Lee (2005) trained a Na\u00efve Bayes classifier."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-150",
"text": "They showed that PSP is a (noisy) measure for comparing reviews-reviews with low ratings tend to receive low PSP scores, and those with higher ratings tend to get high PSP scores."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-151",
"text": "Thus, two reviews with a high PSP-based similarity are expected to have similar ratings."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-152",
"text": "For our experiments we derived PSP measurements in a similar manner, but using a linear SVM classifier."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-153",
"text": "We observed the same relationship between PSP and ratings (Figure 2) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-154",
"text": "The metric labeling method has parameters (the equivalent of k, \u03b1 in our model)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-155",
"text": "Pang and Lee tuned them on a per-author basis using cross validation but did not report the optimal parameters."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-156",
"text": "We were interested in learning a single set of parameters for use with all authors."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-157",
"text": "In addition, since we varied labeled set size, it is convenient to tune c = k/|L|, the fraction of labeled reviews used as neighbors, instead of k. We then used the same c, \u03b1 for all authors at all labeled set sizes in experiments involving PSP."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-158",
"text": "Because c is fixed, k varies directly with |L| (i.e., when less labeled data is available, our algorithm considers fewer nearby labeled examples)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-159",
"text": "In an attempt to reproduce the findings in (Pang and Lee, 2005) , we tuned c, \u03b1 with cross validation."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-160",
"text": "Tuning ranges are c \u2208 {0.05, 0.1, 0.15, 0.2, 0.25, 0.3} and \u03b1 \u2208 {0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0}. The optimal parameters we found are c = 0.2 and \u03b1 = 1.5. (In section 4.4, we discuss an alternative similarity measure, for which we re-tuned these parameters.)"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-161",
"text": "Note that we learned a single set of shared parameters for all authors, whereas (Pang and Lee, 2005) tuned k and \u03b1 on a per-author basis."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-162",
"text": "To demonstrate that our implementation of metric labeling produces comparable results, we also determined the optimal author-specific parameters."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-163",
"text": "Table 1 shows the accuracy obtained over 20 trials with |L| = 0.9n for each author, using SVM regression, reg+PSP using shared c, \u03b1 parameters, and reg+PSP using authorspecific c, \u03b1 parameters (listed in parentheses)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-164",
"text": "The best result in each row of the table is highlighted in bold."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-165",
"text": "We also show in bold any results that cannot be distinguished from the best result using a paired t-test at the 0.05 level."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-166",
"text": "(Pang and Lee, 2005) found that their metric labeling method, when applied to the 4-class data we are using, was not statistically better than regression, though they observed some improvement for authors (c) and (d)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-167",
"text": "Using author-specific parameters, we obtained the same qualitative result, but the improvement for (c) and (d) appears even less significant in our results."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-168",
"text": "Possible explanations for this difference are the fact that we derived our PSP measurements using an SVM classifier instead of an NB classifier, and that we did not use the same range of parameters for tuning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-169",
"text": "The optimal shared parameters produced almost the same results as the optimal author-specific parameters, and were used in subsequent experiments."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-171",
"text": "**SEMI-SUPERVISED LEARNING**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-172",
"text": "We used the same PSP-based similarity measure and the same shared parameters c = 0.2, \u03b1 = 1.5 from our metric labeling experiments to perform graph-based semi-supervised learning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-173",
"text": "Table 1 : Accuracy using shared (c = 0.2, \u03b1 = 1.5) vs. author-specific parameters, with |L| = 0.9n."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-174",
"text": "additional parameters k , \u03b2, and M ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-175",
"text": "Again we tuned k , \u03b2 with cross validation."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-176",
"text": "Tuning ranges are k \u2208 {2, 3, 5, 10, 20} and \u03b2 \u2208 {0.001, 0.01, 0.1, 1.0, 10.0}. The optimal parameters are k = 5 and \u03b2 = 1.0."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-177",
"text": "These were used for all authors and for all labeled set sizes."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-178",
"text": "Note that unlike k = c|L|, which decreases as the labeled set size decreases, we let k remain fixed for all |L|."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-179",
"text": "We set M arbitrarily to a large number 10 8 to ensure that the ratings of labeled reviews are respected."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-181",
"text": "**ALTERNATE SIMILARITY MEASURES**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-182",
"text": "In addition to using PSP as a similarity measure between reviews, we investigated several alternative similarity measures based on the cosine of word vectors."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-183",
"text": "Among these options were the cosine between the word vectors used to train the SVM regressor, and the cosine between word vectors containing only words with high (top 1000 or top 5000) mutual information values."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-184",
"text": "The mutual information is computed with respect to the positive and negative classes in the 10662-document \"snippet data set.\" Finally, we experimented with using as a similarity measure the cosine between word vectors containing all words, each weighted by its mutual information."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-185",
"text": "We found this measure to be the best among the options tested in pilot trial runs using the metric labeling algorithm."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-186",
"text": "Specifically, we scaled the mutual information values such that the maximum value was one."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-187",
"text": "Then, we used these values as weights for the corresponding words in the word vectors."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-188",
"text": "For words in the movie review data set that did not appear in the snippet data set, we used a default weight of zero (i.e., we excluded them."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-189",
"text": "We experimented with setting the default weight to one, but found this led to inferior performance.)"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-190",
"text": "We repeated the experiments described in sections 4.2 and 4.3 with the only difference being that we used the mutual-information weighted word vector similarity instead of PSP whenever a similarity measure was required."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-191",
"text": "We repeated the tuning procedures described in the previous sections."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-192",
"text": "Using this new similarity measure led to the optimal parameters c = 0.1, \u03b1 = 1.5, k = 5, and \u03b2 = 10.0."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-193",
"text": "The results are reported under 'reg+WV' and 'SSL+WV,' respectively."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-194",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-195",
"text": "**RESULTS**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-196",
"text": "We tested the five algorithms for all four authors using each of the nine labeled set sizes."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-197",
"text": "The results are presented in table 2."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-198",
"text": "Each entry in the table represents the average accuracy across 20 trials for an author, a labeled set size, and an algorithm."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-199",
"text": "The best result in each row is highlighted in bold."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-200",
"text": "Any results on the same row that cannot be distinguished from the best result using a paired t-test at the 0.05 level are also bold."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-201",
"text": "The results indicate that the graph-based semisupervised learning algorithm based on PSP similarity (SSL+PSP) achieved better performance than all other methods in all four author corpora when only 200, 100, 50, 25, or 12 labeled documents were available."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-202",
"text": "In 19 out of these 20 learning scenarios, the unlabeled set accuracy by the SSL+PSP algorithm was significantly higher than all other methods."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-203",
"text": "While accuracy generally degraded as we trained on less labeled data, the decrease for the SSL approach was less severe through the mid-range labeled set sizes."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-204",
"text": "SSL+PSP remains among the best methods with only 6 labeled examples."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-205",
"text": "Note that the SSL algorithm appears to be quite sensitive to the similarity measure used to form the graph on which it is based."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-206",
"text": "In the experiments where we used mutual-information weighted word vector similarity (reg+WV and SSL+WV), we notice that reg+WV remained on par with reg+PSP at high labeled set sizes, whereas SSL+WV appears significantly worse in most of these cases."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-207",
"text": "It is clear that PSP is the more reliable similarity measure."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-208",
"text": "SSL uses the similarity measure in more ways than the metric labeling approaches (i.e., SSL's graph is denser), so it is not surprising that SSL's accuracy would suffer more with an inferior similarity measure."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-209",
"text": "Unfortunately, our SSL approach did not do as well with large labeled set sizes."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-210",
"text": "We believe this Table 2 : 20-trial average unlabeled set accuracy for each author across different labeled set sizes and methods."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-211",
"text": "In each row, we list in bold the best result and any results that cannot be distinguished from it with a paired t-test at the 0.05 level."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-212",
"text": "is due to two factors: a) the baseline SVM regressor trained on a large labeled set can achieve fairly high accuracy for this difficult task without considering pairwise relationships between examples; b) PSP similarity is not accurate enough."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-213",
"text": "Gain in variance reduction achieved by the SSL graph is offset by its bias when labeled data is abundant."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-214",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-215",
"text": "**DISCUSSION**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-216",
"text": "We have demonstrated the benefit of using unlabeled data for rating inference."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-217",
"text": "There are several directions to improve the work: 1. We will investigate better document representations and similarity measures based on parsing and other linguistic knowledge, as well as reviews' sentiment patterns."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-218",
"text": "For example, several positive sentences followed by a few concluding negative sentences could indicate an overall negative review, as observed in prior work (Pang and Lee, 2005) ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-219",
"text": "2."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-220",
"text": "Our method is transductive: new reviews must be added to the graph before they can be classified."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-221",
"text": "We will extend it to the inductive learning setting based on ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-222",
"text": "3. We plan to experiment with cross-reviewer and cross-domain analysis, such as using a model learned on movie reviews to help classify product reviews."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-16",
"text": "Pang and Lee showed that supervised machine learning techniques (classification and regression) work well for rating inference with large amounts of training data."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-17",
"text": "However, review documents often do not come with numerical ratings."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-18",
"text": "We call such documents unlabeled data."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-19",
"text": "Standard supervised machine learning algorithms cannot learn from unlabeled data."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-20",
"text": "Assigning labels can be a slow and expensive process because manual inspection and domain expertise are needed."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-21",
"text": "Often only a small portion of the documents can be labeled within resource constraints, so most documents remain unlabeled."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-22",
"text": "Supervised learning algorithms trained on small labeled sets suffer in performance."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-23",
"text": "Can one use the unlabeled reviews to improve rating-inference?"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-24",
"text": "Pang and Lee (2005) suggested that doing so should be useful."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-25",
"text": "We demonstrate that the answer is 'Yes.' Our approach is graph-based semi-supervised learning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-26",
"text": "Semi-supervised learning is an active research area in machine learning."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-27",
"text": "It builds better classifiers or regressors using both labeled and unlabeled data, under appropriate assumptions (Zhu, 2005; Seeger, 2001 )."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-28",
"text": "This paper contains three contributions:"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-29",
"text": "\u2022 We present a novel adaptation of graph-based semi-supervised learning (Zhu et al., 2003) to the sentiment analysis domain, extending past supervised learning work by Pang and Lee (2005) ;"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-30",
"text": "\u2022 We design a special graph which encodes our assumptions for rating-inference problems (section 2), and present the associated optimization problem in section 3;"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-31",
"text": "\u2022 We show the benefit of semi-supervised learning for rating inference with extensive experimental results in section 4."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-33",
"text": "**A GRAPH FOR SENTIMENT CATEGORIZATION**"
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-34",
"text": "The semi-supervised rating-inference problem is formalized as follows."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-35",
"text": "There are n review documents x 1 . . ."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-36",
"text": "x n , each represented by some standard feature representation (e.g., word-presence vectors)."
},
{
"sent_id": "c384f48d5f04ea8d63bbbb94a3b24b-C001-37",
"text": "Without loss of generality, let the first l \u2264 n documents be labeled with ratings y 1 . . ."
}
],
"y": {
"@EXT@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-29"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-29"
]
},
"@USE@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-47"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-49"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-117"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-128"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-142"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-47",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-49",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-117",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-128",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-142"
]
},
"@MOT@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-49",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-50"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-49"
]
},
"@SIM@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-108"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-140"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-144"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-159"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-108",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-140",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-144",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-159"
]
},
"@DIF@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-112"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-161"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-112",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-161"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-123"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-123"
]
},
"@BACK@": {
"gold_contexts": [
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-166"
],
[
"c384f48d5f04ea8d63bbbb94a3b24b-C001-218"
]
],
"cite_sentences": [
"c384f48d5f04ea8d63bbbb94a3b24b-C001-166",
"c384f48d5f04ea8d63bbbb94a3b24b-C001-218"
]
}
}
},
"ABC_f24dde456e02fdb8e65799685275d2_3": {
"x": [
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-46",
"text": "**APPROACH**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-47",
"text": "In this section, we describe the proposed framework and its variations."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-2",
"text": "In this paper, we apply a general deep learning (DL) framework for the answer selection task, which does not depend on manually defined features or linguistic tools."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-3",
"text": "The basic framework is to build the embeddings of questions and answers based on bidirectional long short-term memory (biLSTM) models, and measure their closeness by cosine similarity."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-4",
"text": "We further extend this basic model in two directions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-5",
"text": "One direction is to define a more composite representation for questions and answers by combining convolutional neural network with the basic framework."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-6",
"text": "The other direction is to utilize a simple but efficient attention mechanism in order to generate the answer representation according to the question context."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-7",
"text": "Several variations of models are provided."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-8",
"text": "The models are examined by two datasets, including TREC-QA and InsuranceQA."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-9",
"text": "Experimental results demonstrate that the proposed models substantially outperform several strong baselines."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-12",
"text": "The answer selection problem can be formulated as follows: Given a question q and an answer candidate pool {a 1 , a 2 , \u00b7 \u00b7 \u00b7 , a s } for this question, we aim to search for the best answer candidate a k , where 1 \u2264 k \u2264 s. An answer is a token sequence with an arbitrary length, and a question can correspond to multiple ground-truth answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-13",
"text": "In testing, the candidate answers for a question may not be observed in the training phase."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-14",
"text": "Answer selection is one of the essential components in typical question answering (QA) systems."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-15",
"text": "It is also a stand-alone task with applications in knowledge base construction and information extraction."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-16",
"text": "The major challenge of this task is that the correct answer might not directly share lexical units with the question."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-17",
"text": "Instead, they may only be semantically related."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-18",
"text": "Moreover, the answers are sometimes noisy and contain a large amount of unrelated information."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-19",
"text": "Recently, deep learning models have obtained a significant success on various natural language processing tasks, such as semantic analysis (Tang et al., 2015) , machine translation (Bahdanau et al., 2015) and text summarization (Rush et al., 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-20",
"text": "In this paper, we propose a deep learning framework for answer selection which does not require any feature engineering, linguistic tools, or external resources."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-21",
"text": "This framework is based on building bidirectional long short term memory (biLSTM) models on both questions and answers respectively, connecting with a pooling layer and utilizing a similarity metric to measure the matching degree."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-22",
"text": "We improve this basic model from two perspectives."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-23",
"text": "Firstly, a simple pooling layer may suffer from the incapability of keeping the local linguistic information."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-24",
"text": "In order to obtain better embeddings for the questions and answers, we build a convolutional neural network (CNN) structure on top of biLSTM."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-25",
"text": "Secondly, in order to better distinguish candidate answers according to the question, we introduce a simple but efficient attention model to this framework for the answer embedding generation according to the question context."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-26",
"text": "We report experimental results for two answer selection datasets: (1) InsuranceQA (Feng et al., 2015) 1 , a recently released large-scale non-factoid QA dataset from the insurance domain."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-27",
"text": "The rest of the paper is organized as follows: Section 2 describes the related work for answer selection; Section 3 provides the details of the proposed models; Experimental settings and results of InsuranceQA and TREC-QA datasets are discussed in section 4 and 5 respectively; Finally, we draw conclusions in section 6."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-29",
"text": "**RELATED WORK**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-30",
"text": "Previous work on answer selection normally used feature engineering, linguistic tools, or external resources."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-31",
"text": "For example, semantic features were constructed based on WordNet in (Yih et al., 2013) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-32",
"text": "This model pairs semantically related words based on word semantic relations."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-33",
"text": "In (Wang & Manning, 2010; Wang et al., 2007) , the answer selection problem is transformed to a syntactical matching between the question/answer parse trees."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-34",
"text": "Some work tried to fulfill the matching using minimal edit sequences between dependency parse trees (Heilman & Smith, 2010; Yao et al., 2013) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-35",
"text": "Recently, discriminative tree-edit features extraction and engineering over parsing trees were automated in (Severyn & Moschitti, 2013) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-36",
"text": "While these methods show effectiveness, they might suffer from the availability of additional resources, the effort of feature engineering and the systematic complexity by introducing linguistic tools, such as parse trees and dependency trees."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-37",
"text": "There were prior methods using deep learning technologies for the answer selection task."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-38",
"text": "The approaches for non-factoid question answering generally pursue the solution on the following directions: Firstly, the question and answer representations are learned and matched by certain similarity metrics (Feng et al., 2015; Yu et al., 2014; dos Santos et al., 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-39",
"text": "Secondly, a joint feature vector is constructed based on both the question and the answer, and then the task can be converted into a classification or learning-to-rank problem (Wang & Nyberg, 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-40",
"text": "Finally, recently proposed models for textual generation can intrinsically be used for answer selection and generation (Bahdanau et al., 2015; Vinyals & Le, 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-41",
"text": "The framework proposed in this work belongs to the first category."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-42",
"text": "There are two major differences between our approaches and the work in (Feng et al., 2015) : (1) The architectures developed in (Feng et al., 2015) are only based on CNN, whereas our models are based on bidirectional LSTMs, which are more capable of exploiting long-range sequential context information."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-43",
"text": "Moreover, we also integrate the CNN structures on the top of biLSTM for better performance."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-44",
"text": "(2) Feng et al. (2015) tackle the question and answer independently, while the proposed structures develop an efficient attentive models to generate answer embeddings according to the question."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-48",
"text": "We first introduce the general framework, which is to build bi-directional LSTM on both questions and their answer candidates, and then use the similarity metric to measure the distance of question answer pairs."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-49",
"text": "In the following two subsections, we extend the basic model in two independent directions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-51",
"text": "**BASIC MODEL: QA-LSTM**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-52",
"text": "Long Short-Term Memory (LSTM): Recurrent Neural Networks (RNN) have been widely exploited to deal with variable-length sequence input."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-53",
"text": "The long-distance history is stored in a recurrent hidden vector which is dependent on the immediate previous hidden vector."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-54",
"text": "LSTM (Hochreiter & Schmidhuber, 1997 ) is one of the popular variations of RNN to mitigate the gradient vanish problem of RNN."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-55",
"text": "Our LSTM implementation is similar to the one in (Graves et al., 2013 ) with minor modification."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-56",
"text": "Given an input sequence x = {x(1), x(2), \u00b7 \u00b7 \u00b7 , x(n)}, where x(t) is an E-dimension word vector in this paper."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-57",
"text": "The hidden vector h(t) ( the size is H ) at the time step t is updated as follows."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-58",
"text": "In the LSTM architecture, there are three gates (input i, forget f and output o), and a cell memory vector c. \u03c3 is the sigmoid function."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-59",
"text": "The input gate can determine how incoming vectors x t alter the state of the memory cell."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-60",
"text": "The output gate can allow the memory cell to have an effect on the outputs."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-61",
"text": "Finally, the forget gate allows the cell to remember or forget its previous state."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-62",
"text": "W \u2208 R H\u00d7E , U \u2208 R H\u00d7H and b \u2208 R H\u00d71 are the network parameters."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-63",
"text": "Bidirectional Long Short-Term Memory (biLSTM): Single direction LSTMs suffer a weakness of not utilizing the contextual information from the future tokens."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-64",
"text": "Bidirectional LSTM utilizes both the previous and future context by processing the sequence on two directions, and generate two independent sequences of LSTM output vectors."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-65",
"text": "One processes the input sequence in the forward direction, while the other processes the input in the reverse direction."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-66",
"text": "The output at each time step is the concatenation of the two output vectors from both directions, ie."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-68",
"text": "**QA-LSTM:**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-69",
"text": "The basic model in this work is shown in Figure 1 ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-70",
"text": "BiLSTM generates distributed representations for both the question and answer independently, and then utilize cosine similarity to measure their distance."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-71",
"text": "Following the same ranking loss in (Feng et al., 2015; Weston et al., 2014; Hu et al., 2014) , we define the training objective as a hinge loss."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-72",
"text": "where a + is a ground truth answer, a \u2212 is an incorrect answer randomly chosen from the entire answer space, and M is constant margin."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-73",
"text": "We treat any question with more than one ground truth as multiple training examples, each for one ground truth."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-74",
"text": "There are three simple ways to generate representations for questions and answers based on the word-level biLSTM outputs: (1) Average pooling; (2) max pooling; (3) the concatenation of the last vectors on both directions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-75",
"text": "The three strategies are compared with the experimental performance in Section 5."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-76",
"text": "Dropout operation is performed on the QA representations before cosine similarity matching."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-77",
"text": "Finally, from preliminary experiments, we observe that the architectures, in which both question and answer sides share the same network parameters, is significantly better than the one that the question and answer sides own their own parameters separately, and converges much faster."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-78",
"text": "As discussed in (Feng et al., 2015) , this is reasonable, because for a shared layer network, the corresponding elements in question and answer vectors represent the same biLSTM outputs."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-79",
"text": "While for the network with separate question and answer parameters, there is no such constraint and the model has doublesized parameters, making it difficult to learn for the optimizer."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-80",
"text": "In the previous subsection, we generate the question and answer representations only by simple operations, such as max or mean pooling."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-81",
"text": "In this subsection, we resort to a CNN structure built on the outputs of biLSTM, in order to give a more composite representation of questions and answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-82",
"text": "The structure of CNN in this work is similar to the one in (Feng et al., 2015) , as shown in Figure 2 ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-83",
"text": "Unlike the traditional forward neural network, where each output is interactive with each input, the convolutional structure only imposes local interactions between the inputs within a filter size m."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-84",
"text": "In this work, for every window with the size of m in biLSTM output vectors, ie."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-85",
"text": ", where t is a certain time step, the convolutional filter F = [F(0) \u00b7 \u00b7 \u00b7 F(m \u2212 1)] will generate one value as follows."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-86",
"text": "where b is a bias, and F and b are the parameters of this single filter."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-87",
"text": "Same as typical CNNs, a max-k pooling layer is built on the top of the convolutional layer."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-88",
"text": "Intuitively, we want to emphasize the top-k values from each convolutional filter."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-89",
"text": "By k-MaxPooling, the maximum k values will be kept for one filter, which indicate the highest degree that a filter matches the input sequence."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-90",
"text": "Finally, there are N parallel filters, with different parameter initialization, and the convolutional layer gets N -dimension output vectors."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-91",
"text": "We get two output vectors with dimension of kN for the questions and answers respectively."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-92",
"text": "In this work, k = 1."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-93",
"text": "k > 1 did not show any obvious improvement in our early experiments."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-94",
"text": "The intuition of this structure is, instead of evenly considering the lexical information of each token as the previous subsection, we emphasize on certain parts of the answer, such that QA-LSTM/CNN can more effectively differentiate the ground truths and incorrect answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-96",
"text": "**ATTENTION-BASED QA-LSTM**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-97",
"text": "In the previous subsection, we described one extension from the basic model, which targets at providing more composite embeddings for questions and answers respectively."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-98",
"text": "In this subsection, we investigate an extension from another perspective."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-99",
"text": "Instead of generating QA representation independently, we leverage a simple attention model for the answer vector generation based on questions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-100",
"text": "The fixed width of hidden vectors becomes a bottleneck, when the bidirectional LSTM models must propagate dependencies over long distances over the questions and answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-101",
"text": "An attention mechanism are used to alleviate this weakness by dynamically aligning the more informative parts of answers to the questions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-102",
"text": "This strategy has been used in many other natural language processing tasks, such as machine translation (Bahdanau et al., 2015; Sutskever et al., 2014) , sentence summarization (Rush et al., 2015) and factoid question answering (Hermann et al., 2015; Sukhbaatar et al., 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-103",
"text": "Inspired by the work in (Hermann et al., 2015) , we develop a very simple but efficient word-level attention on the basic model."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-104",
"text": "Figure 3 shows the structure."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-105",
"text": "Prior to the average or mean pooling, each biLSTM output vector will be multiplied by a softmax weight, which is determined by the question embedding from biLSTM."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-106",
"text": "Specifically, given the output vector of biLSTM on the answer side at time step t, h a (t), and the question embedding, o q , the updated vector h a (t) for each answer token are formulated below."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-107",
"text": "where W am , W qm and w ms are attention parameters."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-108",
"text": "Conceptually, the attention mechanism give more weights on certain words, just like tf-idf for each word."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-109",
"text": "However, the former computes the weights according to question information."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-110",
"text": "The major difference between this approach and the one in (Hermann et al., 2015) is that Hermann et al. (2015) 's attentive reader emphasizes the informative part of supporting facts, and then uses a combined embedding of the query and the supporting facts to predict the factoid answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-111",
"text": "In this work, we directly use the attention-based representations to measure the question/answer distances."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-112",
"text": "Experiments indicate the attention mechanism can more efficiently distinguish correct answers from incorrect ones according to the question text."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-114",
"text": "**QA-LSTM/CNN WITH ATTENTION**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-115",
"text": "The two extensions introduced previously are combined in a simple manner."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-116",
"text": "First, the biLSTM hidden vectors of answers h a (t) are multiplied by s a,q (t), which is computed from the question average pooling vectors o q , and updated to h a (t), illustrated in Eq. 9-11."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-117",
"text": "Then, the original question and updated answer hidden vectors serve as inputs of CNN structure respectively, such that the question context can be used to evaluate the softmax weights of the input of CNN."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-118",
"text": "From the experiments, we observe that the two extensions vary on their contributions on the performance improvement according to different datasets."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-119",
"text": "However, QA-LSTM/CNN with attention can outperform the baselines on both datasets."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-121",
"text": "**INSURANCEQA EXPERIMENTS**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-122",
"text": "Having described a number of models in the previous section, we evaluate the proposed approaches on the insurance domain dataset, InsuranceQA, provided by Feng et al. (2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-123",
"text": "The InsuranceQA dataset provides a training set, a validation set, and two test sets."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-124",
"text": "We do not see obvious categorical differentiation between two tests' questions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-125",
"text": "One can see the details of InsuranceQA data in (Feng et al., 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-126",
"text": "We list the numbers of questions and answers of the dataset in Table 1 ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-127",
"text": "A question may correspond to multiple answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-128",
"text": "The questions are much shorter than answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-129",
"text": "The average length of questions is 7, and the average length of answers is 94."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-130",
"text": "The long answers comparing to the questions post challenges for answer selection task."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-131",
"text": "This corpus contains 24981 unique answers in total."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-132",
"text": "For the development and test sets, the dataset also includes an answer pool of 500 candidate answers for each question."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-133",
"text": "These answer pools were constructed by including the correct answer(s) and randomly selecting candidate from the complete set of unique answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-134",
"text": "The top-1 accuracy of the answer pool is reported."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-136",
"text": "**SETUP**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-137",
"text": "The models in this work are implemented with Theano (Bastien et al., 2012) from scratch, and all experiments are processed in a GPU cluster."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-138",
"text": "We use the accuracy on validation set to locate the best epoch and best hyper-parameter settings for testing."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-139",
"text": "The word embedding is trained by word2vec (Mikolov et al., 2013) , and the word vector size is 100."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-140",
"text": "Word embeddings are also parameters and are optimized as well during the training."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-141",
"text": "Stochastic Gradient Descent (SGD) is the optimization strategy."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-142",
"text": "We tried different margin values, such as 0.05, 0.1 and 0.2, and finally fixed the margin as 0.2."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-143",
"text": "We also tried to include l 2 norm in the training objective."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-144",
"text": "However, preliminary experiments show that regularization factors do not show any improvements."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-145",
"text": "Also, the dimension of LSTM output vectors is 141 for one direction, such that biLSTM has a comparable number of parameters with a single-direction LSTM with 200 dimension."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-146",
"text": "We train our models in mini-batches (the batch size B is 20), and the maximum length L of questions and answers is 200."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-147",
"text": "Any tokens out of this range will be discarded."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-148",
"text": "Because the questions or answers within a mini-batch may have different lengths, we resort to a mask matrix M \u2208 R B\u00d7L to indicate the real length of each token sequence."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-150",
"text": "**BASELINES**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-151",
"text": "For comparison, we report the performances of four baselines in Table 2 : two state-of-the-art non-DL approaches and two variations of a strong DL approach based on CNN as follows."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-153",
"text": "**BAG-OF-WORD:**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-154",
"text": "The idf-weighted sum of word vectors for the question and for all of its answer candidates is used as a feature vector."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-155",
"text": "Similar to this work, the candidates are re-ranked according the cosine similarity to a question."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-156",
"text": "Metzler-Bendersky IR model: A state-of-the-art weighted dependency (WD) model, which employs a weighted combination of term-based and term proximity-based ranking features to score each candidate answer."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-157",
"text": "Architecture-II in (Feng et al., 2015) : Instead of using LSTM, a CNN model is employed to learn a distributed vector representation of a given question and its answer candidates, and the answers are scored by cosine similarity with the question."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-158",
"text": "No attention model is used in this baseline."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-160",
"text": "**ARCHITECTURE-II WITH GEOMETRICMEAN OF EUCLIDEAN AND SIGMOID DOT PRODUCT (GESD):**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-161",
"text": "GESD is used to measure the distance between the question and answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-162",
"text": "This is the model which achieved the best performance in (Feng et al., 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-164",
"text": "**RESULTS AND DISCUSSIONS**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-165",
"text": "In this section, detailed analysis on experimental results are given. or attention model."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-166",
"text": "They vary on how to utilize the biLSTM output vectors to form sentential embeddings for questions and answers in shown in section 3.1."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-167",
"text": "We can observe that just concatenating of the last vectors from both direction (A) performs the worst."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-168",
"text": "It is surprised to see using maxpooling (C) is much better than average pooling (B)."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-169",
"text": "The potential reason is that the max-pooling extracts more local values for each dimension, so that more local information can be reflected on the output embeddings."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-170",
"text": "From Row (D) to (F), CNN layers are built on the top of the biLSTM with different filter numbers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-171",
"text": "We set the filter width m = 2, and we did not see better performance if we increase m to 3 or 4."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-172",
"text": "Row (F) with 4000 filters gets the best validation accuracy, obtained a comparable performance with the best baseline (Row (D) in Table 2 )."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-173",
"text": "Row F shared a highly analogous CNN structure with Architecture II in (Feng et al., 2015) , except that the later used a shallow hidden layer to transform the word embeddings into the input of CNN structure, while Row F take the output of biLSTM as CNN input."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-174",
"text": "Row (G) and (H) corresponds to QA-LSTM with the attention model."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-175",
"text": "(G) connects the output vectors of answers after attention with a max pooling layer, and (H) with an average pooling."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-176",
"text": "In comparison to Model (C), Model (G) shows over 2% improvement on both validation and Test2 sets."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-177",
"text": "With respect to the model with mean pooling layers (B), the improvement from attention is more remarkable."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-178",
"text": "Model (H) is over 8% higher on all datasets compared to (B), and gets improvements from the best baseline by 3%, 2.8% and 1.2% on the validation, Test1 and Test2 sets, respectively."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-179",
"text": "Compared to Architecture II in (Feng et al., 2015) , which involved a large number of CNN filters, (H) model also has fewer parameters."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-180",
"text": "Row (I) corresponds to section 3.4, where CNN and attention mechanism are combined."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-181",
"text": "Although compared to (F), it shows 1% improvement on all sets, we fail to see obvious improvements compared to Model (H)."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-182",
"text": "Although Model (I) achieves better number on Test2, but does not on validation and Test1."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-183",
"text": "We assume that the effective attention might have vanished during the CNN operations."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-184",
"text": "However, both (H) and (I) outperform all baselines."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-185",
"text": "We also investigate the proposed models on how they perform with respect to long answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-186",
"text": "We divide the questions of Test1 and Test2 sets into eleven buckets, according to the average length of their ground truths."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-187",
"text": "In the table of Figure 4 , we list the bucket levels and the number of questions which belong to each bucket, for example, Test1 has 165 questions, whose average ground truth lengths are 55 < L \u2264 60."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-188",
"text": "We select models of (C), (F), (H) and (I) in Table 3 for comparison."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-189",
"text": "Model (C) is without attention and sentential embeddings are formed only by max pooling."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-190",
"text": "Model (F) utilizes CNN, while model (H) and (I) integrate attention."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-191",
"text": "As shown in the left figure in Figure 4 , (C) gets better or close performance compared to other models on buckets with shorter answers (\u2264 50, \u226455, \u226460)."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-192",
"text": "However, as the ground lengths increase, the gap between (C) and other models becomes more obvious."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-193",
"text": "The similar phenomenon is also observed in the right figure for Test2."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-194",
"text": "This suggests the effectiveness of the two extensions from the basic model of QA-LSTM, especially for long-answer questions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-195",
"text": "Feng et al. (2015) report that GESD outperforms cosine similarity in their models."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-196",
"text": "However, the proposed models with GESD as similarity scores do not provide any improvement on accuracy."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-197",
"text": "Buckets \u226450 \u226455 \u226460 \u226465 \u226470 \u226480 \u226490 \u2264100 \u2264120 \u2264160 >160 Wang et al. (2007) 0.6029 0.6852 Heilman & Smith (2010) 0.6091 0.6917 Wang & Manning (2010) 0.6029 0.6852 Yao et al. (2013) 0.6307 0."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-198",
"text": "Table 4 : Test results of baselines on TREC-QA Finally, we replace the cosine similarity with a MLP structure, whose input (282x2-dimension) is the concatenation of question and answer embeddings, and the output is a single similarity score and test the modified models by a variety of hidden layer size (100,500,1000)."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-199",
"text": "We observe that the modified models not only get >10% accuracy decrease, but also converge much slower."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-200",
"text": "One possible explanation is the involvement of more network parameters by MLP makes it more difficult for training, although we believed that MLP might partially avoid the conceptual challenge of projecting questions and answers in the same high-dimensional space, introduced by cosine similarity."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-202",
"text": "**TREC-QA EXPERIMENTS**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-203",
"text": "In this section we detail our experimental setup and results using the TREC-QA dataset."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-204",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-205",
"text": "**DATA, METRICS AND BASELINES**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-206",
"text": "In this paper, we adopt TREC-QA, created by Wang et al. (2007) Following previous work on this task, we use Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) as evaluation metrics, which are calculated using the official evaluation scripts."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-207",
"text": "In Table 4 , we list the performance of some prior work on this dataset, which can be referred to (Wang & Nyberg, 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-208",
"text": "We implemented the Architecture II in (Feng et al., 2015) from scratch."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-209",
"text": "Wang & Nyberg (2015) and Feng et al. (2015) are the best baselines on MAP and MRR respectively."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-210",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-211",
"text": "**SETUP**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-212",
"text": "We keep the configurations same as those in InsuranceQA in section 4.1, except the following differences: First, we set the minibatch size as 10; Second, we set the maximum length of questions and answers as 40 instead of 200."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-213",
"text": "Third, following (Wang & Nyberg, 2015) , We use 300-dimensional vectors that were trained and provided by word2vec 3 ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-214",
"text": "Finally, we use the models from the epoch with the best MAP on the validation set for training."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-215",
"text": "Moreover, although TREC-QA dataset provided negative answer candidates for each training question, we randomly select the negative answers from all the candidate answers in the training set."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-216",
"text": "Table 5 shows the performance of the proposed models."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-217",
"text": "Compared to Model (A), which is with average pooling on top of biLSTM but without attention, Model (B) with attention improves MAP by 0.7% and MRR by approximately 2%."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-218",
"text": "The combination of CNN with QA-LSTM (Model-C) gives greater improvement on both MAP and MRR from Model (A)."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-219",
"text": "Model (D), which combines the ideas of Model (B) and (C), achieves the performance, competitive to the best baselines on MAP, and 2\u223c4% improvement on MRR compared to (Wang & Nyberg, 2015) and (Feng et al., 2015) ."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-220",
"text": "Finally, Model (E), which corresponds to the same model (D) but uses a LSTM hidden vector size of 500, achieves the best results for both metrics and outperforms the baselines."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-222",
"text": "**RESULTS**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-223",
"text": "----------------------------------"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-224",
"text": "**CONCLUSION**"
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-225",
"text": "In this paper, we study the answer selection task by employing a bidirectional-LSTM based deep learning framework."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-226",
"text": "The proposed framework does not rely on feature engineering, linguistic tools or external resources, and can be applied to any domain."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-227",
"text": "We further extended the basic framework on two directions."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-228",
"text": "Firstly, we combine a convolutional neural network into this framework, in order to give more composite representations for questions and answers."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-229",
"text": "Secondly, we integrate a simple but efficient attention mechanism in the generation of answer embeddings according to the question."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-230",
"text": "Finally, two extensions combined together."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-231",
"text": "We conduct experiments using the TREC-QA dataset and the recently published InsuranceQA dataset."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-232",
"text": "Our experimental results demonstrate that the proposed models outperform a variety of strong baselines."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-233",
"text": "In the future, we would like to further evaluate the proposed approaches for different tasks, such as answer quality prediction in Community QA and recognizing textual entailment."
},
{
"sent_id": "f24dde456e02fdb8e65799685275d2-C001-234",
"text": "With respect to the structural perspective, we plan to generate the attention mechanism to phrasal or sentential levels."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"f24dde456e02fdb8e65799685275d2-C001-26"
],
[
"f24dde456e02fdb8e65799685275d2-C001-38"
],
[
"f24dde456e02fdb8e65799685275d2-C001-71"
],
[
"f24dde456e02fdb8e65799685275d2-C001-78"
],
[
"f24dde456e02fdb8e65799685275d2-C001-122"
],
[
"f24dde456e02fdb8e65799685275d2-C001-125"
],
[
"f24dde456e02fdb8e65799685275d2-C001-208"
]
],
"cite_sentences": [
"f24dde456e02fdb8e65799685275d2-C001-26",
"f24dde456e02fdb8e65799685275d2-C001-38",
"f24dde456e02fdb8e65799685275d2-C001-71",
"f24dde456e02fdb8e65799685275d2-C001-78",
"f24dde456e02fdb8e65799685275d2-C001-122",
"f24dde456e02fdb8e65799685275d2-C001-125",
"f24dde456e02fdb8e65799685275d2-C001-208"
]
},
"@DIF@": {
"gold_contexts": [
[
"f24dde456e02fdb8e65799685275d2-C001-42",
"f24dde456e02fdb8e65799685275d2-C001-43",
"f24dde456e02fdb8e65799685275d2-C001-44"
],
[
"f24dde456e02fdb8e65799685275d2-C001-173"
],
[
"f24dde456e02fdb8e65799685275d2-C001-179"
],
[
"f24dde456e02fdb8e65799685275d2-C001-219"
]
],
"cite_sentences": [
"f24dde456e02fdb8e65799685275d2-C001-42",
"f24dde456e02fdb8e65799685275d2-C001-44",
"f24dde456e02fdb8e65799685275d2-C001-173",
"f24dde456e02fdb8e65799685275d2-C001-179",
"f24dde456e02fdb8e65799685275d2-C001-219"
]
},
"@SIM@": {
"gold_contexts": [
[
"f24dde456e02fdb8e65799685275d2-C001-82"
],
[
"f24dde456e02fdb8e65799685275d2-C001-173"
]
],
"cite_sentences": [
"f24dde456e02fdb8e65799685275d2-C001-82",
"f24dde456e02fdb8e65799685275d2-C001-173"
]
},
"@BACK@": {
"gold_contexts": [
[
"f24dde456e02fdb8e65799685275d2-C001-157"
],
[
"f24dde456e02fdb8e65799685275d2-C001-162"
],
[
"f24dde456e02fdb8e65799685275d2-C001-209"
]
],
"cite_sentences": [
"f24dde456e02fdb8e65799685275d2-C001-157",
"f24dde456e02fdb8e65799685275d2-C001-162",
"f24dde456e02fdb8e65799685275d2-C001-209"
]
}
}
},
"ABC_809ad258132199e3eae8add5d1bfdf_3": {
"x": [
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-2",
"text": "The complex compositional structure of language makes problems at the intersection of vision and language challenging."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-3",
"text": "But language also provides a strong prior that can result in good superficial performance, without the underlying models truly understanding the visual content."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-4",
"text": "This can hinder progress in pushing state of art in the computer vision aspects of multi-modal AI."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-5",
"text": "In this paper, we address binary Visual Question Answering (VQA) on abstract scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-6",
"text": "We formulate this problem as visual verification of concepts inquired in the questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-7",
"text": "Specifically, we convert the question to a tuple that concisely summarizes the visual concept to be detected in the image."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-8",
"text": "If the concept can be found in the image, the answer to the question is \"yes\", and otherwise \"no\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-74",
"text": "Visual abstraction + language."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-9",
"text": "Abstract scenes play two roles (1) They allow us to focus on the highlevel semantics of the VQA task as opposed to the low-level recognition problems, and perhaps more importantly, (2) They provide us the modality to balance the dataset such that language priors are controlled, and the role of vision is essential."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-10",
"text": "In particular, we collect fine-grained pairs of scenes for every question, such that the answer to the question is \"yes\" for one scene, and \"no\" for the other for the exact same question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-11",
"text": "Indeed, language priors alone do not perform better than chance on our balanced dataset."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-12",
"text": "Moreover, our proposed approach matches the performance of a state-of-the-art VQA approach on the unbalanced dataset, and outperforms it on the balanced dataset."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-13",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-14",
"text": "**INTRODUCTION**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-15",
"text": "Problems at the intersection of vision and language are increasingly drawing more attention."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-16",
"text": "We are witnessing a move beyond the classical \"bucketed\" recognition paradigm (e.g. label every image with categories) to rich compositional tasks involving natural language."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-17",
"text": "Some of these problems concerning vision and language have proven surprisingly easy to take on with relatively simple techniques."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-18",
"text": "Consider image captioning, which involves generating a * The first two authors contributed equally."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-19",
"text": "Figure 1 : We address the problem of answering binary questions about images."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-20",
"text": "To eliminate strong language priors that shadow the role of detailed visual understanding in visual question answering (VQA), we use abstract scenes to collect a balanced dataset containing pairs of complementary scenes: the two scenes have opposite answers to the same question, while being visually as similar as possible."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-21",
"text": "We view the task of answering binary questions as a visual verification task: we convert the question into a tuple that concisely summarizes the visual concept, which if present, result in the answer of the question being \"yes\", and otherwise \"no\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-22",
"text": "Our approach attends to relevant portions of the image when verifying the presence of the visual concept."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-23",
"text": "sentence describing a given image [12, 6, 10, 26, 21, 19, 35] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-24",
"text": "It is possible to get state of the art results with a relatively coarse understanding of the image by exploiting the statistical biases (inherent in the world and in particular datasets) that are captured in standard language models."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-25",
"text": "For example, giraffes are usually found in grass next to a tree in the MS COCO dataset images [22] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-26",
"text": "Because of this, the generic caption \"A giraffe is standing in grass next to a tree\" is applicable to most images containing a giraffe in the dataset."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-27",
"text": "The machine can confidently generate this caption just by recognizing a \"giraffe\", without recognizing \"grass\", or \"tree\", or \"standing\", or \"next to\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-28",
"text": "In general, captions borrowed from nearest neighbor images result in a surprisingly high performance [8] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-29",
"text": "A more recent task involving vision and language is Visual Question Answering (VQA)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-30",
"text": "A VQA system takes an image and a free-form natural language question about the image as input (e.g. \"What is the color of the girl's shoes?\", or \"Is the boy jumping?\"), and produces a natural language answer as its output (e.g. \"blue\", or \"yes\")."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-31",
"text": "Unlike image captioning, answering questions requires the ability to identify specific details in the image (e.g. color of an object, or activity of a person)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-32",
"text": "There are several recently proposed VQA datasets on real images e.g. [2, 24, 25, 14, 29] , as well as on abstract scenes [2] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-33",
"text": "The latter allows research on semantic reasoning without first requiring the development of highly accurate detectors."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-34",
"text": "Even in this task, however, a simple prior can give the right answer a surprisingly high percentage of the time."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-35",
"text": "For example, in the VQA dataset (with images from MS COCO) [2] , the most common sport answer \"tennis\" is the correct answer for 41% of the questions starting with \"What sport is\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-36",
"text": "Similarly, \"white\" alone is the correct answer for 23% of the questions starting with \"What color are the\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-37",
"text": "Almost half of all questions in the VQA datatset [2] can be answered correctly by a neural network that ignores the image completely and uses the question alone, relying on systematic regularities in the kinds of questions that are asked and what answers they tend to have."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-38",
"text": "This is true even for binary questions, where the answer is either \"yes\" or \"no\", such as \"Is the man asleep?\" or \"Is there a cat in the room?\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-39",
"text": "One would think that without considering the image evidence, both answers would be equally plausible."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-40",
"text": "Turns out, one can answer 68% of binary questions correctly by simply answering \"yes\" to all binary questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-41",
"text": "Moreover, a language-only neural network can correctly answer more than 78% of the binary questions, without even looking at the image."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-42",
"text": "As also discussed in [32] , such dataset bias effects can give a false impression that a system is making progress towards the goal of understanding images correctly."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-43",
"text": "Ideally, we want language to pose challenges involving the visual understanding of rich semantics while not allowing the systems to get away with ignoring the visual information."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-44",
"text": "Similar to the ideas in [15] , we propose to unbias the dataset, which would force machine learning algorithms to exploit image information in order to improve their scores instead of simply learning to game the test."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-45",
"text": "This involves not only having an equal number of \"yes\" and \"no\" answers on the test as a whole, but also ensuring that each particular question is unbiased, so that the system has no reason to believe, without bringing in visual information, that a question should be answered with \"yes\" or \"no.\" In this paper, we focus on binary (yes/no) questions for two reasons."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-46",
"text": "First, unlike open-ended questions (Q: \"what is the man playing?\" A: \"tennis\"), in binary questions (Q: \"is the man playing tennis?\") all relevant semantic informa-tion (including \"tennis\") is available in the question alone."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-47",
"text": "Thus, answering binary questions can be naturally viewed as visual verification of concepts inquired in the question (\"man playing tennis\")."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-48",
"text": "Second, binary questions are easier to evaluate than open-ended questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-49",
"text": "Although our approach of visual verification is applicable to real images (more discussion in Sec. 6), we choose to use abstract images [2, 3, 39, 38, 40] as a test bed because abstract scene images allow us to focus on high-level semantic reasoning."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-50",
"text": "They also allow us to balance the dataset by making changes to the images, something that would be difficult or impossible with real images."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-51",
"text": "Our main contributions are as follows: (1) We balance the existing abstract binary VQA dataset [2] by creating complementary scenes so that all questions 1 have an answer of \"yes\" for one scene and an answer of \"no\" for another closely related scene."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-52",
"text": "We show that a languageonly approach performs significantly worse on this balanced dataset."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-53",
"text": "(2) We propose an approach that summarizes the content of the question in a tuple form which concisely describes the visual concept whose existence is to be verified in the scene."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-54",
"text": "We answer the question by verifying if the tuple is depicted in the scene or not (See Fig. 1 )."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-55",
"text": "We present results when training and testing on the balanced and unbalanced datasets."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-57",
"text": "**RELATED WORK**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-58",
"text": "Visual question answering."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-59",
"text": "Recent work has proposed several datasets and methods to promote research on the task of visual question answering [15, 4, 33, 24, 2, 25, 14, 29] , ranging from constrained settings [15, 24, 29] to freeform natural language questions and answers [4, 33, 2, 25, 14] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-60",
"text": "For example, [15] proposes a system to generate binary questions from templates using a fixed vocabulary of objects, attributes, and relationships between objects."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-61",
"text": "[33] has studied joint parsing of videos and corresponding text to answer queries about videos."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-62",
"text": "[24] studied VQA with synthetic (templated) and human-generated questions, both with the restriction of answers being limited to 16 colors and 894 object categories or sets of categories."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-63",
"text": "A number of recent papers [2, 14, 25, 29] proposed neural network models for VQA composing LSTMs (for questions) and CNNs (for images)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-64",
"text": "[2] introduced a large-scale dataset for free-form and open-ended VQA, along with several natural VQA models."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-65",
"text": "[4] uses crowdsourced workers to answer questions about visual content asked by visually-impaired users."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-66",
"text": "Data augmentation."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-67",
"text": "Classical data augmentation techniques (such as mirroring, cropping) have been widely used in past few years [18, 31] to provide high capacity models additional data to learn from."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-68",
"text": "These transformations are designed to not change the label distribution in the training data."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-69",
"text": "In this work, we \"augment\" our dataset to explicitly change the label distribution."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-70",
"text": "We use human subjects to collect additional scenes such that every question in our dataset has equal number of 'yes' and 'no' answers (to the extent possible)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-71",
"text": "In that sense, our approach can be viewed as semantic data augmentation."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-72",
"text": "Several classification datasets, such as ImageNet [7] try to be balanced. But this is infeasible for the VQA task on real images because of the heavytail of concepts captured by language."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-73",
"text": "This motivates our use of abstract scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-75",
"text": "A number of works have used abstract scenes to focus on high-level semantics and study its connection with other modalities such as language [23, 16, 38, 40, 39, 3, 13, 34] , including automatically describing abstract scenes [16] , generating abstract scenes that depict a description [39] , capturing common sense [13, 23, 34] , learning models of fine-grained interactions between people [3] , and learning the semantic importance of visual features [38, 40] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-76",
"text": "Some of these works have also taken advantage of visual abstraction to \"control\" the distribution of data, for example, [3] collects equal number of examples for each verb/preposition combinations, and [38] have multiple scenes that depict the exact same sentence/caption."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-77",
"text": "Similarly, we balance the dataset by making sure that each question in the dataset has a scene for \"yes\" and another scene for \"no\" to the extent possible."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-78",
"text": "Visual verification."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-79",
"text": "[30, 34] reason about the plausibility of commonsense assertions (men, ride, elephants) by gathering visual evidence for them in real images [30] and abstract scenes [34] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-80",
"text": "In contrast, we focus on visually-grounded image-specific questions like \"Is the man in the picture riding an elephant?\". [39] also reasons about relations between two objects, and maps these relations to visual features."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-81",
"text": "They take as input a description and automatically generate a scene that is compatible with all tuples in the description and is a plausible scene."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-82",
"text": "In our case, we have a single tuple (summary of the question) and we want to verify if it exists in a given image or not, for the goal of answering a free form \"yes/no\" question about the image."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-83",
"text": "Visual attention involves searching and attending to relevant image regions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-84",
"text": "[20, 36] uses alignment/attention for image caption generation."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-85",
"text": "Input is just an image, and they try to describe the entire image and local regions with phrases and sentences."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-86",
"text": "We address a different problem: visual question answering."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-87",
"text": "We are given an image and text (a question) as input."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-88",
"text": "We want to align parts of the question to regions in the image so as to extract detailed visual features of the regions of the image being referred to in the text."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-90",
"text": "**DATASETS**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-91",
"text": "We first describe the VQA dataset for abstract scenes collected by [2] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-92",
"text": "We then describe how we balance this dataset by collecting more scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-94",
"text": "**VQA DATASET ON ABSTRACT SCENES**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-95",
"text": "Abstract library."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-96",
"text": "The clipart library contains 20 \"paperdoll\" human models [3] spanning genders, races, and ages with 8 different expressions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-97",
"text": "The limbs are adjustable to allow for continuous pose variations."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-98",
"text": "In addition to humans, the library contains 99 objects and 31 animals in various poses."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-99",
"text": "The library contains two different scene types -\"indoor\" scenes, containing only indoor objects, e.g. desk, table, etc., and \"outdoor\" scenes, which contain outdoor objects, e.g. pond, tree, etc."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-100",
"text": "The two different scene types are indicated by different background in the scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-101",
"text": "VQA abstract dataset consists of 50K abstract scenes, with 3 questions for each scene, with train/val/test splits of 20K/10K/20K scenes respectively."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-102",
"text": "This results in total 60K train, 30K validation and 60K test questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-103",
"text": "Each question has 10 human-provided ground-truth answers."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-104",
"text": "Questions are categorized into 3 types -'yes/no', 'number', and 'other'."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-105",
"text": "In this paper, we focus on 'yes/no' questions, which gives us a dataset of 36,717 questions-24,396 train and 12,321 val questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-106",
"text": "Since test annotations are not publicly available, it is not possible to find the number of 'yes/no' type questions in test set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-107",
"text": "We use the binary val questions as our unbalanced test set, a random subset of 2,439 training questions as our unbalanced validation set, and rest of the training questions as our unbalanced train set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-109",
"text": "**BALANCING ABSTRACT BINARY VQA DATASET**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-110",
"text": "We balance the abstract VQA dataset by posing a counterfactual task -given an abstract scene and a binary question, what would the scene have looked like if the answer to the binary question was different? While posing such counterfactual questions and obtaining corresponding scenes is nearly impossible in real images, abstract scenes allow us to perform such reasoning."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-111",
"text": "We conducted the following Mechanical Turk study -given an abstract scene, and an associated question from the VQA dataset, we ask subjects to modify the clipart scene such that the answer changes from 'yes' to 'no' (or 'no' to 'yes')."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-112",
"text": "For example, for the question \"Is a cloud covering the sun?\", a worker can move the 'sun' into open space in the scene to change the answer from 'yes' to 'no'."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-113",
"text": "A snapshot of the interface is shown in Fig. 2 ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-114",
"text": "We ask the workers to modify the scene as little as possible."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-115",
"text": "We encourage minimal changes because these complementary scenes can be thought of as hard-negatives/positives to learn subtle differences in the visual signal that are relevant to answering questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-116",
"text": "This signal can be used as additional supervision for training models such as [37, 9, 28, 5] that can leverage explanations provided by the annotator in addition to labels."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-117",
"text": "Our complementary scenes can also be thought of as analogous to good pedagogical techniques where a learner is taught concepts by changing one thing at a time via contrasting (e.g., one fish vs. two fish, red ball vs. blue ball, etc.)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-118",
"text": "Full instructions on our interface can be found in supp."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-119",
"text": "material."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-120",
"text": "Note that there are some (scene, question) pairs that do not lend themselves to easy creation of complementary scenes with the existing clipart library."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-168",
"text": "We extract PRS tuples from all binary questions in the training data."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-121",
"text": "For instance, if the question is \"Is it raining?\", and the answer needs to be changed from 'no' to 'yes', it is not possible to create 'rain' in the current clipart library."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-122",
"text": "Fortunately, these scenes make up a small minority of the dataset (e.g., 6% of the test set)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-123",
"text": "To keep the balanced train and test set comparable to unbalanced ones in terms of size, we collect complementary scenes for \u223chalf of the respective splits -11,760 from train and 6,000 from test set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-124",
"text": "Since Turkers indicated that 2,137 scenes could not be modified to change the answer because of limited clipart library, we do not have complementary scenes for them."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-125",
"text": "In total, we have 10,295 complementary scenes for the train set and 5,328 complementary scenes for test, resulting in balanced train set containing 22,055 samples and balanced test set containing 11,328 samples."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-126",
"text": "We further split a balanced set of 2,202 samples from balanced train set for validation purposes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-127",
"text": "Examples from our balanced dataset are shown in Fig. 1 and Fig. 4 ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-128",
"text": "We use the publicly released VQA evaluation script in our experiments."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-129",
"text": "The evaluation metric uses 10 ground-truth answers for each question to compute performance."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-130",
"text": "To be consistent with the VQA dataset, we collected 10 answers from human subjects using AMT for all complementary scenes in the balanced test set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-131",
"text": "We compare the degree of balance in our unbalanced and balanced datasets."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-132",
"text": "We find that 92.65% of the (scene, question) pairs in the unbalanced test set do not have a corresponding complementary scene (where the answer to the same question is the opposite)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-133",
"text": "Only 20.48% of our balanced test set does not have corresponding complementary scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-134",
"text": "Note that our dataset is not 100% balanced either because there are some scenes which could not be modified to flip the answers to the questions (5.93%) or because the most common answer out of 10 human annotated answers for some questions does not match with the intended answer of the person creating the complementary scene (14.55%) either due to inter-human disagreement, or if the worker did not succeed in creating a good scene."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-136",
"text": "**APPROACH**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-137",
"text": "We present an overview of our approach before describing each step in detail in the following subsections."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-138",
"text": "To answer binary questions about images, we propose a two-step approach: (1) Language Parsing: where the question is parsed into a tuple, and (2) Visual Verification: where we verify whether that tuple is present in the image or not."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-139",
"text": "Our language parsing step summarizes a binary question into a tuple of the form , where P refers to primary object, R to relation and S to secondary object, e.g. for a binary question \"Is there a cat in the room?\", our goal is to extract a tuple of the form: . Tuples need not have all the arguments present."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-140",
"text": "For instance, \"Is the dog asleep\" \u2192 , The primary argument P is always present."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-141",
"text": "Since we only focus on binary questions, this extracted tuple captures the entire visual concept to be verified in the image."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-169",
"text": "Among the three arguments, P and S contain noun phrases."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-142",
"text": "If the concept is depicted in the image, the answer is \"yes\", otherwise the answer is \"no\". Once we extract tuples from questions (details in Sec. 4.1), we align the P and S arguments to objects in the image (Sec. 4.2)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-143",
"text": "We then extract text and image features (Sec"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-145",
"text": "**TUPLE EXTRACTION**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-146",
"text": "In this section, we describe how we extract
tuples from raw questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-147",
"text": "Existing NLP work such as [11] has studied this problem, however, these approaches are catered towards statements, and are not directly applicable to questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-148",
"text": "We only give an overview of our method, more details can be found in supp."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-149",
"text": "material."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-150",
"text": "Parsing: We use the Stanford parser to parse the question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-151",
"text": "Each word is assigned an entity, e.g. nominal subject (\"nsubj\"), direct object (\"dobj\"), etc."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-152",
"text": "We remove all characters other than letters and digits before parsing."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-153",
"text": "Summarizing: As an intermediate step, we first convert a question into a \"summary\", before converting that into a tuple."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-154",
"text": "First, we remove a set of \"stop words\" such as determiners (\"some\", \"the\", etc.) and auxillary verbs (\"is\", \"do\", etc.)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-155",
"text": "Our full list of stop words is provided in supp."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-156",
"text": "material."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-157",
"text": "Next, following common NLP practice, we remove all words before a nominal subject (\"nsubj\") or a passive nominal subject (\"nsubjpass\")."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-158",
"text": "For example, \"Is the woman on couch petting the dog?\" is parsed as \"Is(aux) the(det) woman(nsubj) on(case) couch(nmod) petting(root) the(det) dog(dobj)?\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-159",
"text": "The summary of this question can be expressed as (woman, on, couch, petting, dog)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-160",
"text": "Extracting tuple: Now that we have extracted a summary of each question, next we split it into PRS arguments."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-161",
"text": "Ideally, we would like P and S to be noun phrases (\"woman on couch\", \"dog\") and the relation R to be a verb phrase (\"petting\") or a preposition (\"in\") when the verb is a form of \"to be\"."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-162",
"text": "For example, , or . Thus, we apply the Hunpos Part of Speech (POS) tagger [17] to assign words to appropriate arguments of the tuple."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-163",
"text": "See supp."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-164",
"text": "material for details."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-166",
"text": "**ALIGNING OBJECTS TO PRIMARY (P) AND SECONDARY (S) ARGUMENTS**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-167",
"text": "In order to extract visual features that describe the objects in the scene being referred to by P and S, we need to align each of them with the image."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-170",
"text": "To determine which objects are being referred to by the P and S arguments, we follow the idea in [39] and compute the mutual information 2 between word occurrence (e.g. \"dog\"), and object occurrence (e.g. clipart piece #32)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-171",
"text": "We only consider P and S arguments that occur at least twice in the training set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-172",
"text": "At test time, given an image and a PRS tuple corresponding to a binary question, the object in the image with the highest mutual information with P is considered to be referred by the primary object, and similarly for S. If there is more than one instance of the object category in the image, we assign P/S to a random instance."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-173",
"text": "Note that for some questions with ground-truth answer 'no', it is possible that P or S actually refers to an object that is not present in the image (e.g. Question: \"Is there a cat in the image?\" Answer: \"no\")."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-174",
"text": "In such cases, some other object from images (say clipart #23, which is a table) will be aligned with P/S. However, since the category label ('table' ) of the aligned object is a feature, the model can learn to handle such cases, i.e., learn that when the question mentions 'cat' and the aligned clipart object category is 'table', the answer should be 'no'."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-175",
"text": "We found that this simple mutual information based alignment approach does surprisingly well."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-176",
"text": "This was also found in [39] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-177",
"text": "Fig. 3 shows examples of clipart objects and three words/phrases that have the highest mutual information."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-179",
"text": "**VISUAL VERIFICATION**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-180",
"text": "We have extracted PRS tuples and aligned PS to the clipart objects in the image, we can now compute a score indicating the strength of visual evidence for the concept inquired in the question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-181",
"text": "Our scoring function measures compatibility between image and text features (described in Sec. 4.4)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-182",
"text": "Our model is an ensemble of two similar models-Q-model and Tuple-model, whose common architecture is inspired from a recently proposed VQA approach [2] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-183",
"text": "Specifically, each model takes two inputs (image and question), each along a different branch."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-184",
"text": "The two models (Q-model and Tuple-model) use the same image features, but different language features."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-185",
"text": "Q-model encodes the sequential nature of the question by feeding it to an LSTM and using its 256dim hidden representation as a language embedding, while Tuple-model focuses on the important words in the question and uses concatenation of word2vec [27] embeddings (300dim) of P, R and S as the language features."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-186",
"text": "If P, R or S consist of more than one word, we use the average of the corresponding word2vec embeddings."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-187",
"text": "This 900-dimensional feature vector is passed through a fully-connected layer followed by a tanh non-linearity layer to create a dense 256dim language embedding."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-188",
"text": "The image is represented by rich semantic features, described in Sec. 4.4."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-189",
"text": "Our binary VQA model converts these image features into 256-dim with an inner-product layer, followed by a tanh layer."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-190",
"text": "This inner-product layer learns to map visual features onto the space of text features."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-191",
"text": "Now that both image and text features are in a common space, they are point-wise multiplied resulting in a 256-dim fused language+image representation."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-192",
"text": "This fused vector is then passed through two more fully-connected layers in a Multi-Layered Perceptron (MLP), which finally outputs a 2-way softmax score for the answers 'yes' and 'no'."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-193",
"text": "These predictions from the Q-model and Tuple-model are multiplied to obtain the final prediction."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-194",
"text": "Both the models are learned separately and end-to-end (including LSTM) with a cross-enptropy loss."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-195",
"text": "Our implementation uses Keras [1] ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-196",
"text": "Learning is performed via SGD with a batch-size of 32, dropout probability 0.5, and the model is trained till the validation loss plateaus."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-197",
"text": "At test time, given the question and image features, we can perform visual verification simply by performing forward pass through our network."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-198",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-199",
"text": "**VISUAL FEATURES**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-200",
"text": "We use the same features as [23] for our approach."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-201",
"text": "These visual features describe the objects in the image that are being referred to by the P and S arguments, their interactions, and the context of the scene within which these objects are present."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-202",
"text": "In particular, the feature vector for each scene has 1432 dimensions, which are composed of 563 dimensions for each primary object and secondary object, encoding object category (e.g., cat vs. dog vs. tree), instance (e.g., which particular tree), flip (i.e., facing left or right), absolute location modeled via GMMs, pose (for humans and animals), expression, age, gender and skin color (for humans), 48 dimensions for relative location between primary and secondary objects (modeled via GMMs), and 258 dimensions encoding which other object categories and instances are present in the scene around P and S."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-203",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-204",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-206",
"text": "**BASELINES**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-207",
"text": "We compare our model with several strong baselines including language-only models as well as a state-of-the-art VQA method."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-208",
"text": "Prior: Predicting the most common answer in the training set, for all test questions."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-209",
"text": "The most common answer is \"yes\" in the unbalanced set, and \"no\" in the balanced set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-210",
"text": "Blind-Q+Tuple: A language-only baseline which has a similar architecture as our approach except that each model only accepts language input and does not utilize any visual information."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-211",
"text": "Comparing our approach to Blind-Q+Tuple quantifies to what extent our model has succeeded in leveraging the image to answer questions correctly."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-212",
"text": "SOTA Q+Tuple+H-IMG: This VQA model has a similar architecture as our approach, except that it uses holistic image features (H-IMG) that describe the entire scene layout, instead of focusing on specific regions in the scene as determined by P and S. This model is analogous to the state-ofthe-art models presented in [2, 25, 29, 14] , except applied to abstract scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-213",
"text": "These holistic features include a bag-of-words for clipart objects occurrence (150-dim), human expressions (8-dim), and human poses (7-dim)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-214",
"text": "The 7 human poses refer to 7 clusters obtained by clustering all the human pose vectors (concatenation of (x, y) locations and global angles of all 15 deformable parts of human body) in the training set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-215",
"text": "We extract these 165-dim holistic features for the complete scene and for four quadrants, and concatenate them together to create a 825-dim vector."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-216",
"text": "These holistic image features are similar to decaf features for real images, which are good at capturing what is present where, but (1) do not attend to different parts of the image based on the questions, and (2) may not be capturing intricate interactions between objects."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-217",
"text": "Comparing our model to SOTA Q+Tuple+H-IMG quantifies the improvement in performance by attending to specific regions in the image as dictated by the question being asked, and explicitly capturing the interactions between the relevant objects in the scene."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-218",
"text": "In other words, we quantify the improvement in performance obtained by pushing for a deeper understanding of the image than generic global image descriptors."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-219",
"text": "Thus, we name our model Q+Tuple+A-IMG, where A is for attention."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-220",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-221",
"text": "**EVALUATION ON THE ORIGINAL (UNBALANCED) DATASET**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-222",
"text": "In this subsection, we train all models on the train splits of both the unbalanced and balanced datasets, and test on our unbalanced test set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-223",
"text": "The results are shown in Table 1 ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-224",
"text": "We draw the following key inferences: Vision helps."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-225",
"text": "We observe that models that utilize visual information tend to perform better than \"blind\" model when trained on the balanced dataset."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-226",
"text": "This is because the lack of strong language priors in the balanced dataset forces the models to focus on the visual understanding."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-227",
"text": "Attending to specific regions is important."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-228",
"text": "When trained on the balanced set where visual understanding is critical, our proposed model Q+Tuple+A-IMG, which focuses only on a specific region in the scene, outperforms all the baselines by a large margin."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-229",
"text": "Bias is exploited."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-230",
"text": "As expected, the performance of all models trained on unbalanced dataset is better than the balanced dataset, because these models learn the language biases while training on unbalanced dataset, which are also present in the unbalanced test set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-231",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-232",
"text": "**EVALUATION ON THE BALANCED DATASET**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-233",
"text": "We also evaluate all models trained on the train splits of both the unbalanced and balanced datasets, by testing on the balanced test set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-234",
"text": "The results are summarized in Table 2 ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-235",
"text": "Here are the observations from this experiment: Training on balanced is better."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-236",
"text": "It is clear from Table 2 that both language+vision models trained on balanced data perform better than the models trained on unbalanced data."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-237",
"text": "This may be because the models trained on balanced data have to learn to extract visual information to answer the question correctly, since they are no longer able to exploit Figure 4 : Qualitative results of our approach."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-260",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-238",
"text": "We show input questions, complementary scenes that are subtle (semantic) perturbations of each other, along with tuples extracted by our approach, and objects in the scenes that our model chooses to attend to while answering the question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-239",
"text": "Primary object is shown in red and secondary object is in blue."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-240",
"text": "language biases in the training set."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-241",
"text": "Where as models trained on the unbalanced set are blindsided into learning strong language priors, which are then not available at test time."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-242",
"text": "Blind models perform close to chance."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-243",
"text": "As expected, when trained on unbalanced dataset, the \"blind\" model's performance is significantly lower on the balanced dataset (66%) than on unbalanced (79%)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-244",
"text": "Note that the accuracy is higher than 50% because this is not binary classification accuracy but the VQA accuracy [2] , which provides partial credit when there is inter-human disagreement in the ground-truth answers."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-245",
"text": "Attention helps."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-246",
"text": "When trained on balanced dataset (where language biases are absent), our model Q+Tuple+A-IMG is able to outperform all baselines by a significant margin."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-247",
"text": "Specifically, our model gives improvement in performance relative to the state-of-the-art VQA model from [2] (Q+Tuple+H-IMG), showing that attending to relevant regions and describing them in detail helps, as also seen in Sec. 5.2."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-248",
"text": "Role of balancing."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-249",
"text": "We see clear improvements by reason-ing about vision in addition to language."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-250",
"text": "Note that in addition to the lack of language bias, the visual reasoning is also harder on the balanced dataset because now there are pairs of scenes with fine-grained differences but with opposite answers to the same question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-251",
"text": "So the model really needs to understand the subtle details of the scene to answer questions correctly."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-252",
"text": "Clearly, there is a lot of room for improvement and we hope our balanced dataset will encourage more future work on detailed understanding of visual semantics towards the goal of accurately answering questions about images."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-253",
"text": "Classifying a pair of complementary scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-254",
"text": "We experiment with an even harder setting -a test point consists of a pair of complementary scenes and the associated question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-255",
"text": "Recall, that by construction, the answer to the question is \"yes\" for one image in the pair, and \"no\" for the other."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-256",
"text": "This test point is considered to be correct only when the model is able to predict both its answers correctly."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-257",
"text": "Since language-only models only utilize the textual information in the question ignoring the image, and therefore, predict the same answer for both scenes, their accuracy is zero in this setting 3 ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-258",
"text": "The results of the baselines and our model, trained on balanced and unbalanced datasets, are shown in Table 3 ."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-259",
"text": "We observe that our model trained on the balanced dataset performs the best. And again, our model that focuses on relevant regions in the image to answer the question outperforms the state-of-the-art approach of [2] (Q+Tuple+H-IMG) that does not model attention."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-261",
"text": "**ANALYSIS**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-262",
"text": "Our work involves three steps: tuple extraction, tuple and object alignment, and question answering."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-263",
"text": "We conduct analyses of these three stages to determine the importance of each of the three stages."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-264",
"text": "We manually inspected a random subset of questions, and found the tuple extraction to be accurate 86.3% of the time."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-265",
"text": "Given perfect tuple extraction, the alignment step is correct 95% of the time."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-266",
"text": "Given perfect tuple extraction and alignment, our approach achieves VQA accuracy of 81.06% as compared to 79.2% with imperfect tuple extraction and alignment."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-267",
"text": "Thus, \u223c2% in VQA accuracy is lost due to imperfect tuple extraction and alignment."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-268",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-269",
"text": "**ABLATION STUDY**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-270",
"text": "We conducted an ablation study to analyze the importance of the two kinds of language features-LSTM for question vs. word2vec for tuple."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-271",
"text": "For \"blind\" (language only) models trained and tested on unbalanced datasets, we found that the combination (Q+Tuple) performs better than each of the individual methods."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-272",
"text": "Specifically, Q+Tuple achieves a VQA accuracy of 78.9% as compared to 77.87% (Q-only) and 77.54% (Tuple-only)."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-273",
"text": "Fig. 4 shows qualitative results for our approach."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-274",
"text": "We show a question and two complementary scenes with opposite answers."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-275",
"text": "We find that even though pairs of scenes with opposite ground truth answers to the same questions are visually similar, our model successfully predicts the correct answers for both scenes."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-276",
"text": "Further, we see that our model has learned to attend to the regions of the scene that seem to correspond to the regions that are most relevant to answering the question at hand."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-277",
"text": "The ability to (correctly) predict different answers to scenes that are subtle (semantic) perturbations of each other demonstrates visual understanding."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-278",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-279",
"text": "**QUALITATIVE RESULTS**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-280",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-281",
"text": "**DISCUSSION**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-282",
"text": "The idea of balancing a dataset can be generalized to real images."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-283",
"text": "For instance, we can ask MTurk workers to find images with different answers for a given question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-284",
"text": "The advantage with clipart is that it lets us make the complementary scenes very fine-grained forcing the models to learn subtle differences in visual information."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-285",
"text": "The differences in complementary real images will be coarser and therefore easier for visual models."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-286",
"text": "Overall, there is a trade-off between clipart and real images."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-287",
"text": "Clipart is easier (trivial) for low-level recognition tasks, but is more difficult balanced dataset because it can introduce fine-grained semantic differences."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-288",
"text": "Real is more difficult for low-level recognition tasks, but may be an easier balanced dataset because it will have coarse semantic differences."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-289",
"text": "----------------------------------"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-290",
"text": "**CONCLUSION**"
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-291",
"text": "In this paper, we take a step towards the AI-complete task of Visual Question Answering."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-292",
"text": "Specifically, we tackle the problem of answering binary questions about images."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-293",
"text": "We balance the existing abstract binary VQA dataset by augmenting the dataset with complementary scenes, so that nearly all questions in the balanced dataset have an answer \"yes\" for one scene and an answer \"no\" for another closely related scene."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-294",
"text": "For an approach to perform well on this balanced dataset, it must understand the image."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-295",
"text": "We will make our balanced dataset publicly available."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-296",
"text": "We propose an approach that extracts a concise summary of the question in a tuple form, identifies the region in the scene it should focus on, and verifies the existence of the visual concept described in the question tuple to answer the question."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-297",
"text": "Our approach outperforms the language prior baseline and a state-of-the-art VQA approach by a large margin on the balanced dataset."
},
{
"sent_id": "809ad258132199e3eae8add5d1bfdf-C001-298",
"text": "We also present qualitative results showing that our approach attends to relevant parts of the scene in order to answer the question."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"809ad258132199e3eae8add5d1bfdf-C001-32"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-35",
"809ad258132199e3eae8add5d1bfdf-C001-36",
"809ad258132199e3eae8add5d1bfdf-C001-37"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-59"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-63",
"809ad258132199e3eae8add5d1bfdf-C001-64"
]
],
"cite_sentences": [
"809ad258132199e3eae8add5d1bfdf-C001-32",
"809ad258132199e3eae8add5d1bfdf-C001-35",
"809ad258132199e3eae8add5d1bfdf-C001-37",
"809ad258132199e3eae8add5d1bfdf-C001-59",
"809ad258132199e3eae8add5d1bfdf-C001-63",
"809ad258132199e3eae8add5d1bfdf-C001-64"
]
},
"@USE@": {
"gold_contexts": [
[
"809ad258132199e3eae8add5d1bfdf-C001-49"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-51"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-91"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-182"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-244"
]
],
"cite_sentences": [
"809ad258132199e3eae8add5d1bfdf-C001-49",
"809ad258132199e3eae8add5d1bfdf-C001-51",
"809ad258132199e3eae8add5d1bfdf-C001-91",
"809ad258132199e3eae8add5d1bfdf-C001-182",
"809ad258132199e3eae8add5d1bfdf-C001-244"
]
},
"@EXT@": {
"gold_contexts": [
[
"809ad258132199e3eae8add5d1bfdf-C001-91",
"809ad258132199e3eae8add5d1bfdf-C001-92"
]
],
"cite_sentences": [
"809ad258132199e3eae8add5d1bfdf-C001-91"
]
},
"@SIM@": {
"gold_contexts": [
[
"809ad258132199e3eae8add5d1bfdf-C001-212"
]
],
"cite_sentences": [
"809ad258132199e3eae8add5d1bfdf-C001-212"
]
},
"@DIF@": {
"gold_contexts": [
[
"809ad258132199e3eae8add5d1bfdf-C001-247"
],
[
"809ad258132199e3eae8add5d1bfdf-C001-259"
]
],
"cite_sentences": [
"809ad258132199e3eae8add5d1bfdf-C001-247",
"809ad258132199e3eae8add5d1bfdf-C001-259"
]
}
}
},
"ABC_60b0b54af27a6b04a6708a60834952_3": {
"x": [
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-2",
"text": "Neural approaches to automated essay scoring have recently shown state-of-theart performance."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-3",
"text": "The automated essay scoring task typically involves a broad notion of writing quality that encompasses content, grammar, organization, and conventions."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-4",
"text": "This differs from the short answer content scoring task, which focuses on content accuracy."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-5",
"text": "The inputs to neural essay scoring models -ngrams and embeddings -are arguably well-suited to evaluate content in short answer scoring tasks."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-6",
"text": "We investigate how several basic neural approaches similar to those used for automated essay scoring perform on short answer scoring."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-7",
"text": "We show that neural architectures can outperform a strong nonneural baseline, but performance and optimal parameter settings vary across the more diverse types of prompts typical of short answer scoring."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-10",
"text": "Deep neural network approaches have recently been successfully developed for several educational applications, including automated essay assessment."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-11",
"text": "In several cases, neural network approaches exceeded the previous state of the art on essay scoring (Taghipour and Ng, 2016) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-12",
"text": "The task of automated essay scoring (AES) is generally different from the task of automated short answer scoring (SAS)."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-13",
"text": "Essay scoring generally focuses on writing quality, a multidimensional construct that includes ideas and elaboration, organization, style, and writing conventions such as grammar and spelling (Burstein et al., 2013) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-14",
"text": "Short answer scoring, by contrast, typically focuses only on the accuracy of the content of responses (Burrows et al., 2015) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-15",
"text": "Analyzing the rubrics of prompts from the Automated Student Assessment Prize shared tasks on AES and SAS, while there is some overlap across essay scoring and short answer scoring, there are three main dimensions of differences:"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-16",
"text": "1. Response length."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-17",
"text": "Responses in SAS tasks are typically shorter."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-18",
"text": "For example, while the ASAP-AES data contains essays that average between about 100 and 600 tokens (Shermis, 2014), short answer scoring datasets may have average answer lengths of just several words (Basu et al., 2013) to almost 60 words (Shermis, 2015) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-19",
"text": "2. Rubrics focus on content only in SAS vs. broader writing quality in AES."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-20",
"text": "3. Purpose and genre."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-21",
"text": "AES tasks cover persuasive, narrative, and source-dependent reading comprehension and English Language Arts (ELA), while SAS tasks tend to be from science, math, and ELA reading comprehension."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-22",
"text": "Given these differences, the feature sets for AES and SAS systems are often different, with AES incorporating a larger set of features to capture writing quality (Shermis and Hamner, 2013) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-23",
"text": "Nevertheless, deep learning approaches to AES have thus far demonstrated strong performance with minimal inputs consisting of unigrams and word embeddings."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-24",
"text": "For example, Taghipour and Ng (2016) explore simple LSTM and CNN-based architectures with regression and evaluate on the ASAP-AES data."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-25",
"text": "Alikaniotis et al. (2016) train score-specific word embeddings with several LSTM architectures."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-26",
"text": "Dong and Zhang (2016) demonstrate that a hierarchical CNN architecture produces strong results on the ASAP-AES data."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-27",
"text": "Recently, Zhao et al. (2017) show state-of-the-art performance on the ASAP-AES dataset with a memory network architecture."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-28",
"text": "In this work, we investigate whether deep neural network approaches with similarly minimal feature sets can produce good performance on the SAS task, including whether they can exceed a strong non-neural baseline."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-29",
"text": "Unigram embeddingbased neural network approaches to essay scoring capture content signals from their input features, but the extent to which they capture other aspects of writing quality rubrics has not been established."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-30",
"text": "These approaches as implemented would seem to lend themselves even better to the purely content-focused rubrics in SAS, where content signals should dominate in achieving good humanmachine agreement."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-31",
"text": "On the other hand, recurrent neural networks may derive some of their predictive power in AES from more redundant signals in longer input sequences (as sketched by Taghipour and Ng (2016) )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-32",
"text": "As a result, the shorter responses in SAS may hinder the ability of recurrent networks to achieve state-of-the-art results."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-33",
"text": "To explore the effectiveness of neural network architectures on SAS, we use the basic architecture and parameters of Taghipour and Ng (2016) on three publicly available short answer datasets: ASAP-SAS (Shermis, 2015), Powergrading (Basu et al., 2013) , and SRA (Dzikovska et al., 2016 (Dzikovska et al., , 2013 ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-34",
"text": "While these datasets differ with respect to the length and complexity of student responses, all prompts in the datasets focus on content accuracy."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-35",
"text": "We explore how well the optimal parameters for AES from Taghipour and Ng (2016) fare on these datasets, and whether different architectures and parameters perform better on the SAS task."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-37",
"text": "**DATASETS**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-38",
"text": "The three datasets we use cover different kinds of prompts and vary considerably in the length of the answers as well as their well-formedness."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-39",
"text": "Table 1 shows basic statistics for each dataset."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-40",
"text": "Figures 1, 2 and 3 show examples for each of the datasets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-42",
"text": "**ASAP-SAS**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-43",
"text": "The Automated Student Assessment Prize Short Answer Scoring (ASAP-SAS) dataset 1 contains 10 individual prompts, covering science, biology, 1 https://www.kaggle.com/c/asap-sas and ELA."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-44",
"text": "The prompts were administered to U.S. high school students in several state-level assessments."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-45",
"text": "Each prompt has an average of 2,200 individual responses, typically consisting of one or a few sentences."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-46",
"text": "Responses are scored by two human annotators on a scale from 0 to 2 or 0 to 3 depending on the prompt (Shermis, 2015) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-47",
"text": "Following the guidelines from the Kaggle competition, we always use the score assigned by the first annotator."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-49",
"text": "**POWERGRADING**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-50",
"text": "The Powergrading dataset (Basu et al., 2013) contains 10 individual prompts from U.S. immigration exams with about 700 responses each."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-51",
"text": "Each prompt is accompanied by one or more reference responses."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-76",
"text": "\u2022 0 Table 1 : Overview of the datasets used in this work."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-52",
"text": "As responses are very short (typically a few words -see Figure 2 ) and because the percentage of correct responses is very high, responses in the Powergrading dataset are to some extent repetitive."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-53",
"text": "The Powergrading dataset tests models' ability to perform well on extremely short responses."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-54",
"text": "The Powergrading dataset was originally used for the task of (unsupervised) clustering (Basu et al., 2013) , so that there are no state-ofthe-art scoring results available for this dataset."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-55",
"text": "For simplicity, we use the first out of three binary human-annotated correctness scores."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-57",
"text": "**SRA**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-58",
"text": "The SRA dataset (Dzikovska et al., 2012) became widely known as the dataset used in SemEval-2013 Shared Task 7 \"The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge\" (Dzikovska et al., 2013) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-59",
"text": "It consists of two subsets: Beetle, with student responses from interacting with a tutorial dialogue system, and SciEntsBank (SEB) with science assessment questions."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-60",
"text": "We use two label sets from the shared task: the 2-way labels classify responses as correct or incorrect, while the 5-way labels provide a more fine-grained classification of responses into the categories non domain, correct, partially correct incomplete, contradictory and irrelevant."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-61",
"text": "In contrast with most SAS datasets, the SRA dataset contains a large number of prompts and with relatively few responses per prompt (see Table 1 )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-62",
"text": "Following the procedure from the shared task, we train models for each SRA dataset (Beetle, SEB) across all responses to all prompts."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-63",
"text": "ASAP -Prompt 1 QUESTION: After reading the groups procedure, describe what additional information you would need in order to replicate the experiment."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-64",
"text": "Make sure to include at least three pieces of information."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-66",
"text": "**SCORING RUBRIC FOR A 3 POINT RESPONSE:**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-67",
"text": "The response is an excellent answer to the question."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-68",
"text": "It is correct, complete, and appropriate and contains elaboration, extension, and/or evidence of higher-order thinking and relevant prior knowledge."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-69",
"text": "There is no evidence of misconceptions."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-70",
"text": "Minor errors will not necessarily lower the score."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-71",
"text": "STUDENT RESPONSES:"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-72",
"text": "\u2022 3 points: Some additional information you will need are the material."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-73",
"text": "You also need to know the size of the contaneir to measure how the acid rain effected it. You need to know how much vineager is used for each sample."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-74",
"text": "Another thing that would help is to know how big the sample stones are by measureing the best possible way."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-75",
"text": "\u2022 1 point: After reading the expirement, I realized that the additional information you need to replicate the expireiment is one, the amant of vinegar you poured in each container, two, label the containers before you start yar expirement and three, write a conclusion to make sure yar results are accurate."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-77",
"text": "Since we train prompt-specific models for ASAP-SAS and PG, we report the mean number of responses per set per prompt."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-78",
"text": "For SRA, we train one model per label set across prompts and report the overall number of prompts per set as well as the mean number of responses per prompt per set (in parentheses)."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-80",
"text": "**EXPERIMENTS 3.1 METHOD**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-81",
"text": "We carried out a series of experiments across datasets to discern the effect of specific parameters in the SAS setting."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-82",
"text": "We took the best parameter set from Taghipour and Ng (2016) as our reference since it performed best on the AES data."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-83",
"text": "We looked at the effect of varying several important parameters to discern the effectiveness of each for SAS:"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-84",
"text": "\u2022 the role of the mean-over-time layer, which was crucial for good performance in Taghipour and Ng (2016) \u2022 the utility of pretrained embeddings"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-85",
"text": "\u2022 the contribution of features derived from a convolutional layer"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-86",
"text": "\u2022 the needs for network representational capacity via recurrent hidden layer size"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-87",
"text": "\u2022 the role of bidirectional architectures for short response lengths"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-88",
"text": "\u2022 regression versus classification"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-89",
"text": "\u2022 the effect of attention"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-90",
"text": "To explore the effect of specific parameters, we trained models on the training set and evaluated on the development set only."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-91",
"text": "Following these experiments, we trained a model on the training and development sets and evaluated on the test set."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-92",
"text": "We report prompt-level results for this model in Section 3.6."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-93",
"text": "For evaluation, we use quadratic weighted kappa (QWK) for the ASAP-SAS and Powergrading datasets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-94",
"text": "Because the class labels in the SRA dataset are unordered, we report the weighted F1 score, which was the preferred metric in the Semeval shared task (Dzikovska et al., 2016) ."
},
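For reference, quadratic weighted kappa can be computed directly from two lists of integer scores. The following is a minimal sketch of the standard formula, not the authors' evaluation code:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two raters, penalizing disagreements quadratically by distance."""
    n = max_score - min_score + 1
    # Observed confusion matrix between the two raters
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_score, b - min_score] += 1
    # Quadratic disagreement weights
    W = np.array([[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)])
    # Expected matrix under independence of the two score distributions
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0, chance-level agreement 0.0, and systematic disagreement negative values.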
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-96",
"text": "**BASELINE**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-97",
"text": "As a baseline system, we use a supervised learner based on a hand-crafted feature set."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-98",
"text": "This baseline is based on DkPro TC (Daxenberger et al., 2014) and relies on support vector classification using Weka (Hall et al., 2009 )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-99",
"text": "We preprocess the data using the ClearNlp Segmenter 2 via DKPro Core (Eckart de Castilho and Gurevych, 2014) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-100",
"text": "The features used in the baseline system comprise a commonly used and effective feature set for the SAS task."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-101",
"text": "We use both binary word and character uni-to trigram occurrence features, using the top 10,000 most frequent ngrams in the training data, as well as answer length, measured by the number of tokens in a response."
},
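A minimal sketch of this feature set, assuming whitespace tokenization and a small illustrative vocabulary (the actual baseline is built with DKPro TC and Weka, and the real vocabulary is the top 10,000 n-grams from training data):

```python
def word_ngrams(tokens, n_max=3):
    """All word uni- to n_max-grams of a token sequence, as a set."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def char_ngrams(text, n_max=3):
    """All character uni- to n_max-grams of a string, as a set."""
    return {text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)}

def featurize(response, vocab):
    """Binary n-gram indicator features plus a token-count length feature."""
    tokens = response.lower().split()
    grams = word_ngrams(tokens) | char_ngrams(response.lower())
    return [1.0 if g in grams else 0.0 for g in vocab] + [float(len(tokens))]
```

The resulting vectors would feed a support vector classifier, one model per prompt.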
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-103",
"text": "**NEURAL NETWORKS**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-104",
"text": "We work with the basic neural network architecture explored by Taghipour and Ng (2016) (Figure 4 )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-105",
"text": "3 First, the word tokens of each response are converted to embeddings."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-106",
"text": "Optionally, features are extracted from the embeddings by a convolutional network layer."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-107",
"text": "This output forms the input to an LSTM layer."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-108",
"text": "The hidden states of the LSTM are aggregated in either a \"mean-over-time\" (MoT) layer or attention layer."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-109",
"text": "The MoT layer simply averages the hidden states of the LSTM across the input."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-110",
"text": "We use the same attention mechanism employed in Taghipour and Ng (2016) , which involves taking the dot product of each LSTM hidden state and a vector that is trained with the network."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-111",
"text": "The aggregation layer output is a single vector, which is input to a fully connected layer."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-112",
"text": "This layer computes a scalar (regression) or class label (classification)."
},
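The two aggregation options can be sketched in NumPy over a matrix of LSTM hidden states (a simplified illustration, not the authors' implementation):

```python
import numpy as np

def mean_over_time(H):
    """MoT layer: average the LSTM hidden states across all timesteps."""
    return H.mean(axis=0)  # H has shape (timesteps, hidden_dim)

def attention_pool(H, v):
    """Attention: dot each hidden state with a learned vector v, softmax, weighted sum."""
    scores = H @ v
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ H
```

With v at zero initialization the attention weights are exactly uniform (so the pooled vector equals the MoT output); training shapes v so that informative timesteps dominate the pooled vector.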
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-114",
"text": "**SETUP, TRAINING, AND EVALUATION**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-115",
"text": "The text is lightly preprocessed as input to the neural networks following Taghipour and Ng (2016) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-116",
"text": "The text is tokenized with the standard NLTK tokenizer and lowercased."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-117",
"text": "All numbers are mapped to a single symbol."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-118",
"text": "4 Each response is padded with a dummy token to uniform length, but these dummy tokens are masked out during model training."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-119",
"text": "For the ASAP-SAS and Powergrading datasets, prior to training, we scale all scores of responses to [0, 1] and use these scaled scores as input to the networks."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-120",
"text": "For evaluation, the scaled scores are converted back to their original range."
},
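These steps can be sketched as follows, using whitespace tokenization and a simple number regex as stand-ins for the NLTK tokenizer:

```python
import re

NUM, PAD = "<num>", "<pad>"

def preprocess(text, max_len):
    """Lowercase, map numbers to a single symbol, pad to uniform length with a mask."""
    toks = [NUM if re.fullmatch(r"\d+(\.\d+)?", t) else t for t in text.lower().split()]
    mask = [1] * len(toks) + [0] * (max_len - len(toks))
    return (toks + [PAD] * (max_len - len(toks)))[:max_len], mask[:max_len]

def scale_score(score, lo, hi):
    """Map a raw score in [lo, hi] to [0, 1] for regression training."""
    return (score - lo) / (hi - lo)

def unscale_score(y, lo, hi):
    """Map a predicted value in [0, 1] back to the original integer score range."""
    return int(round(lo + y * (hi - lo)))
```

For an ASAP-SAS prompt scored 0 to 3, a gold score of 2 is trained against 2/3 and a prediction near 0.67 is evaluated as 2.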
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-121",
"text": "The SRA class labels are used as is."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-122",
"text": "We fix a number of neural network parame-4 It may be the case that relevant content information is thus ignored."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-123",
"text": "However, since many numbers occur with units of measurement, e.g. 1g, we do not have word embeddings for them either and so the embeddings would simply be random initializations."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-124",
"text": "We leave a full exploration of this issue to future work."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-125",
"text": "ters for our experiments."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-126",
"text": "For pretrained embeddings, in preliminary experiments the GloVe 100 dimension vectors (Pennington et al., 2014) performed slightly better than a selection of other offthe-shelf embeddings, and hence we use these for all conditions that involve pretrained embeddings."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-127",
"text": "Embeddings for word tokens that are not found in the embeddings are randomly initialized from a uniform distribution."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-128",
"text": "The convolutional layer uses a window length of 3 or 5 and 50 filters."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-129",
"text": "We use a mean squared error loss for regression models and a cross-entropy loss for classification models."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-130",
"text": "To train the network, we use RMSProp with \u03c1 set to 0.9 and learning rate of 0.001."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-131",
"text": "We clip the norm of the gradient to 10."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-132",
"text": "The fully connected layer's bias is initialized to the mean score for the training data, and the layer is regularized with dropout of 0.5."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-133",
"text": "We use a batch size of 32, which provided a good compromise between performance and runtime in preliminary experiments."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-134",
"text": "To obtain more consistent results and improve predictive performance, we evaluate the models by keeping an exponential moving average of the model's weights during training."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-135",
"text": "The moving average weights w EM A are updated after each batch by"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-136",
"text": "d is a decay rate that is updated dynamically at each batch by taking into account the number of batches so far:"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-138",
"text": "**MIN(DECAY, (1 + #BATCHES)/(10 + #BATCHES))**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-139",
"text": "where decay is a maximum decay rate, which we set to 0.999."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-140",
"text": "This decay rate updating procedure allows the weights to be updated quickly at first while stabilizing across time."
},
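A minimal sketch of this moving-average procedure, with scalar weights for illustration (in practice the update applies elementwise to each weight array):

```python
class WeightEMA:
    """Exponential moving average of model weights with a warm-up decay schedule."""

    def __init__(self, weights, max_decay=0.999):
        self.shadow = dict(weights)  # initialized to the current weight values
        self.max_decay = max_decay
        self.num_batches = 0

    def update(self, weights):
        self.num_batches += 1
        # Decay ramps up from ~0.18 on the first batch toward max_decay over time,
        # so early batches move the average quickly and later batches stabilize it
        d = min(self.max_decay, (1 + self.num_batches) / (10 + self.num_batches))
        for name, w in weights.items():
            self.shadow[name] = d * self.shadow[name] + (1 - d) * w
```

At evaluation time, the shadow weights are substituted for the raw trained weights.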
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-141",
"text": "All models are trained for 50 epochs for parameter exploration on the development set (Section 3.5) and 50 epochs for the final models on the test set (Section 3.6)."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-142",
"text": "Following Taghipour and Ng (2016) , for our parameter exploration experiments on the development set, we report the best performance across epochs."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-143",
"text": "When we train final models on the combined training and development set and evaluate on the test set, we report the results from the last epoch."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-144",
"text": "During development, we observed that even after employing best practices for ensuring repro-ducibility of results 5 , there was still some small variation between runs of the same parameter settings."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-145",
"text": "The reasons for this variability were not clear."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-147",
"text": "**PARAMETER EXPLORATION RESULTS**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-148",
"text": "Our focus in this section is comparing different architecture and parameter choices for the neural networks with the best parameters from Taghipour and Ng (2016) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-149",
"text": "Table 2 shows the results of our experiments on the development set for ASAP-SAS and Powergrading, and Table 3 shows the corresponding results for SRA."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-150",
"text": "Does the mean-over-time layer improve performance?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-151",
"text": "Taghipour and Ng (2016) demonstrate a large performance gain with the mean-overtime layer that averages the LSTM hidden states across the response tokens."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-152",
"text": "Comparing \"T&N best\" with \"no MoT\" across the datasets, we see mixed results."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-153",
"text": "The mean-over-time layer performs relatively well across datasets, but achieves the best results only on the SRA-SEB dataset."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-154",
"text": "We hypothesized that the mean-over-time layer is helpful when the input consists of longer responses (as was the case for the essay data in Taghipour and Ng (2016) )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-155",
"text": "We computed the Pearson's correlation on the ASAP-SAS data between the difference on each prompt of the two conditions and the mean response length in the development set."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-156",
"text": "However, the correlation was modest at 0.437."
},
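This correlation check is a standard Pearson's r computation; a minimal sketch (the per-prompt inputs themselves are not reproduced here):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```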
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-157",
"text": "Do pretrained embeddings with tuning outperform fixed or randomly initialized embeddings?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-158",
"text": "On all datasets, the pretrained embeddings with tuning (among the \"T&N best\" parameters) performed better than fixed pretrained or learned embeddings."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-159",
"text": "6 Tuned embeddings were especially important for the ASAP-SAS and Powergrading datasets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-160",
"text": "Does a convolutional layer produce useful features for the SAS task?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-161",
"text": "The results for convolutional features are mixed: convolutional features contribute small performance improvements on Powergrading and one of the SRA label sets (SRA SEB 2-way)."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-162",
"text": "Can smaller hidden layers be used for the SAS task?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-163",
"text": "Although LSTMs with smaller hidden states often outperformed the 300-dimensional LSTM in the T&N best parameter set (compare 'T&N best' performance with performance for 'LSTM dims' conditions), the improvements were all quite small."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-164",
"text": "Do bidirectional LSTMs improve performance?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-165",
"text": "Bidirectional LSTM architectures produced solid gains over the T&N best parameters on ASAP-SAS, Powergrading, and two of the four SRA label sets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-166",
"text": "Can classification improve performance?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-167",
"text": "The T&N model used regression."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-168",
"text": "While the labels in SRA allow only for classification, ASAP-SAS and PG work with both regression and classification."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-169",
"text": "However, we found consistently better results using regression."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-170",
"text": "Can attention improve performance?"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-171",
"text": "The attention mechanism we considered in this paper yielded strong performance improvements over the mean-over-time layer on all datasets except SRA-SEB 5-way."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-172",
"text": "The largest improvements were on Powergrading and SRA-Beetle 5-way, where increases were almost 3 points weighted F1."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-173",
"text": "We also report the results of the combinations of individual parameters that performed well on the development data at the bottom of Table 2 and Table 3 ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-174",
"text": "While these combinations performed better than any individual parameter variation on ASAP-SAS and Powergrading, the combination performed worse on three of the four label sets in the SRA data."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-175",
"text": "These results underscore that these parameters do not always produce additive effects in practice."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-176",
"text": "We examined the predictions from the baseline system and the T&N system for the ASAP-SAS development set and conducted a brief error analysis."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-177",
"text": "In general, across the 10 prompts, it can be observed that when the baseline system is incorrect it tends to under-predict the scores, whereas the T&N system tends to slightly over-predict scores when it is incorrect."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-178",
"text": "These effects are typically small, but consistent."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-179",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-180",
"text": "**TEST PERFORMANCE**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-181",
"text": "We selected the top parameter settings on the development set and trained models on the full training set (i.e. training and development sets) for each dataset:"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-182",
"text": "\u2022 ASAP-SAS: 250-dimensional bidirectional LSTM, attention mechanism Table 2 : Parameter experiment results on ASAP-SAS and Powergrading on the development set."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-183",
"text": "\"Baseline\" is the baseline non-neural system. \"T&N best\" is the best-performing parameter set in Taghipour and Ng (2016) : tuned embeddings (here, GLOVE 100 dimensions), 300-dimensional LSTM, unidirectional, mean-over-time layer."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-184",
"text": "Scores are bolded if they outperform the score for the \"T&N best\" parameter setting."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-185",
"text": "\u2022 Powergrading: CNN features with window length 5, 150-dimensional bidirectional LSTM, attention mechanism"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-186",
"text": "\u2022 SRA: Because of the decreased performance of the combined best individual parameters on the development data, we use a 300-dimensional unidirectional LSTM with attention mechanism."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-187",
"text": "These models are \"T&N tuned\" in Table 4 , which appear along with the non-neural baseline system."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-188",
"text": "On ASAP-SAS, the \"T&N tuned\" parameter configuration outperformed the baseline system and the \"T&N best\" parameters."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-189",
"text": "The tuned system does not reach the state-of-the-art Fisher-transformed mean score on the ASAP-SAS dataset (Ramachandran et al., 2015) 7 , which, like the winner of the ASAP-SAS competition (Tandalla, 2012) , employed prompt-specific regular expressions."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-190",
"text": "Other top performing systems used prompt-specific preprocessing and ensemble-based approaches over rich feature spaces (Higgins et al., 2014) ."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-191",
"text": "On the Powergrading dataset, the \"T&N tuned\" system did not match the performance of the baseline system, consistent with the results on the development set (Table 2 )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-192",
"text": "It appears that on the very short and redundant data in this dataset, the character-and n-gram based system can learn somewhat more efficiently than the neural systems."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-193",
"text": "On the SRA datasets, the \"T&N tuned\" model outperformed the baseline and the \"T&N best\" settings on average across prompts, by a larger margin than the other datasets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-194",
"text": "On the SRA data, as on the ASAP-SAS data, a gap remains between the tuned model's performance and the state of the art."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-195",
"text": "On SRA, this may be partly due to the use of \"question indicator\" features by the top performing systems (Heilman and Madnani, 2013; Ott et al., 2013 )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-196",
"text": "The performance improvement over the baseline system was larger on the development sets than on the test sets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-197",
"text": "Part of the reason for this is that the test set evaluation procedure likely did not choose the best-performing epoch for the neural models."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-198",
"text": "Table 3 : Parameter experiment results on SRA datasets on the development set."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-199",
"text": "\"wF1\" is the weighted F1 score."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-200",
"text": "\"Baseline\" is the baseline non-neural system. \"T&N best\" is the best-performing parameter set in Taghipour & Ng (2016) : tuned embeddings (here, GLOVE 100 dimensions), 300-dimensional LSTM, unidirectional, mean-over-time layer."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-201",
"text": "Scores are bolded if they outperform the score for the \"T&N best\" parameter setting."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-202",
"text": "----------------------------------"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-203",
"text": "**DISCUSSION**"
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-204",
"text": "Our results establish that the basic neural architecture of pretrained embeddings with tuning across model training and LSTMs is a reasonably effective architecture for the short answer content scoring task."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-205",
"text": "The architecture performs well enough to exceed a non-neural content scoring baseline system in most cases."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-206",
"text": "Given the diversity of prompts in SAS, there was a good deal of variation in the effectiveness of parameter choices in this neural architecture."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-207",
"text": "Still, some basic trends emerged."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-208",
"text": "First, pretrained embeddings tuned across model training were crucial for competitive performance on most datasets."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-209",
"text": "Second, neural models for SAS generally benefit from similar size hidden dimensions as models for AES."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-210",
"text": "Only the Powergrading dataset, with very short answers and a small vocabulary for each prompt, benefitted from a significantly smaller LSTM dimensionality."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-211",
"text": "The relationship between task, rubrics, vocabulary size, and the representational capacity of neural models for SAS need further exploration."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-212",
"text": "Third, a mean-over-time aggregation mechanism on top of the LSTM generally performed well, but notably this mechanism was not nearly as important as in the AES task."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-213",
"text": "Mean-over-time produced competitive results on many prompts, but contrary to Taghipour and Ng (2016) , bidirectional LSTMS and attention produced some of the best results, which is consistent with results for neural models on other text classification tasks (e.g., Longpre et al. (2016) )."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-214",
"text": "Research is needed to explain these emerging differences in effective neural architectures for AES vs. SAS, including model-specific factors such as the interaction of an LSTM's integration of features over time and the redundancy of predictive signals in essays vs. short answers, along with data-specific factors such as the consistency of human scoring, the demands of different rubrics, and the homogeneity or diversity of prompts in each setting."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-215",
"text": "At the same time, different from the AES task, the family of neural architectures explored here needs further augmenting to achieve state-of-the-art results on the SAS task."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-216",
"text": "Moreover, more experiments are needed to document how well neural systems perform relative to highly optimized non-neural systems."
},
{
"sent_id": "60b0b54af27a6b04a6708a60834952-C001-217",
"text": "While further parameter optimizations and different architectures may yield better results, it may be the case that the SAS task of content scoring with relatively short response sequences requires neural approaches to employ a larger set of features (Pado, 2016) or a greater level of prompt-specific tuning, or pairing with methods from active learning (Horbach and Palmer, 2016) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"60b0b54af27a6b04a6708a60834952-C001-11"
],
[
"60b0b54af27a6b04a6708a60834952-C001-24"
],
[
"60b0b54af27a6b04a6708a60834952-C001-31"
],
[
"60b0b54af27a6b04a6708a60834952-C001-154"
],
[
"60b0b54af27a6b04a6708a60834952-C001-183"
],
[
"60b0b54af27a6b04a6708a60834952-C001-200"
]
],
"cite_sentences": [
"60b0b54af27a6b04a6708a60834952-C001-11",
"60b0b54af27a6b04a6708a60834952-C001-24",
"60b0b54af27a6b04a6708a60834952-C001-31",
"60b0b54af27a6b04a6708a60834952-C001-154",
"60b0b54af27a6b04a6708a60834952-C001-183",
"60b0b54af27a6b04a6708a60834952-C001-200"
]
},
"@MOT@": {
"gold_contexts": [
[
"60b0b54af27a6b04a6708a60834952-C001-31",
"60b0b54af27a6b04a6708a60834952-C001-32",
"60b0b54af27a6b04a6708a60834952-C001-33"
]
],
"cite_sentences": [
"60b0b54af27a6b04a6708a60834952-C001-31",
"60b0b54af27a6b04a6708a60834952-C001-33"
]
},
"@USE@": {
"gold_contexts": [
[
"60b0b54af27a6b04a6708a60834952-C001-33"
],
[
"60b0b54af27a6b04a6708a60834952-C001-35"
],
[
"60b0b54af27a6b04a6708a60834952-C001-82"
],
[
"60b0b54af27a6b04a6708a60834952-C001-104"
],
[
"60b0b54af27a6b04a6708a60834952-C001-110"
],
[
"60b0b54af27a6b04a6708a60834952-C001-115"
],
[
"60b0b54af27a6b04a6708a60834952-C001-142"
],
[
"60b0b54af27a6b04a6708a60834952-C001-148"
]
],
"cite_sentences": [
"60b0b54af27a6b04a6708a60834952-C001-33",
"60b0b54af27a6b04a6708a60834952-C001-35",
"60b0b54af27a6b04a6708a60834952-C001-82",
"60b0b54af27a6b04a6708a60834952-C001-104",
"60b0b54af27a6b04a6708a60834952-C001-110",
"60b0b54af27a6b04a6708a60834952-C001-115",
"60b0b54af27a6b04a6708a60834952-C001-142",
"60b0b54af27a6b04a6708a60834952-C001-148"
]
},
"@DIF@": {
"gold_contexts": [
[
"60b0b54af27a6b04a6708a60834952-C001-154",
"60b0b54af27a6b04a6708a60834952-C001-155",
"60b0b54af27a6b04a6708a60834952-C001-156"
],
[
"60b0b54af27a6b04a6708a60834952-C001-213"
]
],
"cite_sentences": [
"60b0b54af27a6b04a6708a60834952-C001-154",
"60b0b54af27a6b04a6708a60834952-C001-213"
]
},
"@FUT@": {
"gold_contexts": [
[
"60b0b54af27a6b04a6708a60834952-C001-213",
"60b0b54af27a6b04a6708a60834952-C001-214"
]
],
"cite_sentences": [
"60b0b54af27a6b04a6708a60834952-C001-213"
]
}
}
},
"ABC_a5f33403d23cdc0532547266f1841a_3": {
"x": [
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-2",
"text": "When reading a text, it is common to become stuck on unfamiliar words and phrases, such as polysemous words with novel senses, rarely used idioms, Internet slang, or emerging entities."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-89",
"text": "Wikidata items with no description or no contexts are ignored."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-3",
"text": "At first, we attempt to figure out the meaning of those expressions from their context, and ultimately we may consult a dictionary for their definitions."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-4",
"text": "However, rarelyused senses or emerging entities are not always covered by the hand-crafted definitions in existing dictionaries, which can cause problems in text comprehension."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-5",
"text": "This paper undertakes a task of describing (or defining) a given expression (word or phrase) based on its usage context, and presents a novel neuralnetwork generator for expressing its meaning as a natural language description."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-6",
"text": "Experimental results on four datasets (including WordNet, Oxford and Urban Dictionaries, and Wikipedia) demonstrate the effectiveness of our method over previous methods for definition generation (Noraset et al., 2017; Gadetsky et al., 2018) and non-standard English explanation (Ni and Wang, 2017) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-9",
"text": "When we read news text with emerging entities, text in unfamiliar domains, or text in foreign languages, we often encounter expressions (words or phrases) whose senses we are unsure of."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-10",
"text": "In such cases, we may first try to examine other usages of the same expression in the text, in order to infer its meaning from this context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-11",
"text": "Failing to do so, we may consult a dictionary, and in the case of polysemous words, choose an appropriate meaning based on the context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-12",
"text": "Acquiring novel word senses via dictionary definitions is known to be more effective than contextual guessing (Fraser, 1998; Chen, 2012) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-13",
"text": "However, very often, hand-crafted dictionaries do not contain definitions for rare or novel phrases/words, and we eventually give up on un- derstanding them completely, leaving us with only a shallow reading of the text."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-14",
"text": "There are several natural language processing (NLP) tasks that can roughly address this problem of unfamiliar word senses, all of which are incomplete in some way."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-15",
"text": "Word sense disambiguation (WSD) can basically only handle words (or senses) that are registered in a dictionary a priori."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-16",
"text": "Paraphrasing can suggest other ways of describing a word while keeping its meaning, but those paraphrases are generally context-insensitive and may not be sufficient for understanding."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-17",
"text": "To address this problem, Ni and Wang (2017) has proposed a task of describing a phrase in a given context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-18",
"text": "However, they follow the strict assumption that the target phrase is unknown and there is only a single local context available for the phrase, which makes the task of generating an accurate and coherent definition difficult (perhaps as difficult as a human comprehending the phrase itself)."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-19",
"text": "On the other hand, Noraset et al. (2017) attempted to generate a definition of a word from its word embedding induced from massive text, followed by Gadetsky et al. (2018) that refers to a local context to define a polysemous word with a local context by choosing relevant dimensions of their embeddings."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-20",
"text": "Although these research efforts revealed that both local and global contexts of words are useful in generating their definitions, none of these studies exploited both local and global contexts directly."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-21",
"text": "In this study, we tackle a task of describing (defining) a phrase when given its local context as (Ni and Wang, 2017) , while allowing access to other usage examples via word embeddings trained from massive text (global contexts) (Noraset et al., 2017; Gadetsky et al., 2018) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-22",
"text": "We present LOG-Cad, a neural network-based description generator (Figure 1 ) to directly solve this task."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-23",
"text": "Given a word with its context, our generator takes advantage of the target word's embedding, pre-trained from massive text (global contexts), while also encoding the given local context, combining both to generate a natural language description."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-24",
"text": "The local and global contexts complement one another and are both essential; global contexts are crucial when local contexts are short and vague, while the local context is crucial when the target phrase is polysemous, rare, or unseen."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-25",
"text": "Considering various contexts where we need definitions of phrases, we evaluated our method with four datasets including WordNet (Noraset et al., 2017) for general words, the Oxford dictionary (Gadetsky et al., 2018) for polysemous words, Urban Dictionary (Ni and Wang, 2017) for rare idioms or slangs, and a newlycreated Wikipedia dataset for entities."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-26",
"text": "Experimental results demonstrate the effectiveness of our method against the three baselines stated above (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-27",
"text": "Our contributions are as follows:"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-28",
"text": "\u2022 We set up a general task of defining phrases given their contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-29",
"text": "This task is a generalization of three related tasks (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) and involves various situations where we need definitions of phrases."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-30",
"text": "\u2022 We build a large-scale dataset from Wikipedia and Wikidata for the proposed task."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-31",
"text": "\u2022 We propose a method for generating natural language definitions for phrases with contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-32",
"text": "\u2022 Empirical results are strong; this method achieves the state-of-the-art performance for our new dataset and the three existing datasets used in the related studies (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-33",
"text": "We will release the dataset to the public as well as all of the code to promote reproducibility of the experiments."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-35",
"text": "**CONTEXT-AWARE DESCRIPTION GENERATION**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-36",
"text": "In what follows, we define our task of describing a phrase or word in a specific context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-37",
"text": "Given expression X trg with its context X = {x 1 , \u00b7 \u00b7 \u00b7 , x I }, our task is to output a description Y = {y 1 , ..., y T }."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-38",
"text": "Here, X trg can be a single word or a short phrase and is included in X. Y is a definition-like concrete and concise phrase/sentence that describes the expression X trg ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-39",
"text": "For example, given a phrase \"sonic boom\" with its context \"the shock wave may be caused by sonic boom or by explosion,\" the task is to generate a description such as \"sound created by an object moving fast.\" If the given context has been changed to \"this is the first official tour to support the band's latest studio effort, 2009's Sonic Boom,\" then the appropriate output would be \"album by Kiss.\""
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-40",
"text": "The process of description generation can be modeled with a conditional language model as"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-41",
"text": "p(y t |y 1 , ..., y t\u22121 , X, X trg )."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-42",
"text": "(1) 3 LOG-CaD: Local & Global Context-aware Description Generator"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-43",
"text": "We propose LOG-CaD, a neural model that generates the description of a given phrase or word by using its local and global contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-44",
"text": "In the rest of this section, we first describe our idea of utilizing local and global contexts in the description generation task, then present the details of our model."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-45",
"text": "Local & Global Contexts for Description Generation In this paper, we refer to the explicit contextual information included in a single sentence as \"local context,\" and the implicit contextual information in the word/phrase embedding trained in an unsupervised manner on largescale corpora as \"global context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-46",
"text": "\" Previous work on the definition generation task (Noraset et al., 2017) has shown that global contexts can be useful clues when generating definitions of unknown words."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-47",
"text": "The intuition behind their method is that words with similar meanings tend to have similar definitions in a dictionary."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-48",
"text": "This can be seen as an extension of the Distributional Hypothesis (Harris, 1954; Firth, 1957) , which states words that share semantic meanings tend to appear in similar contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-49",
"text": "Additionally, work on the WSD task (Navigli, 2009) , novel sense detection (Erk, 2006; Lau et al., 2014) , and the non-standard word explanation task (Ni and Wang, 2017) have revealed that local contexts surrounding the word can help disambiguate its sense."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-50",
"text": "Based on these studies, we propose to incorporate both local and global contexts to describe an unknown expression."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-51",
"text": "Model Figure 1 shows an illustration of our LOG-CaD model."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-52",
"text": "Similarly to the standard encoder-decoder model with attention (Bahdanau et al., 2015; Luong and Manning, 2016) , it consists of two modules: a context encoder and a description decoder."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-53",
"text": "The challenge here is that the decoder needs to be conditioned not only on the local context, but also on its global context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-54",
"text": "To incorporate the different types of contexts, we propose to use a GATE function (Noraset et al., 2017) to dynamically control how the global and local contexts influence the generation of the description."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-55",
"text": "We use bi-directional and uni-directional LSTMs (Hochreiter and Schmidhuber, 1997) as our context encoder and description decoder (Figure 1 ), respectively."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-56",
"text": "Given a sentence X and a phrase X trg , the context encoder generates a sequence of continuous vectors"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-57",
"text": "where x i denotes the word embedding of word x i ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-58",
"text": "Then, the description decoder computes the conditional probability of a description Y with Eq. (1), which can be approximated with another LSTM as"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-59",
"text": "where s t is a hidden state of the decoder LSTM, and y t\u22121 is a jointly-trained word embedding of the previous output word y t\u22121 ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-60",
"text": "Considering the fact that the local context can be relatively long (e.g. around 20 words on average in the Wikipedia dataset that will be introduced in the next section) it is hard for a decoder to focus on important words in local contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-61",
"text": "In order to deal with this problem, ATTENTION(\u00b7) function in Eq. (4) decides which words in the local context X to focus on at each time step."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-62",
"text": "d t can be computed with an attention mechanism (Luong and Manning, 2016) as"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-63",
"text": "where U h and U s are matrices that map the encoder and decoder hidden states into a common space, respectively."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-64",
"text": "In order to capture prefixes and suffixes in X trg , we construct character-level CNNs (Eq. (5)) following (Noraset et al., 2017) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-65",
"text": "Note that the input to the CNNs is a sequence of words in X trg , which are concatenated with special character \" ,\" such as \"sonic boom.\" Following Noraset et al. (2017) , we set the kernels of length 2-6 and size 10, 30, 40, 40, 40 respectively with a stride of 1 to obtain a 160-dimensional vector c trg ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-66",
"text": "In addition to the local context and the character-information, we also utilize the global context obtained from massive text."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-67",
"text": "We achieve this by two different strategies proposed by Noraset et al. (2017) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-68",
"text": "First, we feed phrase embedding x trg to initialize the decoder as"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-69",
"text": "Here, phrase embedding x trg is calculated by simply summing up all the embeddings of words that consistute the phrase X trg ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-70",
"text": "Note that we use a random-initialized vector if no pre-trained embedding is available for the words in X trg ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-71",
"text": "As described in the previous section, we use both local and global contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-72",
"text": "In order to capture the interaction between two contexts and the description decoder, we adopt a GATE(\u00b7) function (Eq. (6) Table 2 : Domains, expressions to be described, and the coverage of pre-trained word embeddings of the four datasets."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-73",
"text": "context d t , and character-level information c trg as"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-74",
"text": "where \u03c3(\u00b7), \u2299 and ; denote sigmoid function, element-wise multiplication, and vector concatenation, respectively."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-75",
"text": "Ws and bs are weight matrices and bias terms."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-76",
"text": "Here, the update gate z t controls how much the original hidden state s t is to be changed, and the reset gate r t controls how much the information from f t contributes to word generation at each time step."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-78",
"text": "**WIKIPEDIA DATASET**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-79",
"text": "One of our goals is to describe infrequent/rare words and phrases such as proper nouns in a variety of domains depending on their surrounding context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-80",
"text": "However, among the three existing datasets, WordNet and Oxford dictionary mainly target the descriptions of relatively common words, and thus are non-ideal test beds for this goal."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-81",
"text": "On the other hand, although the Urban Dictionary dataset contains descriptions of rarelyused phrases as well, the domain of its targeted words and phrases is limited to Internet slang."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-82",
"text": "Therefore, in order to confirm that our model can generate the description of rarely-used phrases as well as words, we constructed a new dataset for context-aware phrase description generation from Wikipedia 1 and Wikidata 2 which contain a wide variety of entity descriptions with contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-83",
"text": "Table 1 and Table 2 show the properties and statistics of the new dataset and the three existing datasets, respectively."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-84",
"text": "The overview of the data extraction process is shown in Figure 2 ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-85",
"text": "Similarly to the WordNet dataset, each entry in the dataset consists of (1) a phrase, (2) its description, and (3) context (a sentence)."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-86",
"text": "For preprocessing, we applied Stanford Tokenizer 3 to the descriptions of Wikidata items and the articles in Wikipedia."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-87",
"text": "Next, we removed phrases in parentheses from the Wikipedia articles, since they tend to be paraphrasing in other languages and work as noise."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-88",
"text": "To obtain the contexts of each item in Wikidata, we extracted the sentence which has a link referring to the item through all the first paragraphs of Wikipedia articles and replaced the phrase of the links with a special token [TRG] ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-90",
"text": "This utilization of links makes it possible to resolve the ambiguity of words and phrases in a sentence without human annotations, which is one of the major advantages of using Wikipedia."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-91",
"text": "Note that we used only links whose anchor texts are identical to the title of the Wikipedia articles, since the users of Wikipedia sometimes link mentions to related articles."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-93",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-94",
"text": "We evaluate our method by applying it to describe words in WordNet (Miller, 1995) and Oxford Dictionary, 4 phrases in Urban Dictionary 5 and Wiki- data."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-95",
"text": "6 For all of these datasets, a given word or phrase has an inventory of senses with corresponding definitions and usage examples."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-96",
"text": "These definitions are regarded as ground-truth descriptions."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-97",
"text": "Datasets To evaluate our model on the word description task on WordNet, we followed Noraset et al. (2017) and extracted data from WordNet 7 using the dict-definition 8 toolkit."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-98",
"text": "Each entry in the data consists of three elements: (1) a word, (2) its definition, and (3) a usage example of the word."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-99",
"text": "We split this dataset to obtain Train, Validation, and Test sets."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-100",
"text": "If a word has multiple definitions/examples, we treat them as different entries."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-101",
"text": "Note that the words are mutually exclusive across the three sets."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-102",
"text": "The only difference between our dataset and theirs is that we extract the tuples only if the words have their usage examples in WordNet."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-103",
"text": "Since not all entries in WordNet have usage examples, our dataset is a small subset of (Noraset et al., 2017 ) (see Table 1 )."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-104",
"text": "In addition to WordNet, we use the Oxford Dictionary following (Gadetsky et al., 2018) , the Urban Dictionary following (Ni and Wang, 2017) and our Wikipedia dataset described in the previous section."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-105",
"text": "In order to control the experiments on four datasets, we use the same pre-trained CBOW 9 vectors as global context following (Noraset et al., 6 Dataset will be made available upon publication."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-106",
"text": "7 https://wordnet.princeton.edu/ 8 https://github.com/NorThanapon/dict-definition 9 GoogleNews-vectors-negative300.bin.gz at https://code.google.com/archive/p/word2vec/ Table 3 : Hyperparameters of the models 2017)."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-107",
"text": "If the expression to be described consists of multiple words, its phrase embedding is calculated by simply summing up all the CBOW vectors of words in the phrase, such as \"sonic\" and \"boom.\" (See Figure 1) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-108",
"text": "If pre-trained CBOW embeddings are unavailable, we instead use a special [UNK] vector (which is randomly initialized with a uniform distribution) as word embeddings."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-109",
"text": "Note that our pre-trained embeddings only cover 26.79% of the words in the expressions to be described in our Wikipedia dataset, while it covers all words in WordNet dataset (See Table 2 )."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-110",
"text": "Even if no reliable word embeddings are available, all models can capture the character information through character-level CNNs (See Figure 1) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-112",
"text": "**MODELS**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-113",
"text": "We implemented four methods including three baselines: (1) Global, (2) Local, (3) I-Attention, and our proposed model, (4) LOGCaD. The Global model is our reimplementation of the strongest model (S + G + CH) in (Noraset et al., 2017) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-114",
"text": "It can access the embedding (global context) of the phrase to be described, but has no ability to read the usage examples (local context)."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-115",
"text": "The Local model is the reimplementation of the best model (dual encoder) in (Ni and Wang, 2017) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-116",
"text": "In order to make a fair comparison of the effectiveness of local and global contexts, we slightly modify the original implementation of (Ni and Wang, 2017 ); as the character-level encoder in the Local model, we adopt CNNs that are exactly the same as the other two models instead of the original LSTMs."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-117",
"text": "The I-Attention is our reimplementation of the best model (S + I-Attention) in (Gadetsky et al., 2018) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-118",
"text": "Similar to our model, it uses both local and global contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-119",
"text": "Unlike our model, however, their model cannot directly use the local context to predict the words in descriptions."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-120",
"text": "This is because the IAttention model indirectly uses the local context only to filter out unrelated information in phrase embeddings."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-121",
"text": "All four models (Table 3) are implemented with the PyTorch framework."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-122",
"text": "10"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-123",
"text": "Results Table 4 shows the performance of the models."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-124",
"text": "We can see that the LOG-CaD model consistently outperforms the three baselines in all four datasets."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-125",
"text": "This result indicates that using both local and global contexts helps describe the words correctly."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-126",
"text": "While the I-Attention model also uses local and global contexts, its performance was always lower than the LOG-CaD model."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-127",
"text": "This phenomenon shows that using local context to predict description is more effective than using it to disambiguate the meanings in global context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-128",
"text": "In particular, the low BLEU scores of Global and I-Attention models on Wikipedia dataset suggest that it is necessary to learn to ignore the noisy information in global context if the coverage of pre-trained word embeddings is extremely low (see the third and fourth rows in Table 2 )."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-129",
"text": "We suspect that the Urban Dictionary task is too difficult and the results are unreliable considering its extremely low BLEU scores and high ratio of unknown tokens in generated descriptions."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-130",
"text": "Table 5 through Table 8 show the input/output examples of the words/phrases in WordNet and Wikipedia datasets, respectively."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-131",
"text": "When comparing the two datasets shown in the two tables, the quality of generated descriptions of Wikipedia dataset is significantly better than that of WordNet dataset."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-132",
"text": "The main reason for this result is that the size of training data of the Wikipedia dataset is 64x larger than the WordNet dataset (Table 1) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-133",
"text": "For all examples in both datasets in Table 5 through Table 8 , the Global model can only generate a single description for each input phrase because it cannot access any local context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-134",
"text": "In the Wikipedia dataset, both the Local and LOG-CaD models can describe the word/phrase considering its local context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-135",
"text": "For example, both the Local and LOG-CaD models could generate \"american\" in the description for \"daniel o'neill\" given \"united states\" in Context #1, while they could generate \"british\" given \"belfast\" in Context #2."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-136",
"text": "On the other hand, the I-Attention model could not describe the two phrases, taking into account the local contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-137",
"text": "We will present an analysis of this phenomenon in the next section."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-139",
"text": "**DISCUSSION**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-140",
"text": "In this section, we present some analyses on how the local and global contexts contribute to the description generation task."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-141",
"text": "First, we discuss how the local context helps the models to describe a phrase."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-142",
"text": "Then, we analyze the impact of global context under the situation where local context is unreliable."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-143",
"text": "q-lets and co. is a filipino and english informative children 's show on q in the philippines ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-144",
"text": "she was a founding producer of the cbc radio one show \" q \" ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-145",
"text": "the q awards are the uk 's annual music awards run by the music magazine \" q \" ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-146",
"text": "charles fraser-smith was an author and one-time missionary who is widely credited as being the inspiration for ian fleming 's james bond quartermaster q ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-147",
"text": "Reference: philippine tv network canadian radio show british music magazine fictional character from james bond Table 6 : The generated descriptions for \"q\" in Wikipedia."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-148",
"text": "Input: gracious"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-150",
"text": "**CONTEXT: #1 #2**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-151",
"text": "gracious even to unexpected visitors thanks to the gracious gods Reference: characterized by charm , good taste , and generosity of spirit disposed to bestow favors"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-153",
"text": "**GLOBAL: GRACIOUS AND GRACIOUS**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-154",
"text": "Local: to or relating to or characteristic of a profession having a sweet manner I-Attention: a feeling of thankfulness having a strong liking LOG-CaD: to be given to a particular purpose having the greatest degree of a good degree Table 7 : The generated descriptions for \"gracious\" in WordNet."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-156",
"text": "**HOW DO THE MODELS UTILIZE LOCAL CONTEXTS?**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-157",
"text": "Local context helps us (1) disambiguate polysemous words and (2) infer the meanings of unknown expressions."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-158",
"text": "In this section, we will discuss the two roles of local context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-159",
"text": "Considering that the pre-trained word embeddings are obtained from word-level cooccurrences in a massive text, more information is mixed up into a single vector as the more senses the word has."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-160",
"text": "While Gadetsky et al. (2018) designed the I-Attention model to filter out unrelated meanings in the global context given local context, they did not discuss the impact the number of senses has on the performance of definition generation."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-161",
"text": "To understand the influence of the ambiguity of phrases to be defined on the generation performance, we did an analysis on our Wikipedia dataset."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-162",
"text": "Figure 3(a) shows that the description generation task becomes harder as the phrases to be described become more ambiguous."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-163",
"text": "In particular, when a phrase has an extremely large number of senses, (i.e., #senses \u2265 4), the Global model drops its performance significantly."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-164",
"text": "This result indicates that the local context is necessary to disambiguate the meanings in global context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-165",
"text": "As shown in Table 2 , a large proportion of the phrases in our Wikipedia dataset includes unknown words (i.e., only 26.79% of words have their pre-trained embeddings)."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-166",
"text": "This fact indicates Figure 3: Impact of various parameters of a phrase to be described on BLEU scores of the generated descriptions."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-167",
"text": "that the global context in the dataset is extremely noisy."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-168",
"text": "Then our next question is, how does the lack of information from global context affect the performance of phrase description?"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-169",
"text": "Figure 3( b) shows the impact of unknown words in the phrases to be described on the performance."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-170",
"text": "As we can see from the result, the advantage of LOG-CaD and Local models over Global and I-Attention models becomes larger as the unknown words increases."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-171",
"text": "This result suggests that we need to fully utilize local contexts especially in practical applications where the phrases to be defined have many unknown words."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-172",
"text": "6.2 How do the models utilize global contexts?"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-173",
"text": "As discussed earlier, local contexts are important to describe expressions, but how about global contexts?"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-174",
"text": "Assuming a situation where we cannot obtain much information from local contexts (e.g., infer the meaning of \"boswellia\" from a short local context \"Here is a boswellia\"), global contexts should be essential to understand the meaning."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-175",
"text": "To confirm this hypothesis, we analyzed the impact of the length of local contexts on BLEU scores."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-176",
"text": "Figure 3(c) shows that when the length of local context is extremely short (l \u2264 10), the LOGCaD model becomes much stronger than the Local model."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-177",
"text": "This result indicates that not only local context but also global context help models describe the meanings of phrases."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-179",
"text": "**RELATED WORK**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-180",
"text": "In this study, we address a task of describing a given phrase/word with its context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-181",
"text": "In what follows, we explain several tasks that are related to our task."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-182",
"text": "Our task is closely related to word sense disambiguation (WSD) (Navigli, 2009) , which identifies a pre-defined sense for the target word with its context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-183",
"text": "Although we can use it to solve our task by retrieving the definition sentence for the sense identified by WSD, it requires a substantial amount of training data to handle a different set of meanings of each word, and cannot handle words (or senses) which are not registered in the dictionary."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-184",
"text": "Although some studies have attempted to detect novel senses of words for given contexts (Erk, 2006; Lau et al., 2014) , they do not provide definition sentences."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-185",
"text": "Our task avoids these difficulties in WSD by directly generating descriptions for phrases or words with their contexts."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-186",
"text": "It also allows us to flexibly tailor a fine-grained definition for the specific context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-187",
"text": "Paraphrasing (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010) (or text simplification (Siddharthan, 2014) ) can be used to rephrase words with unknown senses."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-188",
"text": "However, the target of paraphrase acquisition are words (or phrases) with no specified context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-189",
"text": "Although several studies (Connor and Roth, 2007; Max, 2009; Max et al., 2012) consider sub-sentential (context-sensitive) paraphrases, they do not intend to obtain a definition-like description as a paraphrase of a word."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-190",
"text": "Recently, Noraset et al. (2017) introduced a task of generating a definition sentence of a word from its pre-trained embedding."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-191",
"text": "Since their task does not take local contexts of words as inputs, their method cannot generate an appropriate definition for a polysemous word for a specific context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-192",
"text": "To cope with this problem, Gadetsky et al. (2018) have proposed a definition generation method that works with polysemous words in dictionaries."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-193",
"text": "They present a model that utilizes local context to filter out the unrelated meanings from a pre-trained word embedding in a specific context."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-194",
"text": "While their method use local context only for disambiguating the meanings that are mixed up in word embeddings, the information from local contexts cannot be utilized if the pre-trained embeddings are unavailable or unreliable."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-195",
"text": "On the other hand, our method can fully utilize the local context through an attentional mechanism, even if the reliable word embeddings are unavailable."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-196",
"text": "Focusing on non-standard English words (or phrases), Ni and Wang (2017) generated their explanations solely from sentences with those words."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-197",
"text": "Their model does not take advantage of global contexts (word embeddings induced from massive text) as was used in Noraset et al. (2017) ."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-198",
"text": "Our task of describing phrases with its given context is a generalization of these three tasks (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) , and the proposed method naturally utilizes both local and global contexts of a word in question."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-199",
"text": "----------------------------------"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-200",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-201",
"text": "This paper sets up a task of generating a natural language description for a word/phrase with a specific context, aiming to help us acquire unknown word senses when reading text."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-202",
"text": "We approached this task by using a variant of encoder-decoder models that capture the given local context by an encoder and global contexts by the target word's embedding induced from massive text."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-203",
"text": "Experimental results on three existing datasets and one novel dataset built from Wikipedia dataset confirmed that the use of both local and global contexts is the key to generating appropriate contextsensitive description in various situations."
},
{
"sent_id": "a5f33403d23cdc0532547266f1841a-C001-204",
"text": "We plan to modify our model to use multiple contexts in text to improve the quality of descriptions, considering the \"one sense per discourse\" hypothesis (Gale et al., 1992) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a5f33403d23cdc0532547266f1841a-C001-19"
],
[
"a5f33403d23cdc0532547266f1841a-C001-46"
],
[
"a5f33403d23cdc0532547266f1841a-C001-190"
],
[
"a5f33403d23cdc0532547266f1841a-C001-197"
]
],
"cite_sentences": [
"a5f33403d23cdc0532547266f1841a-C001-19",
"a5f33403d23cdc0532547266f1841a-C001-46",
"a5f33403d23cdc0532547266f1841a-C001-190",
"a5f33403d23cdc0532547266f1841a-C001-197"
]
},
"@MOT@": {
"gold_contexts": [
[
"a5f33403d23cdc0532547266f1841a-C001-19",
"a5f33403d23cdc0532547266f1841a-C001-20",
"a5f33403d23cdc0532547266f1841a-C001-21"
]
],
"cite_sentences": [
"a5f33403d23cdc0532547266f1841a-C001-19"
]
},
"@USE@": {
"gold_contexts": [
[
"a5f33403d23cdc0532547266f1841a-C001-25"
],
[
"a5f33403d23cdc0532547266f1841a-C001-54"
],
[
"a5f33403d23cdc0532547266f1841a-C001-64",
"a5f33403d23cdc0532547266f1841a-C001-65"
],
[
"a5f33403d23cdc0532547266f1841a-C001-67"
],
[
"a5f33403d23cdc0532547266f1841a-C001-97"
],
[
"a5f33403d23cdc0532547266f1841a-C001-113"
],
[
"a5f33403d23cdc0532547266f1841a-C001-197",
"a5f33403d23cdc0532547266f1841a-C001-198"
]
],
"cite_sentences": [
"a5f33403d23cdc0532547266f1841a-C001-25",
"a5f33403d23cdc0532547266f1841a-C001-54",
"a5f33403d23cdc0532547266f1841a-C001-64",
"a5f33403d23cdc0532547266f1841a-C001-65",
"a5f33403d23cdc0532547266f1841a-C001-67",
"a5f33403d23cdc0532547266f1841a-C001-97",
"a5f33403d23cdc0532547266f1841a-C001-113",
"a5f33403d23cdc0532547266f1841a-C001-197"
]
},
"@DIF@": {
"gold_contexts": [
[
"a5f33403d23cdc0532547266f1841a-C001-25",
"a5f33403d23cdc0532547266f1841a-C001-26"
]
],
"cite_sentences": [
"a5f33403d23cdc0532547266f1841a-C001-25"
]
}
}
},
"ABC_e9a7e0d6d09fb2a2dd1972d6d16682_3": {
"x": [
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-2",
"text": "Transition-based approaches have shown competitive performance on constituent and dependency parsing of Chinese."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-3",
"text": "Stateof-the-art accuracies have been achieved by a deterministic shift-reduce parsing model on parsing the Chinese Treebank 2 data (Wang et al., 2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-4",
"text": "In this paper, we propose a global discriminative model based on the shift-reduce parsing process, combined with a beam-search decoder, obtaining competitive accuracies on CTB2."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-5",
"text": "We also report the performance of the parser on CTB5 data, obtaining the highest scores in the literature for a dependencybased evaluation."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-8",
"text": "Transition-based statistical parsing associates scores with each decision in the parsing process, selecting the parse which is built by the highest scoring sequence of decisions (Briscoe and Carroll, 1993; Nivre et al., 2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-9",
"text": "The parsing algorithm is typically some form of bottom-up shiftreduce algorithm, so that scores are associated with actions such as shift and reduce."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-10",
"text": "One advantage of this approach is that the parsing can be highly efficient, for example by pursuing a greedy strategy in which a single action is chosen at each decision point."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-11",
"text": "The alternative approach, exemplified by Collins (1997) and Charniak (2000) , is to use a chart-based algorithm to build the space of possible parses, together with pruning of lowprobability constituents and the Viterbi algorithm to find the highest scoring parse."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-12",
"text": "For English dependency parsing, the two approaches give similar results (McDonald et al., 2005; Nivre et al., 2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-13",
"text": "For English constituent-based parsing using the Penn Treebank, the best performing transitionbased parser lags behind the current state-of-theart (Sagae and Lavie, 2005) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-14",
"text": "In contrast, for Chinese, the best dependency parsers are currently transition-based (Duan et al., 2007; Zhang and Clark, 2008) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-15",
"text": "For constituent-based parsing using the Chinese Treebank (CTB), Wang et al. (2006) have shown that a shift-reduce parser can give competitive accuracy scores together with high speeds, by using an SVM to make a single decision at each point in the parsing process."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-16",
"text": "In this paper we describe a global discriminative model for Chinese shift-reduce parsing, and compare it with Wang et al.'s approach."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-39",
"text": "For these sentences, Wang's parser will be unable to produce the unary-branching roots because the parsing process terminates as soon as the root is found."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-17",
"text": "We apply the same shift-reduce procedure as Wang et al. (2006) , but instead of using a local classifier for each transition-based action, we train a generalized perceptron model over complete sequences of actions, so that the parameters are learned in the context of complete parses."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-18",
"text": "We apply beam search to decoding instead of greedy search."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-19",
"text": "The parser still operates in linear time, but the use of beam-search allows the correction of local decision errors by global comparison."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-20",
"text": "Using CTB2, our model achieved Parseval F-scores comparable to Wang et al.'s approach."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-21",
"text": "We also present accuracy scores for the much larger CTB5, using both a constituent-based and dependency-based evaluation."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-22",
"text": "The scores for the dependency-based evaluation were higher than the state-of-the-art dependency parsers for the CTB5 data."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-23",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-24",
"text": "**THE SHIFT-REDUCE PARSING PROCESS**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-25",
"text": "The shift-reduce process used by our beam-search decoder is based on the greedy shift-reduce parsers of Sagae and Lavie (2005) and Wang et al. (2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-26",
"text": "The process assumes binary-branching trees; section 2.1 explains how these are obtained from the arbitrary-branching trees in the Chinese Treebank."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-27",
"text": "The input is assumed to be segmented and POS tagged, and the word-POS pairs waiting to be processed are stored in a queue."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-28",
"text": "A stack holds the partial parse trees that are built during the parsing process."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-29",
"text": "A parse state is defined as a stack,queue pair."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-30",
"text": "Parser actions, including SHIFT and various kinds of REDUCE, define functions from states to states by shifting word-POS pairs onto the stack and building partial parse trees."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-31",
"text": "The actions used by the parser are:"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-32",
"text": "\u2022 SHIFT, which pushes the next word-POS pair in the queue onto the stack;"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-33",
"text": "\u2022 REDUCE-unary-X, which makes a new unary-branching node with label X; the stack is popped and the popped node becomes the child of the new node; the new node is pushed onto the stack;"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-34",
"text": "\u2022 REDUCE-binary-{L/R}-X, which makes a new binary-branching node with label X; the stack is popped twice, with the first popped node becoming the right child of the new node and the second popped node becoming the left child; the new node is pushed onto the stack;"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-35",
"text": "\u2022 TERMINATE, which pops the root node off the stack and ends parsing."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-36",
"text": "This action is novel in our parser."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-37",
"text": "Sagae and Lavie (2005) and Wang et al. (2006) only used the first three transition actions, setting the final state as all incoming words having been processed, and the stack containing only one node."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-38",
"text": "However, there are a small number of sentences (14 out of 3475 from the training data) that have unary-branching roots."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-40",
"text": "We define a separate action to terminate parsing, allowing unary reduces to be applied to the root item before parsing finishes."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-41",
"text": "The trees built by the parser are lexicalized, using the head-finding rules from Zhang and Clark (2008) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-42",
"text": "The left (L) and right (R) versions of the REDUCE-binary rules indicate whether the head of Figure 2 : the binarization algorithm with input T the new node is to be taken from the left or right child."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-43",
"text": "Note also that, since the parser is building binary trees, the X label in the REDUCE rules can be one of the temporary constituent labels, such as NP * , which are needed for the binarization process described in Section 2.1."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-44",
"text": "Hence the number of left and right binary reduce rules is the number of constituent labels in the binarized grammar."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-45",
"text": "Wang et al. (2006) give a detailed example showing how a segmented and POS-tagged sentence can be incrementally processed using the shift-reduce actions to produce a binary tree."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-46",
"text": "We show this example in Figure 1 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-48",
"text": "**THE BINARIZATION PROCESS**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-49",
"text": "The algorithm in Figure 2 is used to map CTB trees into binarized trees, which are required by the shift-reduce parsing process."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-50",
"text": "For any tree node with more than two child nodes, the algorithm works by first finding the head node, and then processing its right-hand-side and left-hand-side, respectively."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-51",
"text": "The head-finding rules are taken from Zhang and Clark (2008) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-52",
"text": "Y = X 1 ..X m represents a tree node Y with child nodes X 1 ...X m (m \u2265 1)."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-53",
"text": "The label of the newly generated node Y * is based on the constituent label of the original node Y , but marked with an asterix."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-54",
"text": "Hence binarization enlarges the set of constituent labels."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-55",
"text": "We call the constituents marked with * temporary constituents."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-56",
"text": "The binarization process is reversible, in that output from the shift-reduce parser can be unbinarized into CTB format, which is required for evaluation."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-58",
"text": "**RESTRICTIONS ON THE SEQUENCE OF ACTIONS**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-59",
"text": "Not all sequences of actions produce valid binarized trees."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-60",
"text": "In the deterministic parser of Wang et al. (2006) , the highest scoring action predicted by the classifier may prevent a valid binary tree from being built."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-61",
"text": "In this case, Wang et al. simply return a partial parse consisting of all the subtrees on the stack."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-62",
"text": "In our parser a set of restrictions is applied which guarantees a valid parse tree."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-63",
"text": "For example, two simple restrictions are that a SHIFT action can only be applied if the queue of incoming words Variables: state item item = (S, Q), where S is stack and Q is incoming queue; the agenda agenda; list of state items next; Algorithm:"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-64",
"text": "for item \u2208 agenda: if item.score = agenda.bestScore and item.isFinished:"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-65",
"text": "next.push(item."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-66",
"text": "TakeAction(move)) agenda = next.getBBest() Outputs: rval Figure 3 : the beam-search decoding algorithm is non-empty, and the binary reduce actions can only be performed if the stack contains at least two nodes."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-67",
"text": "Some of the restrictions are more complex than this; the full set is listed in the Appendix."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-69",
"text": "**DECODING WITH BEAM SEARCH**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-70",
"text": "Our decoder is based on the incremental shiftreduce parsing process described in Section 2."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-71",
"text": "We apply beam-search, keeping the B highest scoring state items in an agenda during the parsing process."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-72",
"text": "The agenda is initialized with a state item containing the starting state, i.e. an empty stack and a queue consisting of all word-POS pairs from the sentence."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-73",
"text": "At each stage in the decoding process, existing items from the agenda are progressed by applying legal parsing actions."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-74",
"text": "From all newly generated state items, the B highest scoring are put back on the agenda."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-75",
"text": "The decoding process is terminated when the highest scored state item in the agenda reaches the final state."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-76",
"text": "If multiple state items have the same highest score, parsing terminates if any of them are finished."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-77",
"text": "The algorithm is shown in Figure 3 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-79",
"text": "**MODEL AND LEARNING ALGORITHM**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-80",
"text": "We use a linear model to score state items."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-81",
"text": "Recall that a parser state is a stack,queue pair, with the stack holding subtrees and the queue holding incoming words waiting to be processed."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-82",
"text": "The score Inputs: training examples (x i , y i ) Initialization: set w = 0 Algorithm:"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-83",
"text": "where \u03a6(Y ) is the global feature vector from Y , and w is the weight vector defined by the model."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-84",
"text": "Each element from \u03a6(Y ) represents the global count of a particular feature from Y ."
},
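With \u03a6(Y) and w stored as sparse dictionaries, the global linear score is a plain dot product; a minimal sketch (the feature-name strings are illustrative):

```python
def state_score(phi, w):
    """score(Y) = Phi(Y) . w, where phi maps feature names to global
    counts and w maps feature names to learned weights (default 0)."""
    return sum(count * w.get(feat, 0.0) for feat, count in phi.items())
```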
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-85",
"text": "The feature set consists of a large number of features which pick out various configurations from the stack and queue, based on the words and subtrees in the state item."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-86",
"text": "The features are described in Section 4.1."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-87",
"text": "The weight values are set using the generalized perceptron algorithm (Collins, 2002) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-88",
"text": "The perceptron algorithm is shown in Figure 4 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-89",
"text": "It initializes weight values as all zeros, and uses the current model to decode training examples (the parse function in the pseudo-code)."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-90",
"text": "If the output is correct, it passes on to the next example."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-91",
"text": "If the output is incorrect, it adjusts the weight values by adding the feature vector from the goldstandard output and subtracting the feature vector from the parser output."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-92",
"text": "Weight values are updated for each example (making the process online) and the training data is iterated over T times."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-93",
"text": "In order to avoid overfitting we used the now-standard averaged version of this algorithm (Collins, 2002) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-94",
"text": "We also apply the early update modification from Collins and Roark (2004) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-95",
"text": "If the agenda, at any point during the decoding process, does not contain the correct partial parse, it is not possible for the decoder to produce the correct output."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-96",
"text": "In this case, decoding is stopped early and the weight values are updated using the highest scoring partial parse on the agenda."
},
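The training loop described above can be sketched as follows (the averaging step is omitted for brevity, and early update is assumed to be folded into the decoder); `parse` and `features` are hypothetical stand-ins for the beam-search decoder and the global feature extractor.

```python
def train_perceptron(examples, parse, features, T=10):
    """Generalized perceptron: weights start at zero, and for each
    incorrect output we add the gold feature vector and subtract the
    predicted one.  With early update, parse may return the highest-
    scoring partial parse instead of a full output."""
    w = {}  # all weights implicitly zero
    for _ in range(T):                 # T passes over the training data
        for x, gold in examples:
            output = parse(x, w, gold=gold)
            if output != gold:
                for f, v in features(gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in features(output).items():
                    w[f] = w.get(f, 0.0) - v
    return w
```

A toy one-example run shows the gold-add / predicted-subtract update converging after a single mistake.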
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-97",
"text": "Table 1 shows the set of feature templates for the model."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-98",
"text": "Individual features are generated from Description Feature templates Unigrams S0tc, S0wc, S1tc, S1wc, S2tc, S2wc, S3tc, S3wc, N0wt, N1wt, N2wt, N3wt, S0lwc, S0rwc, S0uwc, S1lwc, S1rwc, S1uwc, Bigrams S0wS1w, S0wS1c, S0cS1w, S0cS1c, S0wN0w, S0wN0t, S0cN0w, S0cN0t, N0wN1w, N0wN1t, N0tN1w, N0tN1t S1wN0w, S1wN0t, S1cN0w, S1cN0t,"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-100",
"text": "**FEATURE SET**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-101",
"text": "Separator S0wp, S0wcp, S0wq, S0wcq, S1wp, S1wcp, S1wq, S1wcq"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-102",
"text": "S0cS1cp, S0cS1cq Table 1 : Feature templates these templates by first instantiating a template with particular labels, words and tags, and then pairing the instantiated template with a particular action."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-103",
"text": "In the table, the symbols S 0 , S 1 , S 2 , and S 3 represent the top four nodes on the stack, and the symbols N 0 , N 1 , N 2 and N 3 represent the first four words in the incoming queue."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-104",
"text": "S 0 L, S 0 R and S 0 U represent the left and right child for binary branching S 0 , and the single child for unary branching S 0 , respectively; w represents the lexical head token for a node; c represents the label for a node."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-105",
"text": "When the corresponding node is a terminal, c represents its POS-tag, whereas when the corresponding node is non-terminal, c represents its constituent label; t represents the POS-tag for a word."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-106",
"text": "The context S 0 , S 1 , S 2 , S 3 and N 0 , N 1 , N 2 , N 3 for the feature templates is taken from Wang et al. (2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-107",
"text": "However, Wang et al. (2006) used a polynomial kernel function with an SVM and did not manually create feature combinations."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-129",
"text": "The tests were performed using the development test data and gold-standard POStags."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-108",
"text": "Since we used the linear perceptron algorithm we manually combined Unigram features into Bigram and Trigram features."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-109",
"text": "The \"Bracket\" row shows bracket-related features, which were inspired by Wang et al. (2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-110",
"text": "Here brackets refer to left brackets including \"\uff08\", \"\"\" and \"\u300a\" and right brackets including \"\uff09\", \"\"\" and \"\u300b\"."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-111",
"text": "In the table, b represents the matching status of the last left bracket (if any) on the stack."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-112",
"text": "It takes three different values: 1 (no matching right bracket has been pushed onto stack), 2 (a matching right bracket has been pushed onto stack) and 3 (a matching right bracket has been pushed onto stack, but then popped off)."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-113",
"text": "The \"Separator\" row shows features that include one of the separator punctuations (i.e. \"\uff0c\", \"\u3002\", \"\u3001\" and \"\uff1b\") between the head words of S 0 and S 1 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-114",
"text": "These templates apply only when the stack contains at least two nodes; p represents a separator punctuation symbol."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-115",
"text": "Each unique separator punctuation between S 0 and S 1 is only counted once when generating the global feature vector."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-116",
"text": "q represents the count of any separator punctuation between S 0 and S 1 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-117",
"text": "Whenever an action is being considered at each point in the beam-search process, templates from Table 1 are matched with the context defined by the parser state and combined with the action to generate features."
},
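The template-matching step just described can be sketched as follows; the dict-based state and the template functions are illustrative assumptions, not the paper's data structures.

```python
def extract_features(state, action, templates):
    """Instantiate each matching template with values from the parser
    state and pair the instantiated template with the candidate action."""
    feats = []
    for name, instantiate in templates.items():
        value = instantiate(state)   # e.g. head words/labels of S0, S1, N0
        if value is not None:        # a template may not match (e.g. short stack)
            feats.append((name, value, action))
    return feats
```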
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-118",
"text": "Negative features, which are the features from incorrect parser outputs but not from any training example, are included in the model."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-119",
"text": "There are around a million features in our experiments with the CTB2 dataset."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-120",
"text": "Wang et al. (2006) used a range of other features, including rhythmic features of S 0 and S 1 (Sun and Jurafsky, 2003) , features from the most recently found node that is to the left or right of S 0 and S 1 , the number of words and the number of punctuations in S 0 and S 1 , the distance between S 0 and S 1 and so on."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-121",
"text": "We did not include these features in our parser, because they did not lead to improved performance during development experiments."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-123",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-124",
"text": "The experiments were performed using the Chinese Treebank 2 and Chinese Treebank 5 data."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-125",
"text": "Standard data preparation was performed before the experiments: empty terminal nodes were removed; any non-terminal nodes with no children were removed; any unary X \u2192 X nodes resulting from the previous steps were collapsed into one X node."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-126",
"text": "For all experiments, we used the EVALB tool 1 for evaluation, and used labeled recall (LR), labeled precision (LP ) and F 1 score (which is the harmonic mean of LR and LP ) to measure parsing accuracy."
},
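The three measures reduce to simple ratios over bracket counts; a minimal sketch, where matched is the number of correctly labeled brackets and gold_total/test_total are the bracket counts in the gold and test parses:

```python
def parseval(matched, gold_total, test_total):
    """Labeled recall, labeled precision and their harmonic mean F1."""
    lr = matched / gold_total
    lp = matched / test_total
    f1 = 2 * lr * lp / (lr + lp)
    return lr, lp, f1
```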
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-127",
"text": "Figure 5 shows the accuracy curves using different beam-sizes for the decoder."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-128",
"text": "The number of training iterations is on the x-axis with F -score on the y-axis."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-130",
"text": "The figure shows the benefit of using a beam size greater than 1, with comparatively little accuracy gain being obtained beyond a beam size of 8."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-131",
"text": "Hence we set the beam size to 16 for the rest of the experiments."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-133",
"text": "**THE INFLUENCE OF BEAM-SIZE**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-135",
"text": "**TEST RESULTS ON CTB2**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-136",
"text": "The experiments in this section were performed using CTB2 to allow comparison with previous work, with the CTB2 data extracted from Chinese Treebank 5 (CTB5 Table 3 : Accuracies on CTB2 with gold-standard POS-tags own implementation of the perceptron-based tagger from Collins (2002) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-137",
"text": "The results of various models measured using sentences with less than 40 words and using goldstandard POS-tags are shown in Table 3 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-138",
"text": "The rows represent the model from Bikel and Chiang (2000) , Bikel (2004) , the SVM and ensemble models from Wang et al. (2006) , and our parser, respectively."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-139",
"text": "The accuracy of our parser is competitive using this test set."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-140",
"text": "The results of various models using automatically assigned POS-tags are shown in Table 4 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-141",
"text": "The rows in the table represent the models from Bikel and Chiang (2000), Levy and Manning (2003) , Xiong et al. (2005) , Bikel (2004), Chiang and Bikel (2002) , the SVM model from Wang et al. (2006) and the ensemble system from Wang et al. (2006) , and the parser of this paper, respectively."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-142",
"text": "Our parser gave comparable accuracies to the SVM and ensemble models from Wang et al. (2006) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-143",
"text": "However, comparison with Table 3 shows that our parser is more sensitive to POS-tagging errors than some of the other models."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-144",
"text": "One possible reason is that some of the other parsers, e.g. Bikel (2004) , use the parser model itself to resolve tagging ambiguities, whereas we rely on a POS tagger to accurately assign a single tag to each word."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-145",
"text": "In fact, for the Chinese data, POS tagging accuracy is not very high, with the perceptron-based tagger achieving an accuracy of only 93%."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-146",
"text": "The beam-search decoding framework we use could accommodate joint parsing and tagging, although the use of features based on the tags of incoming words complicates matters somewhat, since these features rely on tags having been assigned to all words in a pre-processing step."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-147",
"text": "We leave this problem for future work."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-148",
"text": "In a recent paper, Petrov and Klein (2007) Table 4 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-149",
"text": "However, we did not include their scores in the table because they used a different training set from CTB5, which is much larger than the CTB2 training set used by all parsers in the table."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-150",
"text": "In order to make a comparison, we split the data in the same way as Petrov and Klein (2007) and tested our parser using automatically assigned POS-tags."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-151",
"text": "It gave LR and LP of 82.0% and 80.9% for sentences with less than 40 words and 77.8% and 77.4% for all sentences, significantly lower than Petrov and Klein (2007) , which we partly attribute to the sensitivity of our parser to pos tag errors (see Table 5 )."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-153",
"text": "**THE EFFECT OF TRAINING DATA SIZE**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-154",
"text": "CTB2 is a relatively small corpus, and so we investigated the effect of adding more training data from CTB5."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-155",
"text": "Intuitively, more training data leads to higher parsing accuracy."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-156",
"text": "By using increased amount of training sentences (Table 6 ) from CTB5 with the same development test data (Table 2) , we draw the accuracy curves with different number of training iterations ( Figure 6 )."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-157",
"text": "This experiment confirmed that the accuracy increases with the amount of training data."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-158",
"text": "Another motivation for us to use more training data is to reduce overfitting."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-159",
"text": "We invested considerable effort into feature engineering using CTB2, and found that a small variation of feature templates (e.g. changing the feature template S 0 cS 1 c from Table 1 to S 0 tcS 1 tc) can lead to a comparatively large change (up to 1%) in the accuracy."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-160",
"text": "One possible reason for this variation is the small size of the CTB2 training data."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-161",
"text": "When performing experiments using the larger set B from Table 5 presents the performance of the parser on CTB5."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-162",
"text": "We adopt the data split from Zhang and Clark (2008) , as shown in Table 7 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-163",
"text": "We used the same parser configurations as Section 5.2."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-164",
"text": "As an additional evaluation we also produced dependency output from the phrase-structure trees, using the head-finding rules, so that we can also compare with dependency parsers, for which the highest scores in the literature are currently from our previous work in Zhang and Clark (2008) ."
},
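Reading dependencies off a constituent tree with head-finding rules can be sketched as below; the tuple-based tree encoding and the `head_of` rule function are illustrative assumptions, not the actual head-finding rules used in the paper.

```python
def dependencies(tree, head_of):
    """Return (head word, arcs) for a tree given as (label, children),
    where a leaf is (POS, word) and head_of(label, child_labels) gives
    the index of the head child.  Each arc is (dependent, head)."""
    label, children = tree
    if isinstance(children, str):      # leaf node: (POS, word)
        return children, []
    arcs, heads = [], []
    for child in children:
        head, sub_arcs = dependencies(child, head_of)
        heads.append(head)
        arcs.extend(sub_arcs)
    h = head_of(label, [c[0] for c in children])
    # every non-head child's head word depends on the head child's head word
    arcs.extend((dep, heads[h]) for i, dep in enumerate(heads) if i != h)
    return heads[h], arcs
```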
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-165",
"text": "We compare the dependencies read off our constituent parser using CTB5 data with the dependency parser from Zhang and Clark (2008) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-166",
"text": "The same measures are taken and the accuracies with gold-standard POS-tags are shown in Table 8 ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-167",
"text": "Our constituent parser gave higher accuracy than the dependency parser."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-168",
"text": "It is interesting that, though the constituent parser uses many fewer feature templates than the dependency parser, the features do include constituent information, which is unavailable to dependency parsers."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-170",
"text": "**RELATED WORK**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-171",
"text": "Our parser is based on the shift-reduce parsing process from Sagae and Lavie (2005) and Wang et al. (2006) , and therefore it can be classified as a transition-based parser (Nivre et al., 2006 )."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-193",
"text": "A discriminative model allows consistent training of a wide range of different features."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-172",
"text": "An important difference between our parser and the Wang et al. (2006) parser is that our parser is based on a discriminative learning model with global features, whilst the parser from Wang et al. (2006) is based on a local classifier that optimizes each individual choice."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-173",
"text": "Instead of greedy local decoding, we used beam search in the decoder."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-174",
"text": "An early work that applies beam search to constituent parsing is Ratnaparkhi (1999) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-175",
"text": "The main difference between our parser and Ratnaparkhi's is that we use a global discriminative model, whereas Ratnaparkhi's parser has separate probabilities of actions chained together in a conditional model."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-176",
"text": "Both our parser and the parser from Collins and Roark (2004) use a global discriminative model and an incremental parsing process."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-177",
"text": "The major difference is the use of different incremental parsing processes."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-178",
"text": "To achieve better performance for Chinese parsing, our parser is based on the shiftreduce parsing process."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-179",
"text": "In addition, we did not include a generative baseline model in the discriminative model, as did Collins and Roark (2004) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-180",
"text": "Our parser in this paper shares similarity with our transition-based dependency parser from Zhang and Clark (2008) in the use of a discriminative model and beam search."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-181",
"text": "The main difference is that our parser in this paper is for constituent parsing."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-182",
"text": "In fact, our parser is one of only a few constituent parsers which have successfully applied global discriminative models, certainly without a generative baseline as a feature, whereas global models for dependency parsing have been comparatively easier to develop."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-184",
"text": "**CONCLUSION**"
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-185",
"text": "The contributions of this paper can be summarized as follows."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-186",
"text": "First, we defined a global discriminative model for Chinese constituent-based parsing, continuing recent work in this area which has focused on English (Clark and Curran, 2007; Carreras et al., 2008; Finkel et al., 2008) ."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-187",
"text": "Second, we showed how such a model can be applied to shiftreduce parsing and combined with beam search, resulting in an accurate linear-time parser."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-188",
"text": "In standard tests using CTB2 data, our parser achieved comparable Parseval F-score to the state-of-theart systems."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-189",
"text": "Moreover, we observed that more training data lead to improvements on both accuracy and stability against feature variations, and reported performance of the parser using CTB5 data."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-190",
"text": "By converting constituent-based output to dependency relations using standard head-finding rules, our parser also obtained the highest scores for a CTB5 dependency evaluation in the literature."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-191",
"text": "Due to the comparatively low accuracy for Chinese POS-tagging, the parsing accuracy dropped significantly when using automatically assigned POS-tags rather than gold-standard POS-tags."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-192",
"text": "In our further work, we plan to investigate possible methods of joint POS-tagging and parsing under the discriminative model and beam-search framework."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-194",
"text": "We showed in Zhang and Clark (2008) that it was possible to combine graph and transition-based dependency parser into the same global discriminative model."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-195",
"text": "Our parser framework in this paper allows the same integration of graph-based features."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-196",
"text": "However, preliminary experiments with features based on graph information did not show accuracy improvements for our parser."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-197",
"text": "One possible reason is that the transition actions for the parser in this paper already include graph information, such as the label of the newly generated constituent, while for the dependency parser in Zhang and Clark (2008) , transition actions do not contain graph information, and therefore the use of transition-based features helped to make larger improvements in accuracy."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-198",
"text": "The integration of graph-based features for our shift-reduce constituent parser is worth further study."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-199",
"text": "The source code of our parser is publicly available at http://www.sourceforge.net/projects/zpar."
},
{
"sent_id": "e9a7e0d6d09fb2a2dd1972d6d16682-C001-201",
"text": "\u2022 the binary reduce actions can only be performed when the stack contains at least two nodes, with at least one of the two nodes on top of stack (with R being the topmost and L being the second) being non-temporary; \u2022 if L is temporary with label X * , the resulting node must be labeled X or X * and leftheaded (i.e. to take the head word from L); similar restrictions apply when R is temporary; \u2022 when the incoming queue is empty and the stack contains only two nodes, binary reduce can be applied only if the resulting node is non-temporary; \u2022 when the stack contains only two nodes, temporary resulting nodes from binary reduce must be left-headed; \u2022 when the queue is empty and the stack contains more than two nodes, with the third node from the top being temporary, binary reduce can be applied only if the resulting node is non-temporary; \u2022 when the stack contains more than two nodes, with the third node from the top being temporary, temporary resulting nodes from binary reduce must be left-headed; \u2022 the terminate action can be performed when the queue is empty, and the stack size is one."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-15"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-37"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-45"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-60"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-107"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-120"
]
],
"cite_sentences": [
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-15",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-37",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-45",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-60",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-107",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-120"
]
},
"@MOT@": {
"gold_contexts": [
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-15",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-16"
]
],
"cite_sentences": [
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-15"
]
},
"@USE@": {
"gold_contexts": [
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-17"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-25"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-45",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-46"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-106"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-141"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-171"
]
],
"cite_sentences": [
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-17",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-25",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-45",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-106",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-141",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-171"
]
},
"@DIF@": {
"gold_contexts": [
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-17"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-107",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-108"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-120",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-121"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-142",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-143"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-172"
]
],
"cite_sentences": [
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-17",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-107",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-120",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-142",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-172"
]
},
"@EXT@": {
"gold_contexts": [
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-37",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-38",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-39",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-40"
]
],
"cite_sentences": [
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-37"
]
},
"@SIM@": {
"gold_contexts": [
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-109"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-138"
],
[
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-142"
]
],
"cite_sentences": [
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-109",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-138",
"e9a7e0d6d09fb2a2dd1972d6d16682-C001-142"
]
}
}
},
"ABC_0c3f9588b6f587d04c286384ca24e0_3": {
"x": [
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-227",
"text": "**CONCLUSION**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-2",
"text": "Supertagging is a sequence prediction task where each word is assigned a piece of complex syntactic structure called a supertag."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-3",
"text": "We provide a novel approach to multi-task learning for Tree Adjoining Grammar (TAG) supertagging by deconstructing these complex supertags in order to define a set of related but auxiliary sequence prediction tasks."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-4",
"text": "Our multi-task prediction framework is trained over the exactly same training data used to train the original supertagger where each auxiliary task provides an alternative view on the original prediction task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-5",
"text": "Our experimental results show that our multi-task approach significantly improves TAG supertagging with a new state-of-the-art accuracy score of 91.39% on the Penn treebank supertagging dataset."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-8",
"text": "A treebank for lexicalized tree-adjoining grammar (TAG) (Joshi and Schabes, 1997) consists of annotated sentences where each word is provided a complex tree structure called a supertag and the overall parse of the sentence combines these supertags into a parse tree."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-9",
"text": "Supertagging is a task that learns a sequence prediction task from this annotated data and is able to then assign the most likely sequence of supertags to an input sequence of words (Bangalore and Joshi, 1999) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-10",
"text": "Once the right supertag is assigned then parsing is a much easier task and may not even be needed for many applications where information about syntax is needed but a full parse is unnecessary."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-11",
"text": "Supertagging has been shown to be useful for both Tree Adjoining Grammar (TAG) (Bangalore and Joshi, 1999) and combinatory categorial grammar (CCG) (Hockenmaier and Steedman, 2007) parsing."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-12",
"text": "In this paper we aim to improve the state-of-the-art for the task of learning a TAG supertagger from an annotated treebank (Kasai et al., 2018) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-180",
"text": "We experiment with pre-trained GloVe word embeddings of three different sizes: 100, 200 and 300."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-13",
"text": "We observe that supertag prediction does not take full advantage of the complex structural information contained within each supertag."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-14",
"text": "Neural models have been used to learn embeddings over these supertags and thereby share weights among similar supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-15",
"text": "Friedman et al. (2017) provide tree-structured neural models over supertags which can learn interesting relationships between supertags but the approach does not lead to higher supertagging accuracy."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-16",
"text": "Our main contribution is to provide several novel ways to deconstruct supertags to create multiple alternative auxiliary tasks, which we then combine using a multi-task prediction framework and we show that this can lead to a significant improvement in supertagging accuracy."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-17",
"text": "Multi-task learning (MTL) (Caruana, 1997 ) learns multiple heterogenous tasks in parallel with a shared representation so that what is learned for one task can be shared for another task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-18",
"text": "In most cases the improvement is due to weight sharing between different tasks (Collobert and Weston, 2008; Luong et al., 2015) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-19",
"text": "While some combinations may not provide any benefit in MTL (Bingel and S\u00f8gaard, 2017) and the improvements might be simply due to training on more data."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-20",
"text": "However, MTL can be effective even when using large pretrained models (Liu et al., 2019) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-21",
"text": "Unlike most other work in multi-task learning with neural models we do not use different annotated datasets for each task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-22",
"text": "Similar to the approach to combining different representations for phrase structure parsing in (Vilares et al., 2019) we also construct multiple tasks from exactly the same training data set."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-23",
"text": "Our approach is also distinct in that we take advantage of the structure of the supertags by deconstructing the tree structure implicit in each supertag."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-24",
"text": "Our experimental results show that our novel multi-task learning framework leads to a new state-of-the-art accuracy score of 91.39% for TAG supertagging on the Penn Treebank dataset (Marcus et al., 1993; Chen et al., 2006) which is a significant improvement over the previous multi-task result for supertagging that combines supertagging with graph-based parsing (Kasai et al., 2018) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-26",
"text": "**THE SUPERTAGGING TASK**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-27",
"text": "Supertagging assigns complex structural descriptions to each word in the sentence."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-28",
"text": "The complex structural descriptions come from grammar formalisms that are more expressive than context-free grammars for phrase structure trees or dependency trees."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-29",
"text": "In Tree Adjoining Grammar (TAG), the supertags are tree fragments that can express various syntactic facts such as transitive verb, wh-extraction, relative clauses, appositive clauses, light verbs, prepositional phrase attachment and many other syntactic phenomena."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-30",
"text": "In combinatory categorial grammar (CCG) the supertags are types and their type-raised variants which also capture similar syntactic phenomena as in TAG supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-31",
"text": "Supertagging can be viewed as \"almost parsing\" (Bangalore and Joshi, 1999) and can provide the benefits of syntactic parsing without a full parser."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-32",
"text": "In this paper we focus on the TAG supertagging task, however, our proposed methods can likely be used to improve CCG supertagging as well."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-33",
"text": "Supertagging is a relatively simple linear time sequence prediction task similar to part of speech tagging."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-34",
"text": "Supertagging can be useful in many applications such as machine translation, grammatical error detection, disfluency prediction, and many others while being a much simpler task than full parsing."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-35",
"text": "In addition, for both TAG and CCG, supertagging is an essential first step to parsing so any improvements in supertag prediction will benefit parsing as well."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-36",
"text": "For all these reasons, in this paper we focus on the supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-37",
"text": "TAG and CCG can be parsed using graph-parsing methods in O(n 3 ) but the complexity of unrestricted parsing for both formalisms is O(n 6 ) which is prohibitive on real-world data."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-38",
"text": "Neural linear-time transition based parsers are still not accurate enough to compete with the state-of-the-art supertagging models or parsers that use supertagging as the initial step (Chung et al., 2016; Kasai et al., 2018) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-39",
"text": "An example of the supertagging task for Tree Adjoining Grammars (TAGs) is shown in Fig. 1 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-40",
"text": "The \u2193 symbol on a leaf node represents a substitu-tion node which can be expanded by a tree rooted in the same label, e.g. t3 rooted in NP substitutes into the NP\u2193 node in t46."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-41",
"text": "The * symbol on the leaf node of a tree t represents an adjunction node (also called a footnode) and signifies that t can be inserted into an internal node of another tree with the same label, e.g. t103 adjoins into the AP node in t46."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-42",
"text": "The node is called the head and represents the node where the word token is inserted into the tree."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-43",
"text": "The table on the right shows how many different supertags are possible for each word in the sentence."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-44",
"text": "Three factors make supertagging a challenging task for sequence prediction: much more severe token level ambiguity when compared to other like part-of-speech tagging, a large number of distinct supertag types (4727 distinct supertags in our dataset, including an unknown supertag) and a complex internal structure for each supertag."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-46",
"text": "**BASELINE SUPERTAGGING MODEL**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-47",
"text": "For our baseline supertagging model we use the state-of-the-art model that currently has the highest accuracy on the Penn treebank dataset (Kasai et al., 2018) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-48",
"text": "For the supertagging model the main contribution of Kasai et al. (2018) was two-fold: the first was to add a character CNN for modeling word embeddings using subword features, and the second was to add highway connections to add more layers to a standard bidirectional LSTM."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-49",
"text": "The output layer was a standard multi-layer perceptron that had a softmax output over the set of supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-50",
"text": "Another extension to the standard sequence prediction model in Kasai et al. (2018) was to combine supertagging with graph-based parsing."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-51",
"text": "In this paper, we focus on the supertagging model and compare only on supertagging accuracy."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-52",
"text": "The neural model for supertagging that we use as a baseline uses graph-based parsing as an auxiliary task and has the current highest accuracy score on the Penn treebank (90.81%)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-53",
"text": "The model has three main components: the input layer, the bidirectional LSTM component, and the output layer which computes a softmax over the set of supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-54",
"text": "The input to the model is a sequence of words and the output is a sequence of supertags, one per word, which makes it a standard tagging aka sequence prediction task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-56",
"text": "**INPUT LAYER**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-57",
"text": "Each word in the input sequence is converted into a word embedding in the input layer."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-60",
"text": "**TOKEN**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-61",
"text": "#supertags The 5 answer 14 seems 20 perfectly 5 clear 32 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-62",
"text": "16 Figure 1 : An example that explains the supertagging task for Tree Adjoining Grammars (TAGs)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-63",
"text": "For the sentence \"The answer seems perfectly clear .\" the correct supertag for each word is shown above."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-64",
"text": "The table on the right shows how many different supertags are possible for each word in the sentence."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-65",
"text": "See Section 2 for more details on the notation used to define the supertags and how the supertags can be combined to form a parse tree."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-66",
"text": "(Kasai et al., 2018) we use two components in the word embedding:"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-67",
"text": "\u2022 a 30-dimensional character level embedding vector computed using a char-CNN which captures the morphological information (Santos and Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016; Kasai et al., 2018) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-68",
"text": "Each character is encoded as a 30-dimensional vector, and then we apply 30 convolutional filters with a window size of 5."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-69",
"text": "This produces a 30-dimensional character embedding."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-70",
"text": "\u2022 a 100/200/300 size word embedding which is initialized using GloVe (Pennington et al., 2014) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-71",
"text": "For words that do not appear in GloVe, we randomly initialized the word embedding."
},
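The char-CNN component described above can be sketched in plain numpy (a toy illustration with random, untrained weights, not the paper's implementation; the dimensions follow the text: 30-dimensional character vectors, 30 convolutional filters of width 5, max-pooling over positions):

```python
import numpy as np

def char_cnn_embedding(word, char_dim=30, n_filters=30, window=5, seed=0):
    """Toy char-CNN: embed each character, apply 1-D convolution filters
    of width `window`, then max-pool over positions -> n_filters vector."""
    rng = np.random.default_rng(seed)
    char_vocab = {c: i for i, c in enumerate(sorted(set(word)))}
    emb = rng.standard_normal((len(char_vocab), char_dim))     # char embeddings
    filt = rng.standard_normal((n_filters, window, char_dim))  # conv filters
    x = emb[[char_vocab[c] for c in word]]
    if len(x) < window:                     # pad short words to one full window
        x = np.pad(x, ((0, window - len(x)), (0, 0)))
    positions = len(x) - window + 1
    # Convolve: one dot product per (filter, position) pair.
    conv = np.array([[np.sum(filt[f] * x[p:p + window]) for p in range(positions)]
                     for f in range(n_filters)])
    return conv.max(axis=1)                 # max-pool over positions

vec = char_cnn_embedding("answer")
print(vec.shape)  # (30,)
```

In the real model the character embeddings and filters are learned jointly with the rest of the network; here they are random, so only the shapes are meaningful.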
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-72",
"text": "A start of sentence token and an end of sentence token is added into the beginning and ending position of each sentence, but is not included in the computation of loss and accuracy."
},
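The GloVe initialization with a random fallback for out-of-vocabulary words, as described above, might look like this (the vectors below are toy stand-ins, not real GloVe data):

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim=100, seed=0):
    """Initialize word embeddings from a pre-trained table; words missing
    from the table get a small random initialization instead."""
    rng = np.random.default_rng(seed)
    mat = np.zeros((len(vocab), dim))
    for i, w in enumerate(vocab):
        mat[i] = pretrained.get(w, rng.standard_normal(dim) * 0.1)
    return mat

# Toy stand-in for the GloVe table; "supertag" is out-of-vocabulary here.
glove = {"the": np.ones(100), "answer": np.full(100, 2.0)}
mat = build_embedding_matrix(["the", "answer", "supertag"], glove)
print(mat.shape)  # (3, 100)
```

The whole matrix, including the randomly initialized rows, is then fine-tuned during training along with the rest of the model.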
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-73",
"text": "Unlike (Kasai et al., 2018) we do not use predicted part of speech (POS) tags as part of the input sequence."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-74",
"text": "In our experiments, the improvement was negligible and there was a significant overhead of having to do part of speech predictions at test time."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-76",
"text": "**BILSTM LAYER**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-77",
"text": "The core of this base model is a bidirectional recurrent neural network, in particular a Long Short-Term Memory neural network (Graves and Schmidhuber, 2005) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-78",
"text": "For the hyperparameters, we use the settings in Kasai et al. (2018) in order to ensure a fair comparison."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-79",
"text": "Unlike (Kasai et al., 2018) we do not use highway connections in our model."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-80",
"text": "We did experiment with the addition of highway connections but we found no improvement in accuracy over the baseline BiLSTM-only model with a significant increase in training time."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-81",
"text": "The bidirectional representation has 1024 units, a combination of the 512 forward and backward units each."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-82",
"text": "Dropout layers (Gal and Ghahramani, 2016; Srivastava et al., 2014) are inserted between the input and BiLSTM layer, between BiLSTM layers, and between recurrent time steps."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-83",
"text": "The dropout rate used was 0.5."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-84",
"text": "We used 2-3 BiLSTM layers."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-85",
"text": "Kasai et al. (2018) provide some reasons why > 3 layers do not provide any additional accuracy even with highway connections."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-87",
"text": "**OUTPUT LAYER**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-88",
"text": "We concatenate hidden vectors from both directions of the last layer of BiLSTM and pass it into a multilayer perceptron (MLP)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-89",
"text": "In practice a single layer perceptron performs just as well in this task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-90",
"text": "The number of input neurons of the single layer perceptron equals 1024 (2 \u00d7 512) and the output vector size equals the number of labels for each specific task: 4727 for the main supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-92",
"text": "**DECONSTRUCTING SUPERTAGS**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-93",
"text": "The error analysis of our baseline BiLSTM model is shown in Fig. 1 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-94",
"text": "We observed some consistent ways in which the baseline model confused the correct supertag with the incorrect one."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-95",
"text": "We also observed that the baseline BiLSTM model can achieve over 97% 3-best accuracy on the supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-96",
"text": "This means it should be possible to boost the accuracy by rescoring the alternatives that already exist in the n-best output of the baseline supertagger."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-97",
"text": "Rather than a re-ranking frame- work we used a multi-task learning framework in order to boost the scores of correct supertags over the error-prone supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-98",
"text": "The auxiliary tasks we created based on our error analysis are as follows."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-100",
"text": "**AUXILIARY TASKS**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-102",
"text": "**HEAD**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-103",
"text": "Consider the trees t2 and t36 in Table 1 . t2 is headed by a noun head N and t36 is headed by an adjective A. The label of the head node is a useful auxiliary task for disambiguation."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-104",
"text": "We define a function HEAD(t) to get the head node (marked by a diamond) of supertag t. There are 29 distinct HEAD labels."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-106",
"text": "**ROOT**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-107",
"text": "Consider the trees t4 and t13 in Table 1 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-108",
"text": "t4 modifies an NP node while t13 modifies a VP node."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-109",
"text": "This is a case of preposition attachment ambiguity."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-110",
"text": "The label of the root node is a useful auxiliary task for disambiguation."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-111",
"text": "We define a function ROOT(t) to get the root node of supertag t. There are 48 distinct ROOT labels."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-113",
"text": "**TYPE**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-114",
"text": "Consider the trees tCO and t13 in Table 1 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-115",
"text": "tCO is a supertag that does not use adjunction (this type of supertag is called an initial tree)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-116",
"text": "In contrast, t13 modifies an internal VP node in another supertag (this type of supertag is called an auxiliary tree)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-117",
"text": "In addition a left auxiliary tree modifies from the left while a right auxiliary tree modifies from the right."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-118",
"text": "To make this task more sensitive we also include the node label of the root (for initial trees) or footnode which is the node marked with * (for left/right auxiliary trees)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-119",
"text": "We define a function TYPE(t) to obtain the type of each supertag."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-120",
"text": "There are 67 distinct types."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-122",
"text": "**SKETCH**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-123",
"text": "In many cases, the overall shape of the supertag is useful for disambiguation, ignoring the node labels."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-124",
"text": "The following example keeps the tree structure of the supertag but removes the node labels:"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-125",
"text": "Tree sketches help disambiguation (see t81 in Table 5)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-126",
"text": "We define a function SKETCH(t) that returns the sketch."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-127",
"text": "There are 602 distinct supertag sketches."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-129",
"text": "**SPINE**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-130",
"text": "The spine of a supertag is the path from the root node to the head node (marked by )."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-131",
"text": "The following example keeps only the path from root to head and produces a spine supertag: Table 5 )."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-132",
"text": "We use a function SPINE(t) to return the spine of supertag t. There are 1372 distinct supertag spines."
},
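The five deconstruction functions can be sketched over a toy tree encoding. Here supertags are nested (label, children) tuples, a leaf label ending in "*" is the foot node, and a leaf ending in "<>" is the head; this encoding and the left/right side convention are illustrative assumptions, not the dataset's actual format:

```python
def leaves(t):
    label, kids = t
    return [label] if not kids else [l for k in kids for l in leaves(k)]

def ROOT(t):
    return t[0]                              # label of the root node

def HEAD(t):                                 # label of the head leaf
    return next(l for l in leaves(t) if l.endswith("<>"))[:-2]

def TYPE(t):
    lvs = leaves(t)
    feet = [l for l in lvs if l.endswith("*")]
    if not feet:                             # no foot node: initial tree
        return "Init+" + ROOT(t)
    side = "Left" if lvs[-1].endswith("*") else "Right"  # side convention assumed
    return side + "+" + feet[0][:-1]

def SKETCH(t):                               # structure only, labels erased
    label, kids = t
    return ("x", [SKETCH(k) for k in kids])

def SPINE(t):                                # path of labels from root to head
    label, kids = t
    if not kids:
        return [label[:-2]] if label.endswith("<>") else None
    for k in kids:
        path = SPINE(k)
        if path is not None:
            return [label] + path
    return None

t2  = ("NP", [("N<>", [])])   # headed by a noun
t36 = ("NP", [("A<>", [])])   # headed by an adjective
print(HEAD(t2), HEAD(t36), SKETCH(t2) == SKETCH(t36))  # N A True
```

Note how t2 and t36 share ROOT, TYPE and SKETCH and differ only in HEAD, mirroring the confusion pair discussed in the error analysis later.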
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-134",
"text": "**MULTI-TASK FRAMEWORK**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-135",
"text": "Unlike most other work in multi-task learning with neural models we do not use different datasets for each task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-136",
"text": "We use exactly the same training data set but we construct multiple tasks with alternate output labels by automatically deconstructing the supertags (the output labels in the original task)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-137",
"text": "These alternate output labels are easier to predict than the full set of supertags, and these new output labels are related to the original supertag in a linguistically relevant way."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-138",
"text": "As a result, we train on the same training set but with alternate output labels, each forming a different task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-139",
"text": "We then combine these multiple tasks in order to improve the performance in the original supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-140",
"text": "The usual criticism of a fair comparison between multi-task and single-task learning is that the multi-task setting simply uses more labeled data instances (typically with different data sources) and as a result a fair comparison between a multi-task and a single-task setting should involve large pre-trained models trained using a language modelling objective (such as ELMO (Peters et al., 2018) or BERT (Devlin et al., 2018) )."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-141",
"text": "In our case, because we re-use the same training set for multi-task learning, we have made sure our experimental settings exactly match the previous best state-of-the-art method for supertagging (Kasai et al., 2018) and we use the same pre-trained word embeddings to ensure a fair comparison."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-142",
"text": "We train six different neural sequence prediction models independently on the supertagging task, root node prediction (ROOT), head node prediction (HEAD), tree type prediction (TYPE), tree sketch prediction (SKETCH) and tree spine prediction (SPINE) tasks."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-143",
"text": "For each task, we use the state-of-the-art baseline supertagging model as defined in Section 3."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-144",
"text": "The only change is that the output size for softmax is changed to reflect the number of output labels in each task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-145",
"text": "We obtain very high accuracies for each of the tasks."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-146",
"text": "For example, on the dev set we obtain the following accuracies: ROOT = 97.04%, HEAD = 93.37%, TYPE = 93.14%, SKETCH = 93.74% and SPINE = 91.00%."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-147",
"text": "We train the model, including the word embedding (which is initialized using a pre-trained embedding) and character-level CNNs by optimizing the negative log-likelihood of the predicted sequences of output labels."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-148",
"text": "The output labels for each task is different: supertag, root node, head node, tree type, sketch, spine."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-149",
"text": "Training is done using minibatches."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-150",
"text": "The main hyperparameters are as follows: we use the ADAM optimizer with a batch size of 100 and learning rate = 0.001 (Kingma and Ba, 2015)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-151",
"text": "After every training epoch, we evaluate the model on the dev set, if the accuracy on dev set has not been improved for five consecutive epochs, training stops."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-152",
"text": "The maximum number of epochs is 70."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-153",
"text": "After obtaining the best model trained with = 0.001, we further fine-tune the best model using = 0.0001 for at most 10 more epochs."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-154",
"text": "By conducting this step, we have seen 0.1% to 0.2% accuracy improvement depending on the task."
},
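The training schedule above (patience of five epochs, at most 70 epochs at learning rate 0.001, then up to 10 fine-tuning epochs at 0.0001) can be sketched as follows; train_epoch and evaluate are hypothetical stand-ins for the actual training and dev-evaluation steps, and checkpoint saving/reloading of the best model is omitted:

```python
def fit(train_epoch, evaluate, max_epochs=70, patience=5, ft_epochs=10):
    """Early stopping on dev accuracy, then fine-tuning at a lower LR."""
    best, since_best = -1.0, 0
    for _ in range(max_epochs):
        train_epoch(lr=0.001)
        acc = evaluate()
        if acc > best:
            best, since_best = acc, 0
        else:
            since_best += 1
            if since_best >= patience:  # no dev improvement for 5 epochs
                break
    for _ in range(ft_epochs):          # further fine-tune at lr = 0.0001
        train_epoch(lr=0.0001)
        best = max(best, evaluate())
    return best
```

The same schedule is reused for every task; only the output layer of the model differs between tasks.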
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-155",
"text": "After obtaining the best trained model on each of the multiple tasks we combine the multiple tasks together in order to create a decoder for the supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-181",
"text": "With our multi-task approach, all base models gain significant improvements compared to a single supertagging base model between 0.4% to 0.65%."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-156",
"text": "We first run the baseline supertagger to obtain the distribution P STAG and using this distribution we select the top-K output supertags for each word in each sentence in the dev or test data."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-157",
"text": "We experiment with different values of K but we know that even K=3 gives 97% accuracy for the supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-158",
"text": "For each dev or test sentence we also compute the output softmax distributions for each task, P HEAD , P ROOT , P TYPE , P SKETCH , P SPINE ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-159",
"text": "Each of these probabilities are defined as a sequence prediction task over the auxiliary tasks using the functions defined in Section 4.1."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-160",
"text": "P HEAD (t) = P (HEAD(t)) P ROOT (t) = P (ROOT(t)) P TYPE (t) = P (TYPE(t)) P SKETCH (t) = P (SKETCH(t)) P SPINE (t) = P (SPINE(t))"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-161",
"text": "We compute the argmax sequence of supertags t * 1 , t * 2 , . ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-162",
"text": ". , t * T by scoring each supertag t * i individ-ually from the top-K list by combining the probabilities from the different tasks as follows:"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-163",
"text": "S is the top-K set of supertags for each word in the input sequence."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-164",
"text": "The hyperparameters \u03b1 i can be tuned."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-165",
"text": "However we found in our experiments that the results were not very sensitive to the values, and the uniform distribution over all the tasks performed the best."
},
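The extract does not show the exact combination formula, so the following is a plausible sketch assuming the score of each candidate supertag is a weighted sum of log-probabilities from the supertagger and the auxiliary tasks (uniform weights by default, matching the finding above); all names here are illustrative:

```python
import math

def decode_word(topk, p_stag, aux_dists, deconstruct, alphas=None):
    """Pick the best supertag from the top-K candidates by combining the
    supertagger's probability with each auxiliary task's probability of
    the candidate's deconstructed label (uniform weights by default)."""
    tasks = list(aux_dists)
    alphas = alphas or {t: 1.0 for t in ["STAG"] + tasks}
    def score(t):
        s = alphas["STAG"] * math.log(p_stag[t])
        for task in tasks:
            s += alphas[task] * math.log(aux_dists[task][deconstruct[task](t)])
        return s
    return max(topk, key=score)

# Toy example: the supertagger slightly prefers t36, but the HEAD task is
# confident the head is a noun, so decoding flips the choice to t2.
p_stag = {"t2": 0.45, "t36": 0.55}
aux = {"HEAD": {"N": 0.9, "A": 0.1}}
dec = {"HEAD": {"t2": "N", "t36": "A"}.get}
print(decode_word(["t2", "t36"], p_stag, aux, dec))  # t2
```

Each word is decoded independently over its own top-K set, so the whole sentence is decoded in one linear pass.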
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-166",
"text": "The model and decoding step for our multi-task model is shown in Fig. 2 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-167",
"text": "We also experiment with a commonly used multi-task model where some or all of the components are shared between the different (unlike our approach).."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-169",
"text": "**DATASET**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-170",
"text": "We use the dataset that has been widely used by previous work in supertagging and TAG parsing (Bangalore et al., 2009; Chung et al., 2016; Friedman et al., 2017; Kasai et al., , 2018 ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-171",
"text": "We use the grammar and the TAG-annotated WSJ Penn Tree Bank extracted by Chen et al. (2006) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-172",
"text": "As in previous work, we use Sections 01-22 as the training set, Section 00 as the dev set, and Section 23 as the test set."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-173",
"text": "The training, dev, and test sets comprise 39832, 1921, and 2415 sentences; 950028, 46451, 56683 tokens, respectively."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-174",
"text": "The TAG-annotated version of Penn treebank (Chen and Shankar, 2001) includes 4727 distinct supertags (including an unknown supertag) and the grammar file of all supertags is downloaded from http://mica.lif.univ-mrs.fr/. There are 69 auxiliary tree TYPEs, 40 distinct types of ROOT node and 30 different types of HEAD node, 602 tree SKETCHes and 1372 tree SPINEs."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-176",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-177",
"text": "For our experiments, we implemented all of the models we discussed above in PyTorch (Paszke et al., 2017) ."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-178",
"text": "We have various hyperparameters and Table 2 shows the results obtained from the different model configurations which were described in Section 3."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-179",
"text": "The table also includes the results from the multi-task model and decoder described in Section 4."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-182",
"text": "We also varied the parameter K which picks the top-K supertags from the baseline model for use with the multi-task model."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-183",
"text": "Table 3 that increasing K helps up to a point."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-184",
"text": "After K=10 there is no further improvement."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-185",
"text": "We obtain a new state-of-the-art result of 91.39% which is significantly better than the 90.81% result which combines supertagging with the parsing task and so is using more labeled training information used by our supertagger models."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-186",
"text": "Table 4 shows the result of task ablation for each task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-187",
"text": "We can see that adding a new task always improves the results."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-188",
"text": "The best result is obtained by using all five auxiliary tasks."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-189",
"text": "We computed a significance score on the accuracy of our best model BiL-STM3+CNN+GloVe200 with and without multi-task learning."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-190",
"text": "On the dev set, using McNemar's significance test we found that the multi-task model is significantly better than the baseline model with a p-value of 0.0062; on the test set, the p-value is 0.0064."
},
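McNemar's test compares the two models on the items where they disagree; a stdlib-only sketch using the chi-square approximation with continuity correction (the counts in the example are made up, not the paper's):

```python
import math

def mcnemar(b, c):
    """McNemar chi-square test with continuity correction.
    b: tokens model A gets right and model B gets wrong; c: the reverse.
    Requires b + c > 0. For 1 degree of freedom the chi-square survival
    function is sf(x) = erfc(sqrt(x / 2)), so no scipy is needed."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: the multi-task model fixes 120 tokens and breaks 70.
print(mcnemar(120, 70))
```

For small disagreement counts the exact binomial version of the test is preferable to this chi-square approximation.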
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-191",
"text": "We evaluated our own implementation of the baseline BiLSTM-only model and even with highway connections we only obtained 89.25% on the dev set compared to the built-in BiLSTM implementation in Pytorch (without highway connections) which obtains 89.94%."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-192",
"text": "Table 5 shows some examples about how each of auxiliary tasks can help in the correction of supertag prediction."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-193",
"text": "Examples of each task are selected if a considerable number of predictions of each example are corrected after applying the multi-task model."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-194",
"text": "While the multi-task model can correct many wrong predictions made by the baseline model, the multi-task model may also override some correct predictions."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-196",
"text": "**TASK CONTRIBUTION**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-197",
"text": "The first row is an example of the prediction of head node that helps differentiate two similar supertags, t2 and t36."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-198",
"text": "In the dev set, there are 24 words of which ground truth supertags are t2, wrongly predicted as t36 by a single base model; 25 words of which ground truth supertags are t36, wrongly predicted as t2."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-199",
"text": "All of those words are Kasai et al. (2018) refers to highway connections, and POS refers to the use of predicted part-of-speech tags as inputs."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-200",
"text": "We do not use HW or POS in our models as they do not provide any benefit."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-201",
"text": "correctly predicted by the multi-task model."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-202",
"text": "The ROOT, TYPE, SKETCH and SPINE are all the same for t2 and t36, the only difference is the HEAD value, N for t2 and A for t36."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-203",
"text": "The model for the HEAD task correctly predicts the head node of those words which is further improved using our multi-task approach."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-204",
"text": "The second row demonstrates how the tree sketch can help discriminate supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-205",
"text": "t81 and t27 have exactly the same ROOT, HEAD, SPINE (S-VP-V) and TYPE (Init), the only difference between these two supertags is the tree structure."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-206",
"text": "The third to fifth rows are examples of the effect of multiple auxiliary tasks in getting the prediction right."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-207",
"text": "The third row is an example of the prediction of TYPE and SKETCH that can help differentiate supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-208",
"text": "The TYPE of t3 is Init, while t38 has TYPE Left+NP."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-209",
"text": "They also have dif-ferent tree sketches."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-210",
"text": "There are 11 words of which supertags are wrongly predicted as t3 by a single supertagging model, but correctly predicted as t38 by the multi-task model; also, 3 words of which supertags are wrongly predicted as t38 by a single supertagging model, but correctly predicted as t3 by the multi-task model."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-211",
"text": "The forth row is an example of how the prediction of the ROOT can help differentiate supertags."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-212",
"text": "The ROOT of t3 is NP, while t18 has ROOT N (N is also its head node)."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-213",
"text": "For the last row, t132 and t20 have the same root node(S), head node(Punct) and tree type (Right+S) but they are different in the tree spine (S-Punct for t20 and S-PRN-Punct for t132) and SKETCH."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-214",
"text": "The joint effort of various models plays a significant role in getting the prediction right."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-215",
"text": "Bangalore et al. (2009) and Chung et al. (2016) trained a feature based classification model for TAG supertags, that extract features using lexical, part-of-speech attributes from the left and right context in a 6-word window and the lexical, orthographic (e.g. capitalization, prefix, suffix, digit) and part-of-speech attributes of the word being supertagged."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-216",
"text": "Neural network based supertagging models in TAG (Kasai et al., 2018) and CCG (Xu Lewis et al., 2016; Xu, 2016; Vaswani et al., 2016) have shown substantial improvement in performance, but the supertagging models are all quite similar as they all use a bi-directional RNN feeding into a prediction layer."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-217",
"text": "Structural features of supertags are heavily used in pre-neural statistical parsing methods (Bangalore et al., 2009 ) and proved to be useful."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-218",
"text": "The use of supertag structure was explored in (Friedman et al., 2017) where they adopt grammar features into a tree-structured neural model over the supertags but this model was unable to beat the state-of-the-art."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-219",
"text": "(Kasai et al., 2018) combines supertagging with parsing which does provide state-of-the-art accuracy but at the expense of computational complexity."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-220",
"text": "extends the BiLSTM model with predicted part-of-speech tags and suffix embeddings as inputs, then Kasai et al. (2018) further extends the BiLSTM model with highway connection as well as character CNN as input, and jointly train the supertagging model with parsing model and this work had the state-of-the-art accuracy before our paper on the Penn treebank dataset."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-221",
"text": "Friedman et al. (2017) investigated a recursive treebased vector representation of TAG supertags, but while their model can learn useful facts about supertags, about how one can be related to another, there was no performance improvement as a result of their model on the supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-222",
"text": "Xu et al. (2015) uses RNN for the CCG supertagging task, Lewis et al. (2016) adopted the LSTM structure into this task, while Vaswani et al. (2016) also introduced another variation of Bi-LSTM into this task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-223",
"text": "Xu (2016) then proposed an attention-based Bi-LSTM supertagging model."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-224",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-225",
"text": "**RELATED WORK**"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-226",
"text": "----------------------------------"
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-228",
"text": "In this paper we have introduced a novel multitask framework for the TAG supertagging task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-229",
"text": "The approach involved a novel multi-task learning framework which led to a new state-of-the-art accuracy score of 91.39% for TAG supertagging on the Penn treebank dataset."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-230",
"text": "Our multi-task prediction framework is trained over the exactly same training data used to train the original supertagger where each auxiliary task provides an alternative view on the original pre-diction task."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-231",
"text": "In the future we would like to explore further tasks to integrate into our multi-task sequence prediction framework."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-232",
"text": "We also believe that the idea of our multi-task framework can be applied into similar tasks such as CCG supertagging task of which the labels themselves contains the latent information."
},
{
"sent_id": "0c3f9588b6f587d04c286384ca24e0-C001-233",
"text": "We would also like to investigate how to semi-automatically generate new tasks which can be of further help in the multi-task setting."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"0c3f9588b6f587d04c286384ca24e0-C001-12"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-47",
"0c3f9588b6f587d04c286384ca24e0-C001-48"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-66",
"0c3f9588b6f587d04c286384ca24e0-C001-67"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-78"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-141"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-170"
]
],
"cite_sentences": [
"0c3f9588b6f587d04c286384ca24e0-C001-12",
"0c3f9588b6f587d04c286384ca24e0-C001-47",
"0c3f9588b6f587d04c286384ca24e0-C001-48",
"0c3f9588b6f587d04c286384ca24e0-C001-66",
"0c3f9588b6f587d04c286384ca24e0-C001-67",
"0c3f9588b6f587d04c286384ca24e0-C001-78",
"0c3f9588b6f587d04c286384ca24e0-C001-141",
"0c3f9588b6f587d04c286384ca24e0-C001-170"
]
},
"@DIF@": {
"gold_contexts": [
[
"0c3f9588b6f587d04c286384ca24e0-C001-24"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-73"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-79"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-199",
"0c3f9588b6f587d04c286384ca24e0-C001-200"
]
],
"cite_sentences": [
"0c3f9588b6f587d04c286384ca24e0-C001-24",
"0c3f9588b6f587d04c286384ca24e0-C001-73",
"0c3f9588b6f587d04c286384ca24e0-C001-79",
"0c3f9588b6f587d04c286384ca24e0-C001-199"
]
},
"@BACK@": {
"gold_contexts": [
[
"0c3f9588b6f587d04c286384ca24e0-C001-38"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-216"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-219"
],
[
"0c3f9588b6f587d04c286384ca24e0-C001-220"
]
],
"cite_sentences": [
"0c3f9588b6f587d04c286384ca24e0-C001-38",
"0c3f9588b6f587d04c286384ca24e0-C001-216",
"0c3f9588b6f587d04c286384ca24e0-C001-219",
"0c3f9588b6f587d04c286384ca24e0-C001-220"
]
},
"@EXT@": {
"gold_contexts": [
[
"0c3f9588b6f587d04c286384ca24e0-C001-50"
]
],
"cite_sentences": [
"0c3f9588b6f587d04c286384ca24e0-C001-50"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"0c3f9588b6f587d04c286384ca24e0-C001-199"
]
],
"cite_sentences": [
"0c3f9588b6f587d04c286384ca24e0-C001-199"
]
}
}
},
"ABC_33096f1e855d23046cb4cbfe95eef0_3": {
"x": [
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-2",
"text": "Abstract."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-3",
"text": "Image captioning, an open research issue, has been evolved with the progress of deep neural networks."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-4",
"text": "Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are employed to compute image features and generate natural language descriptions in the research."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-5",
"text": "In previous works, a caption involving semantic description can be generated by applying additional information into the RNNs."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-6",
"text": "In this approach, we propose a distinctive-attribute extraction (DaE) which explicitly encourages significant meanings to generate an accurate caption describing the overall meaning of the image with their unique situation."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-7",
"text": "Specifically, the captions of training images are analyzed by term frequency-inverse document frequency (TF-IDF), and the analyzed semantic information is trained to extract distinctive-attributes for inferring captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-8",
"text": "The proposed scheme is evaluated on a challenge data, and it improves an objective performance while describing images in more detail."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-11",
"text": "Automatically to describe or explain the overall situation of an image, an image captioning scheme is a very powerful and effective tool [1, 2, 3] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-12",
"text": "The issue is an open research area in computer vision and machine learning [1, 2, 3, 4, 5, 6] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-13",
"text": "In recent years, recurrent neural networks (RNNs) implemented by long short-term memory (LSTM) especially show good performances in sequence data processing and they are widely used as decoders to generate a natural language description from an image in many methods [3, 4, 5, 6, 7] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-14",
"text": "High-performance approaches on convolutional neural networks (CNNs) have been proposed [8, 9] , which are employed to represent the input image with a feature vector for the caption generation [3, 4, 5] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-15",
"text": "Additionally, an attention representation that reflects the human visual system has been applied to obtain salient features from an entire image [3] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-16",
"text": "The approach adopted in previous work provides different weights in an image effectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-17",
"text": "High-level semantic concepts of the image are effective to describe a unique situation and a relation between objects in an image [4, 10] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-18",
"text": "Extracting specific arXiv:1807.09434v1 [cs.CV] 25 Jul 2018 semantic concepts encoded in an image, and applying them into RNN network has improved the performance significantly [4] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-19",
"text": "Detecting semantic attributes are a critical part because the high-level semantic information has a considerable effect on the performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-20",
"text": "A recent work applied contrastive learning scheme into image captioning to generate distinctive descriptions of images [5] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-21",
"text": "In this paper, we propose a Distinctive-attribute Extraction (DaE) which explicitly encourages semantically unique information to generate a caption that describes a significant meaning of an image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-22",
"text": "Specifically, it employs term frequency-inverse document frequency (TF-IDF) scheme [11] to evaluate a semantic weight of each word in training captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-23",
"text": "The distinctive-attributes of images are predicted by a model trained with the semantic information, and then they are applied into RNNs to generate descriptions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-24",
"text": "The main contributions of this paper are as follows: (i) We propose the semantics extraction method by using the TF-IDF caption analysis."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-25",
"text": "(ii) We propose a scheme to compute distinctive-attribute by the model trained with semantic information."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-26",
"text": "(iii) We perform quantitative and qualitative evaluations, demonstrating that the proposed method improves the performance of a base caption generation model by a substantial margin while describing images more distinctively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-27",
"text": "This manuscript is organized as follows: In Section 2, the related schemes are explained."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-28",
"text": "The proposed scheme and its implementation are described in Section 3, and the experimental results are compared and analyzed in Section 4."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-29",
"text": "Finally, in Section 5, the algorithm is summarized, and a conclusion and discussions are presented."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-31",
"text": "**RELATED WORK**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-32",
"text": "Combinations of CNNs and RNNs have been widely used for the image captioning networks [1, 2, 3, 4, 12, 13] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-33",
"text": "An end-to-end neural network consisting of a vision CNN followed by a language generating RNN was proposed [1] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-34",
"text": "CNN was used as an image encoder, and an output of its last hidden layer is fed into the RNN decoder that generates sentences."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-35",
"text": "Donahue et al. [2] proposed Long-term Recurrent Convolutional Networks(LRCN), which can be employed to visual time-series modeling such as generation of description."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-36",
"text": "LRCN also used outputs of a CNN as LSTM inputs, which finally produced a description."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-37",
"text": "Recent approaches can be grouped into two paradigms."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-38",
"text": "Top-down includes attention-based mechanisms, and many of the bottom-up methods used semantic concepts."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-39",
"text": "As approaches using the attention, Xu et al. [3] introduced an attention-based captioning model, which can attend to salient parts of an image while generating captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-40",
"text": "Liu et al. [6] tried to correct attention maps by human judged region maps."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-41",
"text": "Different levels of correction were made dependent on an alignment between attention map and the ground truth region."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-42",
"text": "Some other works extracted semantic information and applied them as additional inputs to the image captioning networks."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-43",
"text": "Fang et al. [12] used Multiple Instance Learning (MIL) to train word detectors with words that commonly occur in captions, includ-ing nouns, verbs, and adjectives."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-44",
"text": "The word detector outputs guided a language model to generate description to include the detected words."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-45",
"text": "Wu et al. [13] also clarified the effect of the high-level semantic information in visual to language problems such as the image captioning and the visual question answering."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-46",
"text": "They predicted attributes by treating the problem as a multi-label classification."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-47",
"text": "The CNN framework was used, and outputs from different proposal sub-regions are aggregated."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-48",
"text": "Gan et al. [4] proposed Semantic Concept Network (SCN) integrating semantic concept to a LSTM network."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-49",
"text": "SCN factorized each weight matrix of the attribute integrated the LSTM model to reduce the number of parameters."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-50",
"text": "We employed SCN-LSTM as a language generator to verify the effectiveness of our method."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-51",
"text": "More recently, Dai et al. [5] studied the distinctive aspects of the image description that had been overlooked in previous studies."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-52",
"text": "They said that distinctiveness is closely related to the quality of captions, The proposed method Contrastive Learning(CL) explicitly encouraged the distinctiveness of captions, while maintaining the overall quality of the generated captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-53",
"text": "In addition to true image-caption pairs, this method used mismatched pairs which include captions describing other images for learning."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-54",
"text": "Term frequency-inverse document frequency(TF-IDF) is widely used in text mining, natural language processing, and information retrieval."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-55",
"text": "TF indicates how often a word appears in the document."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-56",
"text": "This measure employs a simple assumption that frequent terms are significant [11, 14] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-57",
"text": "A concept of IDF was first introduced as \"term specificity\" by Jones [15] in 1972."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-58",
"text": "The intuition was a word which occurs in many documents is not a good discriminator and should be given small weight [15, 16] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-59",
"text": "Weighting schemes are often composed of both TF and IDF terms."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-61",
"text": "**DISTINCTIVE-ATTRIBUTE EXTRACTION**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-62",
"text": "In this paper, we describe the semantic information processing and extraction method, which affects the quality of generated captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-63",
"text": "Inspired by the concept of Contrastive Learning (CL) [5] , we propose a method to generate captions that can represent the unique situation of the image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-64",
"text": "However, different from CL that improved target method by increasing the training set, our method lies in the bottom-up approaches using semantic attributes."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-65",
"text": "We assign more weights to the attributes that are more informative and distinctive to describe the image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-67",
"text": "**OVERALL FRAMEWORK**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-68",
"text": "In this section, we explain overall process of our Distinctive-attribute Extraction(DaE) method."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-69",
"text": "As illustrated in Figure 1 , there are two main steps, one is semantic information extraction, and the other is the distinctive-attribute prediction."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-70",
"text": "We use TF-IDF scheme to extract meaningful information from reference captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-71",
"text": "In Section 3.2, the method is discussed in detail and it contains a scheme to construct a vocabulary from the semantic information."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-72",
"text": "After extracting the semantic information from training sets, we learn distinctive-attribute prediction model with image-information pairs."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-73",
"text": "The model will be described in Section 3.3."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-74",
"text": "After getting distinctive-attribute from images, we apply these attributes to an caption generation network to verify their effect."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-75",
"text": "We used SCN-LSTM [4] as a decoder which is a tag integrated network."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-76",
"text": "Image features and distinctive-attributes predicted by the proposed model are served as inputs of the model."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-77",
"text": "The SCN-LSTM unit with attribute integration and factorization [17] is represented as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-78",
"text": "where z = 1 (t = 1) \u00b7 C v . denotes the element-wise multiply operator."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-79",
"text": "For = i, f, o, c,x"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-80",
"text": "where D p indicates distinctive-attribute predicted by the proposed model described in Section 3.3."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-81",
"text": "Similar to [4, 13, 18] , the objective function is composed of the conditional log-likelihood on the image feature and the attribute as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-82",
"text": "where I n , f (\u00b7), and X indicates the nth image, an image feature extraction function, and the caption, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-83",
"text": "N denotes the number of training images."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-84",
"text": "The length\u2212T caption, X, is represented by a sequence of words; x 0 , x 1 , x 2 , . . . , x T ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-85",
"text": "Modeling joint probability over the words with chain rule, log term is redefined as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-87",
"text": "**SEMANTIC INFORMATION EXTRACTION BY TF-IDF**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-88",
"text": "Most of the previous methods constituted semantic information, that was a ground truth attribute, as a binary form [4, 12, 13, 19] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-89",
"text": "They first determined vocabulary using K most common words in the training captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-90",
"text": "The vocabulary included nouns, verbs, and adjectives."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-91",
"text": "If the word in the vocabulary existed in reference captions, the corresponding element of an attribute vector became 1."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-92",
"text": "Attribute predictors found probabilities that the words in the vocabulary are related to given image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-93",
"text": "Different from previous methods, we weight semantic information according to their significance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-94",
"text": "There are a few words that can be used to describe the peculiar situation of an image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-95",
"text": "They allow one image to be distinguished from others."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-96",
"text": "These informative and distinctive words are weighted more, and the weight scores are estimated from reference captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-97",
"text": "We used the TF-IDF scheme which was widely used in text mining tasks for extracting the semantic importance of the word."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-98",
"text": "Captions are gathered for each image, for example, five sentences are given in MS COCO image captioning datasets [20, 21] , and they are treated as one document."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-99",
"text": "The total number of documents must be the same as the number of images on a dataset."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-100",
"text": "Figure 2 represents samples of COCO image captioning, pairs of an image and captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-101",
"text": "In 2(a), there is a common word \"surfboard\" in 3 out of 5 captions, which is a key-word that characterizes the image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-102",
"text": "Intuitively, this kind of words [20, 21] should get high scores."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-103",
"text": "We apply TF to implement this concept and use average TF metric T F av which is expressed as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-104",
"text": "where T F (w, d) denotes the number of times a word w occurs in a document d. We divide T F (w, d) by N c which is the number of captions for an image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-105",
"text": "There is another common word \"man\" in captions in Figure 2 (a)."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-106",
"text": "TF score of the word \"man\" must be same as that of the word \"surfboard\" because it appears 3 times."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-107",
"text": "However, \"man\" appears a lot in other images."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-108",
"text": "Therefore, that is a less meaningful word for distinguishing one image from another."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-109",
"text": "To reflect this, we apply inverse document frequency (IDF) term weighting."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-110",
"text": "IDF metric for the word w can be written as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-111",
"text": "where N d is the total number of documents, and DF (w) is the number of documents that contain the word w. \"1\" is added in denominator and numerator to prevent zero-divisions [22] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-112",
"text": "Then TF-IDF is derived by multiplying two metrics as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-113",
"text": "We apply L2 normalization to TF-IDF vectors of each image for training performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-114",
"text": "Consequently, the values are normalized into the range of 0 and 1."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-115",
"text": "The semantic information vector which is the ground truth distinctive-attribute vector can be represented as"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-116",
"text": "where D g,iw indicates ground truth D for image index i and for word w in vocabulary."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-117",
"text": "d denotes a document which is a set of reference captions for an image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-118",
"text": "The next step is to construct vocabulary with the words in captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-119",
"text": "It is essential to select the words that make up the vocabulary which ultimately affects captioning performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-120",
"text": "The vocabulary should contain enough particular words to represent each image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-121",
"text": "At the same time, the semantic information should be trained well for prediction accuracy."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-122",
"text": "In the perspective of vocabulary size, Gan [4] and Fang [12] selected 1000 words and Wu [13] selected 256 words, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-123",
"text": "They all selected vocabulary among nouns, verbs, and adjectives."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-124",
"text": "We determine the words to be included in the vocabulary based on the IDF scores."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-125",
"text": "We do not distinguish between verbs, nouns, adjectives, and other parts of speech."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-126",
"text": "The larger the IDF value of a word is, the smaller the number of documents, i.e., the number of image data, which include the word."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-127",
"text": "In this case, the word is said to be unique, but a model with this kind of inputs is challenging to be trained."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-128",
"text": "We observe the performance of the semantic attribute prediction model and overall captioning model while changing the IDF value threshold."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-129",
"text": "In addition, we compare the results with applying stemming before extracting TF-IDF."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-130",
"text": "We assume that words with the same stem mostly mean same or relatively close concepts in a text."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-131",
"text": "For example, \"looking\" and \"looks\" are mapped to the same word \"look\" after stemming."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-132",
"text": "Wu [13] did a similar concept, manually changing their vocabulary to be not plurality sensitive."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-133",
"text": "We used Porter Stemmer algorithm [23] which is implemented in Natural Language Toolkit (NLTK) [24] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-134",
"text": "For each image, distinctive-attribute vectors are inferred by a prediction model."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-135",
"text": "Figure 3 summarizes the distinctive-attribute prediction network."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-136",
"text": "We use ResNet-152 [9] architecture for CNN layers which have been widely used in vision tasks."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-137",
"text": "The output of the 2048-way pool5 layer from ResNet-152 [9] is fed into a stack of fully connected layers."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-138",
"text": "This ResNet output is also reused in SCN-LSTM network as described in Section 3.1."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-139",
"text": "Training data for each image consist of input image I and ground truth distinctive-attribute"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-141",
"text": "**DISTINCTIVE-ATTRIBUTE PREDICTION MODEL**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-142",
"text": "where N w is the number of the words in vocabulary and i is the index of the image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-143",
"text": "Our goal is to predict attribute scores as similar as possible to D g ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-144",
"text": "The cost function to be minimized is defined as mean squared error:"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-145",
"text": "where"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-146",
"text": "is predictive attribute score vector for ith image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-147",
"text": "M denotes the number of training images."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-148",
"text": "Convolutional layers are followed by four fully-connected (FC) layers: the first three have 2048 channels each, the fourth contains N w channels."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-149",
"text": "We use ReLU [25] as nonlinear activation function for all FC."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-150",
"text": "We adopt batch normalization (BN) [26] right after each FC and before activation."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-151",
"text": "The training is regularized by dropout with ratio 0.3 for the first three FCs."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-152",
"text": "Each FC is initialized with a Xavier initialization [27] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-153",
"text": "We note that our network does not contain softmax as a final layer, different from other attribute predictors described in previous papers [4, 13] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-154",
"text": "Hence, we use the output of an activation function of the fourth FC layer as the final predictive score D p,i ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-156",
"text": "**EXPERIMENT**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-157",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-158",
"text": "**DATASETS**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-159",
"text": "Our results are evaluated on the popular MS COCO dataset [20, 21] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-160",
"text": "The dataset contains 82,783 images for training and 40,504 for validation."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-161",
"text": "Due to annotations for test set is not available, we report results with the widely used split [10] which contain 5,000 images for validation and test, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-162",
"text": "We applied the same splits to both semantic attribute prediction network and SCN-LSTM network."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-163",
"text": "We infer the results of the actual COCO test set consisting of 40,775 images and also evaluate them on the COCO evaluation server [21] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-165",
"text": "**TRAINING**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-166",
"text": "The model described in Section 3.3 is used for distinctive-attribute prediction and the training procedures of it are implemented in Keras [28] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-167",
"text": "To implement TF-IDF schemes for meaningful information extraction, we used scikit-learn toolkit [22] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-168",
"text": "The mini-batch size is fixed at 128 and Adam's optimization [29] with learning rate 3 \u00d7 10 \u22123 is used and stopped after 100 epochs."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-169",
"text": "For the prediction model, we train 5 identical models with different initializations, and then ensemble by averaging their outcomes."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-170",
"text": "Attributes of training and validation sets are inferred from the prediction model and applied to the SCN-LSTM model training."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-171",
"text": "In order to analyze the effect of semantic information extraction method on overall performance, various experiments were conducted."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-172",
"text": "A vocabulary selection in the semantic information affects training performance, which ultimately affects caption generation performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-173",
"text": "We use various combinations of vocabularies for the experiment and report both quantitative and qualitative evaluations."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-174",
"text": "First, we apply IDF thresholding to eliminate the words from vocabulary which have small values than the threshold th IDF ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-175",
"text": "We use seven different th IDF s for the experiment."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-176",
"text": "Secondly, we apply stemming for words before extracting TF-IDF and IDF thresholding."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-177",
"text": "After semantic information vectors are extracted, they are fed into the prediction model in pairs with images."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-178",
"text": "The training results with the different vectors will be reported in Sec 4.4."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-179",
"text": "SCN-LSTM training procedure generally follows [4] except for the dimension of the input attribute vector."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-180",
"text": "We use the public implementation [30] of this method opened by Gan who is the author of the published paper [4] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-181",
"text": "For an image feature, we take out the output of the 2048-way pool5 layer from ResNet-152 which is pre-trained on the ImageNet dataset [31] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-182",
"text": "Word embedding vectors are initialized with the word2vec vectors proposed by [32] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-183",
"text": "The number of hidden units and the number of factors are both set to 512."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-184",
"text": "We set batch size as 64 and use gradient clipping [33] and dropout [34] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-185",
"text": "Early stopping was applied for validation sets with the maximum number of epochs 20."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-186",
"text": "Adam optimizer [29] was used with learning rate 2 \u00d7 10 \u22124 ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-187",
"text": "In testing, we use beam search for caption generation and select the top 5 best words at each LSTM step as the candidates."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-188",
"text": "We average inferred probability for 5 identical SCN-LSTM model as [4] did."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-190",
"text": "**EVALUATION PROCEDURES**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-191",
"text": "We use the macro-average F1 metric to compare the performance of the proposed distinctive-attribute prediction model."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-192",
"text": "The output attribute of previous methods [4, 12, 13, 19] represent probabilities, on the other hand, that of the proposed method are the distinctiveness score itself."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-193",
"text": "We evaluate the prediction considering it as a multi-label and multi-class classification problem. ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-194",
"text": "In case the value 0.0 occupies most of the elements, it disturbs accurately comparing the performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-195",
"text": "Therefore, we exclude those elements intentionally in the comparison."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-196",
"text": "Each word in attribute vocabulary is regarded as one class, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-197",
"text": "The macro-averaged F1 score is computed globally by counting the total number of true positives, false negatives, true negatives, and false positives."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-198",
"text": "The widely used metrics, BLEU-1,2,3,4 [35] , METEOR [36] , ROUGL-L [37] , CIDEr [38] are selected to evaluate overall captioning performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-199",
"text": "The code released by the COCO evaluation server [21] is used for computation."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-200",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-201",
"text": "**RESULTS**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-202",
"text": "Firstly, we compared our method with SCN [30] that uses the extracted attribute according to their semantic concept detection method."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-203",
"text": "We evaluate both results on the online COCO testing server and list them in Table 1 ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-204",
"text": "The pre-trained weights of SCN are provided by the author."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-205",
"text": "We downloaded and used them for an inference according to the author's guide."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-206",
"text": "For the proposed method, we used vocabulary after stemming and set threshold IDF value as 7 in this evaluation."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-207",
"text": "The vocabulary size of the proposed scheme is 938, which is smaller than that of SCN [30] with 999."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-208",
"text": "Accordingly, weight matrices dimensions of the proposed method are smaller than that of SCN in SCN-LSTM structures."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-209",
"text": "Results of both methods are derived from ensembling 5 models, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-210",
"text": "DaE improves the performance of SCN-LSTM by significant margins across all metrics."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-211",
"text": "Specifically, Table 2 ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-212",
"text": "In 40-refs, our method surpasses the performance of AddaptiveAttention + CL which is the state-of-the-art in terms of four BLEU scores."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-213",
"text": "For the qualitative evaluation, tags extracted by the semantic concept detection of the SCN and description generated using them are illustrated as shown in Table 6 ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-214",
"text": "Moreover, distinctive-attributes extracted by DaE and a caption are shown in the lower row."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-215",
"text": "The attributes extracted using DaE include important words to represent the situation in an image; as a result, the caption generated by using them are represented more in detail compared with those of SCN."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-216",
"text": "Scores in the right parentheses of the tags and distinctive-attributes have different meanings, the former is probabilities, and the latter is distinctiveness values of words by the proposed scheme."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-217",
"text": "We listed the top eight attributes in descending order."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-218",
"text": "In the case of DaE, words after stemming with Porter Stemmer [23] are displayed as they are."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-219",
"text": "The result of OURS in (a), \"A woman cutting a piece of fruit with a knife\", explains exactly what the main character does."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-220",
"text": "In the SCN, the general word 'food' get a high probability, on the other hand, DaE extracts more distinctive words such as 'fruit' and \"apple\"."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-221",
"text": "For verbs, \"cut\", which is the most specific action that viewers would be interested in, gets high distinctiveness score."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-222",
"text": "In the case of (b), \"wine\" and \"drink\" are chosen as the words with the first and the third highest distinctiveness through DaE. Therefore, the characteristic phrase \"drinking wine\" is added."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-223",
"text": "To analyze DaE in more detail, we conduct experiments with differently constructed vocabularies, as explained in Section 4.2."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-224",
"text": "We used splits on COCO training and validation sets as done in the work of [10] ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-225",
"text": "Table 4 (a) presents the results of experiments with vocabularies after stemming."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-226",
"text": "We set seven different IDF threshold values, th IDF , from 5 to 11."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-227",
"text": "The vocabulary contains only the words whose IDF is bigger than th IDF ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-228",
"text": "Setting the IDF threshold value to 5 means that only the words appearing in over 1/10 4 of the entire images are treated, according to 12."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-229",
"text": "The number of vocabulary words is shown in the second row of Table 4 (a)."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-230",
"text": "For example, the number of words in V ocab 5 is 276 out of total 5,663 words after stemming in reference captions."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-231",
"text": "Semantic information of the images are extracted corresponding to this vocabulary, and we use them to learn the proposed prediction model."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-232",
"text": "The performance, macro-averaged F1, of the prediction model evaluated by test splits is shown in the third row."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-233",
"text": "The lower the th IDF , that is, the vocabulary is composed of the more frequent words, provides the better prediction performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-234",
"text": "Each extracted distinctive-attribute is fed into SCN-LSTM to generate a caption, and the evaluation result, CIDEr, is shown in the fourth row."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-235",
"text": "The CIDErs increase from V ocab 5 to V ocab 7 , and then monotonically decrease in the rest."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-236",
"text": "In other words, the maximum performance is derived from V ocab 7 to 0.996."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-237",
"text": "The vocabulary size and the prediction performance are in a trade-off in this experiment."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-238",
"text": "With the high th IDF value, captions can be generated with various vocabularies, but the captioning performance is not maximized because the performance of distinctive-attribute prediction is relatively low."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-239",
"text": "Table 5 ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-240",
"text": "Several cases that more diverse and accurate captions are generated using V ocab9 than using V ocab6, although their CIDErs are similar V ocab 6 and V ocab 9 have almost the same CIDEr."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-241",
"text": "At this time, If the vocabulary contains more words, it is possible to represent the captions more diversely and accurately for some images."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-242",
"text": "Table 5 shows examples corresponding to this case."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-243",
"text": "For the case of (a), the V ocab 6 does not include the word \"carriag\", but the V ocab 9 contains the words and is extracted as the word having the seventh highest value through DaE. This led the phrase \"pulling a carriage\" to be included the caption, well describing the situation."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-244",
"text": "\"Tamac\" in (b), and \"microwav\" in (c) plays a similar role."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-245",
"text": "Table 4 (b) presents experimental results without stemming."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-246",
"text": "The captioning performance is highest at V ocab 7 ."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-247",
"text": "The value was 0.911, which is lower than the maximum value of the experiments with stemming."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-248",
"text": "When stemming is applied, the distinctiveness and significance of a word can be better expressed because it is mapped to the same word even if the tense and form are different."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-249",
"text": "The size of vocabulary required to achieve the same performance is less when stemming is applied."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-250",
"text": "It means that the number of parameters needed for the captioning model is small and the computational complexity is low."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-251",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-252",
"text": "**CONCLUSION**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-253",
"text": "In this study, we propose a Distinctive-attribute Extraction (DaE) method for image captioning."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-254",
"text": "In particular, the proposed scheme consists of the semantic attribute extraction and semantic attribute prediction."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-255",
"text": "To obtain the semantic attributes, TF-IDF of trained captions is computed to extract meaningful information from them."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-256",
"text": "Then, the distinctive-attribute vectors for an image are computed by regularizing TF-IDF of each word with the L2 normalized TF-IDF of the image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-257",
"text": "The attribute prediction model is trained by the extracted attributes and used to infer the semantic-attribute for generating a natural language description."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-258",
"text": "DaE improves the performance of SCN-LSTM scheme by signicant margins across all metrics, moreover, distinctive captions are generated."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-259",
"text": "Specifically, CIDEr scores on the COCO evaluation server are improved from 0.967 to 0.981 in 5-refs and from 0.971 to 0.990 in 40-refs, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-260",
"text": "The proposed method can be applied to other base models that use attribute to improve their performance."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-261",
"text": "Therefore, we believe that the proposed scheme can be a useful tool for effective image caption scheme."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-262",
"text": "----------------------------------"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-263",
"text": "**SUPPLEMENTARY MATERIAL**"
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-264",
"text": "In the experiment, we compared our method with SCN [4, 30] that uses extracted tags according to their semantic concept detection method."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-265",
"text": "To evaluate the proposed method with more pictures, we compare the predicted semantic attributes by using SCN and the proposed scheme."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-266",
"text": "The results are listed in Table 5."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-267",
"text": "The attribute in SCN and the proposed method (DaE) is called as tag and distinctive-attribute, respectively."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-268",
"text": "The tag represents probabilities, on the other hand, the attribute from DaE is distinctiveness score itself."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-269",
"text": "We listed the top eight attributes in descending order."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-270",
"text": "In the case of DaE, words after stemming are displayed as they are."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-271",
"text": "The captions obtained using image features and extracted semantic information are also compared in the table."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-272",
"text": "In (a), a child is feeding grass to a giraffe through a fence."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-273",
"text": "The caption generated by SCN includes \"dog\" that does not exist in the picture and is inaccurate."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-274",
"text": "However, as a result of DaE, the word \"giraff\" gets a higher score than the \"dog\" and is reflected in the generated caption."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-275",
"text": "In addition, DaE detects the verb \"feed\", which represents the main situation of the image, and the exact phrase \"feeding a giraffe through a fence\" is produced."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-276",
"text": "In (b), \"red truck\" and \"snow\" are recognized as \"fire hydrant\" and \"water,\" respectively, by SCN."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-277",
"text": "Those words creating the phrase \"hydrant spraying water\" that does not fit a situation of the image."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-278",
"text": "On the other hand, DaE extracts exact nouns, verb and adjective such as \"truck\", \"snow,\" \"drive,\" and \"red.\""
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-279",
"text": "In (c), DaE detects the banana located in a small part of the image with the highest score among the distinctive-attributes."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-280",
"text": "\"Banana\" is combined with another well-detected word \"hold\" to create a participial construction: \"holding a banana.\""
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-281",
"text": "In (d), the situation is that a man is taking selfi through a mirror."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-282",
"text": "DaE detects the stemmed word \"hi\" corresponding to \"himself.\" On the other hand, the tag vocabulary set of SCN does not contain the words such as \"himself\" or \"self.\" Besides, SCN recognizes the camera or phone as a Nintendo."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-283",
"text": "In (e), the general caption \"A close up of a sandwich on a plate.\" is generated by SCN, on the other hand, the caption generated using the proposed method contains a distinctive phrase \"cut in half\" due to the extracted distinctive-attributes \"cut\" and \"half.\""
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-284",
"text": "In (f), there is a bull in the center of the picture."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-285",
"text": "The vocabulary of SCN does not contain the word \"bull\", but the vocabulary of our method contains the word, even though the vocabulary size is smaller."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-286",
"text": "This specific word is extracted through DaE and reflected in the caption."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-287",
"text": "In (g), DaE detects that the picture is a \"store\" or a \"shop,\" and accurately figures out the situation that the clock is \"displayed\" over the \"window.\" On the other hand, SCN extracts words that are general and inappropriate to the situation, such as \"building\" and \"outdoor.\""
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-288",
"text": "In (h), there is a red stop sign next to a man."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-289",
"text": "DaE extracts both \"sign\" and its message \"stop.\" In addition, \"sunglass\" is extracted to generate a caption that well represents an appearance of the man."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-290",
"text": "On the other hand, the caption generated by SCN includes expressions such as \"man in a blue shirt\" and \"holding a sign\" that is not the situation of the picture."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-291",
"text": "In (i), DaE extracts the word \"frost\" that exists only in its vocabulary and does not exist in the vocabulary of SCN."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-292",
"text": "And the elaborate caption was created containing the word."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-293",
"text": "The caption \"A close up of a cake on a plate,\" which is generated by SCN, is relatively general."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-294",
"text": "In (j), DaE extracts key objects and place such as \"microwav\", \"kitchen\", \"sink\", etc. And the captions generated by them are more detailed than captions generated by the tags of SCN."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-295",
"text": "In (k), a man is standing in front of a computer monitor or laptops."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-296",
"text": "DaE detects \"comput\" and \"laptop,\" which are not detected by SCN, and generates more accurate caption than that using the tags of SCN."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-297",
"text": "In (l), a pair of scissors placed in a plastic packing case is taken close up."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-298",
"text": "DaE extracts \"scissor\" which is the main object of the picture as the highest score."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-299",
"text": "The word \"pair\" which is used when counting the scissor, is extracted as the second highest score."
},
{
"sent_id": "33096f1e855d23046cb4cbfe95eef0-C001-300",
"text": "On the other hand, the main object of the caption generated by SCN is \"cell phone\" that does not exist in the picture."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"33096f1e855d23046cb4cbfe95eef0-C001-12"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-13",
"33096f1e855d23046cb4cbfe95eef0-C001-14"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-17",
"33096f1e855d23046cb4cbfe95eef0-C001-18"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-32"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-48"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-122"
]
],
"cite_sentences": [
"33096f1e855d23046cb4cbfe95eef0-C001-12",
"33096f1e855d23046cb4cbfe95eef0-C001-13",
"33096f1e855d23046cb4cbfe95eef0-C001-14",
"33096f1e855d23046cb4cbfe95eef0-C001-17",
"33096f1e855d23046cb4cbfe95eef0-C001-18",
"33096f1e855d23046cb4cbfe95eef0-C001-32",
"33096f1e855d23046cb4cbfe95eef0-C001-48",
"33096f1e855d23046cb4cbfe95eef0-C001-122"
]
},
"@USE@": {
"gold_contexts": [
[
"33096f1e855d23046cb4cbfe95eef0-C001-48",
"33096f1e855d23046cb4cbfe95eef0-C001-49",
"33096f1e855d23046cb4cbfe95eef0-C001-50"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-75"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-88"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-180"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-188"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-264"
]
],
"cite_sentences": [
"33096f1e855d23046cb4cbfe95eef0-C001-48",
"33096f1e855d23046cb4cbfe95eef0-C001-75",
"33096f1e855d23046cb4cbfe95eef0-C001-88",
"33096f1e855d23046cb4cbfe95eef0-C001-180",
"33096f1e855d23046cb4cbfe95eef0-C001-188",
"33096f1e855d23046cb4cbfe95eef0-C001-264"
]
},
"@SIM@": {
"gold_contexts": [
[
"33096f1e855d23046cb4cbfe95eef0-C001-81"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-179"
]
],
"cite_sentences": [
"33096f1e855d23046cb4cbfe95eef0-C001-81",
"33096f1e855d23046cb4cbfe95eef0-C001-179"
]
},
"@DIF@": {
"gold_contexts": [
[
"33096f1e855d23046cb4cbfe95eef0-C001-122",
"33096f1e855d23046cb4cbfe95eef0-C001-123",
"33096f1e855d23046cb4cbfe95eef0-C001-124",
"33096f1e855d23046cb4cbfe95eef0-C001-125"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-153"
],
[
"33096f1e855d23046cb4cbfe95eef0-C001-192"
]
],
"cite_sentences": [
"33096f1e855d23046cb4cbfe95eef0-C001-122",
"33096f1e855d23046cb4cbfe95eef0-C001-153",
"33096f1e855d23046cb4cbfe95eef0-C001-192"
]
}
}
},
"ABC_c327812b2369a1dfc8e2ce4077b997_3": {
"x": [
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-2",
"text": "A sentence aligned parallel corpus is an important prerequisite in statistical machine translation."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-3",
"text": "However, manual creation of such a parallel corpus is time consuming, and requires experts fluent in both languages."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-4",
"text": "Automatic creation of a sentence aligned parallel corpus using parallel text is the solution to this problem."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-5",
"text": "In this paper, we present the first ever empirical evaluation carried out to identify the best method to automatically create a sentence aligned Sinhala-Tamil parallel corpus."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-6",
"text": "Annual reports from Sri Lankan government institutions were used as the parallel text for aligning."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-7",
"text": "Despite both Sinhala and Tamil being under-resourced languages, we were able to achieve an F-score value of 0.791 using a hybrid approach that makes use of a bilingual dictionary."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-10",
"text": "Sentence and word aligned parallel corpora are extensively used for statistical machine translation (AlOnaizan et al., 1999; Callison-Burch, 2004 ) and in multilingual natural language processing (NLP) applications (Kaur and Kaur, 2012) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-11",
"text": "In recent years, parallel corpora have become more widely available and serve as a source for data-driven NLP tasks for languages such as English and French (Hallebeek, 2000; Kaur and Kaur, 2012) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-12",
"text": "A parallel corpus is a collection of text in one or more languages with their translation into another language or languages that have been stored in a machine-readable format (Hallebeek, 2000) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-13",
"text": "A parallel corpus can be aligned either at sentence level or word level."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-14",
"text": "Sentence and word alignment of parallel corpus is the identification of the corresponding sentences and words (respectively) in both halves of the parallel text."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-15",
"text": "Sentence alignment could be of various combinations including one to one where one sentence maps to one sentence in the other corpus, one to many where one sentence maps to more than one sentences in the other corpus, many to many where many sentences map to many sentences in the other corpus or even one to zero where there is no mapping for a particular sentence in the other corpus."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-16",
"text": "For statistical machine translation, the more the number of parallel sentence pairs, the higher the quality of translation (Koehn, 2010) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-17",
"text": "However, manual alignment of a large number of sentences is time consuming, and requires personnel fluent in both languages."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-18",
"text": "Automatic sentence alignment of a parallel corpus is the widely accepted solution for this problem."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-19",
"text": "Already many sentence alignment techniques have been implemented for some languages pairs such as English-French (Gale and Church, 1993; Brown et al., 1991; Chen, 1993; Braune and Fraser 2010; Lamraoui and Langlais, 2013) , English-German (Gale and Church, 1993) English-Chinese (Wu, 1994; Chuang and Yeh, 2005) and Hungarian-English (Varga et al., 2005; T\u00f3th et al., 2008) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-20",
"text": "However, none of these techniques have been evaluated for Sinhala and Tamil, the two official languages in Sri Lanka."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-21",
"text": "This paper presents the first ever study on automatically creating a sentence aligned parallel corpus for Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-22",
"text": "Sinhala and Tamil are both under-resourced languages, and research implementing basic NLP tool such as POS taggers and morphological analysers is at its inception stage (Herath et al., 2004; Hettige and Karunananda, 2006; Anandan et al., 2002) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-23",
"text": "Therefore, not all the aforementioned sentence alignment techniques are applicable in the context of Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-24",
"text": "With this limitation in mind, an extensive literature study was carried out to identify the applicable sentence alignment techniques for Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-25",
"text": "We implemented six such methods, and evaluated their performance using a corpus of 1300 sentences based on the precision, recall, and F-measure using annual reports of Sri Lankan government departments as the source text."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-26",
"text": "The highest F-measure value of 0.791 was obtained for Varga et al.'s (2005) Hunalign method, the hybrid method that combined the use of a bilingual dictionary with the statistical method by Gale and Church (1993) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-27",
"text": "The rest of the paper is organized as follows."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-28",
"text": "Section 2 identifies related work in this area."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-29",
"text": "Section 3 describes how different techniques were employed in the alignment process, and section 4 presents the results for these techniques."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-30",
"text": "Section 5 contains a discussion of these results while section 6 presents the conclusion and future work."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-32",
"text": "**RELATED WORK**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-33",
"text": "Automatic sentence alignment techniques can be broadly categorized into three classes: statistical, linguistic, and hybrid methods."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-34",
"text": "Statistical methods use quantitative measures (such as sentence size, sentence character number) to create an alignment relationship; linguistic methods use linguistic knowledge gained from sources such as morphological analyzers, bilingual dictionaries, and word list pairs, to relate sentences; hybrid methods combine the statistical and linguistic methods to achieve accurate statistical information (Sim\u00f5es, 2004) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-35",
"text": "Gale and Church (1993) , and Brown et al. (1991) have introduced statistical methods for aligning sentences that have been successfully used for European languages, including English-French, EnglishGerman, English-Polish, English-Spanish (McEnery et al., 1997) , English-Dutch and Dutch -French (Paulussen et al, 2013) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-37",
"text": "**STATISTICAL METHODS**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-38",
"text": "These methods have also been used with Non-European languages such as English -Chinese (McEnery and Oakes, 1996) , Italian-Japanese (Zotti et al, 2014) , English-Arabic (Alkahtani et al, 2015) , and English-Malay (Yeong et al, 2016) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-39",
"text": "The general idea of these methods is that the closer in length two sentences are, the more likely they align."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-40",
"text": "Brown et al.'s (1991) method aligns sentences based on sentence length measured using word count."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-41",
"text": "Here anchor points are used for alignment."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-42",
"text": "Gale and Church use the number of characters as the length measure."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-43",
"text": "While the parameters such as mean and variance for Gale and Church's (1993) method are considered language independent for European languages, tuning these for non-'European language pairs has improved results (Zotti et al, 2014) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-44",
"text": "Both these methods have given good accuracy in alignment; however they require some form of initial alignment or anchor points."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-45",
"text": "Method by Chuang and Yeh (2005) exploits the statistically ordered matching of punctuation marks in the two languages English and Chinese to achieve high accuracy in sentence alignment compared with using the length-based methods alone."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-47",
"text": "**LINGUISTIC METHODS**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-48",
"text": "Linguistic methods exploit the linguistic characteristics of the source and target languages such as morphology and sentence structure to improve the alignment process."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-49",
"text": "However linguistic methods are not used independently but have been introduced in conjunction with statistical methods, forming hybrid methods as described in the next section."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-51",
"text": "**HYBRID METHODS**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-52",
"text": "Statistical methods such as that of Brown et al., (1991) , and Gale and Church (1991) require either corpus-dependent anchor points, or prior alignment of paragraphs to obtain better accuracy."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-53",
"text": "Hybrid methods make use of statistical as well as linguistic features of the sentences obtaining better accuracy in documents with or without these types of prior alignments."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-54",
"text": "Hence hybrid methods are widely used to achieve higher accuracy in alignment."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-55",
"text": "The methods by Wu (1994) , Chen (1993) , Moore (2002) , Varga et al. (2005) , Sennrich and Volk (2011), Lamraoui and Langlais (2013) , Braune and Fraser (2010) , T\u00f3th et al. (2008) and M\u00fajdricza-Maydt et al. (2013) are some of them."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-56",
"text": "The method used by Wu (1994) is a modification of Gale and Church's (1993) length-based statistical method for the task of aligning English with Chinese."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-57",
"text": "It uses a bilingual external lexicon with lexicon cues to improve the alignment accuracy."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-58",
"text": "Dynamic programming optimization has been used for the alignment of the lexicon extensions."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-59",
"text": "However, the computation and memory costs grow linearly with the number of lexical cues."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-60",
"text": "The method by Chen (1993) is a word-correspondence-based model that gives a better accuracy than length based methods, however, it was reported to be much slower than the algorithms of Brown et al., (1991) and Gale and Church (1993) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-61",
"text": "Moore's (2002) method aligns the corpus using a modified version of Brown et al.'s (1991) sentence-length-based model in the first pass."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-62",
"text": "It then uses the sentence pairs that were assigned the highest probability of alignment to train a modified version of IBM Translation Model 1 (one of the five translation models that assigns a probability to each of the possible word-by-word alignmentsdeveloped by Brown et al. (1993) )."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-63",
"text": "The corpus is realigned, augmenting the initial alignment model with IBM Model 1, to produce an alignment based both on sentence length and word correspondences."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-64",
"text": "It uses a novel search-pruning technique to efficiently find the sentence pairs that will be aligned with the highest probability without the use of anchor points or larger previously aligned units like paragraphs or sections."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-65",
"text": "This is an effective method that gets a relatively high performance especially in precision."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-66",
"text": "Nonetheless, this method has the drawback that it usually gets a low recall especially when dealing with sparse data (Trieu et al., 2015) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-67",
"text": "Hunalign sentence alignment method by Varga et al. (2005) uses a hybrid algorithm based on a length-based method that makes use of a bilingual dictionary."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-68",
"text": "The similarity score between a source and a target sentence consists of two major components, which are token-based score and length-based score."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-69",
"text": "The token-based score depends on the number of shared words in the two sentences while the length-based alignment is based on the character count of the sentence."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-70",
"text": "Varga et al.'s (2005) method uses a dictionary-based crude translation model instead of a full IBM translation model as used by Moore (2002) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-71",
"text": "This has the very important advantage that it can exploit a bilingual lexicon, if one is available, and tune it according to frequencies in the target corpus."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-72",
"text": "Moore's (2002) method offers no such way to tune a pre-existing language model."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-73",
"text": "Moreover, the focus of Moore's (2002) algorithm on one-to-one alignments is less than optimal, since excluding one-to-many and many-to-many alignments may result in losing substantial amounts of aligned material if the two languages have different sentence structuring conventions (Varga et al., 2005) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-74",
"text": "Bleualign sentence aligner by Sennrich and Volk (2011) is based on the BLEU (bilingual evaluation understudy) score, which is an algorithm for evaluating the quality of text that has been machinetranslated from one natural language to another."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-75",
"text": "Instead of computing an alignment between the source and target text directly, this technique bases its alignment search on a Machine Translation (MT) of the source text."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-76",
"text": "The YASA method by Lamraoui and Langlais (2013) also operates a two-step process through the parallel data."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-77",
"text": "Cognates are first recognized in order to accomplish a first token-level alignment that (efficiently) delimits a fruitful search space."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-78",
"text": "Then, sentence alignment is performed on this reduced search space."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-79",
"text": "The speed of the YASA aligner and memory use is comparatively better than Moore's (2002) aligner (Lamraoui and Langlais, 2013) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-80",
"text": "Though the method by Braune and Fraser (2010) is four times slower than Moore's (2002) method, it supports one to many and many to one alignments as well."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-81",
"text": "It uses an improved pruning method and in the second pass, the sentences are optimally aligned and merged."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-82",
"text": "This method uses a two-step clustering approach in the second pass of the alignment."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-83",
"text": "The method by T\u00f3th et al. (2008) exploits the fact that Named Entities cannot be ignored from any translation process, so a sentence and its translation equivalent contain the same Named Entities."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-84",
"text": "The method by M\u00fajdricza-Maydt et al. (2013) uses a two-step process to align sentences."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-85",
"text": "Machine alignments known as \"wood standard\" annotations, produced using state-of-the-art sentence aligners in a first step, are used in a second step, to train a discriminative learner."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-86",
"text": "This combination of arbitrary amounts of machine aligned data and an expressive discriminative learner provides a boost in precision."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-87",
"text": "All features used in the second step, with the exception of the POS agreement feature, are language-independent."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-88",
"text": "According to Gale and Church (1993) a considerably large parallel corpus having a small error percentage can be built without lexical constraints."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-89",
"text": "According to the authors, lexical constraints might slow down the program and make it less useful in the first pass."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-90",
"text": "Linguistic methods can produce better results if the performance of the system is not a concern."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-91",
"text": "Hybrid methods such as that of Moore's (2002) that do not require particular knowledge about the corpus or the languages involved are faster as they tend to build the bilingual dictionary for aligning using the input to the aligner based on previous word-correspondence-based models."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-92",
"text": "Furthermore, results of some of the above methods such as Hunalign (Varga et al, 2005) , Bleualign (Sennrich and Volk, 2011) and Gargantua (Braune and Fraser, 2010) could be improved by applying linguistic factors such as word forms, chunks and collocations (Navlea and Todira\u015fcu, 2010) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-93",
"text": "Some have used morphologically processed (lemmatized and morphologically tagged) data and have used taggers (POS tagger) because it significantly increases the value of the data (Bojar et al, 2014) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-95",
"text": "**INDIC LANGUAGES**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-96",
"text": "Automatic alignment of sentences has been attempted for few Indic language pairs from the South Asian subcontinent including Hindi-Urdu (Kaur and Kaur, 2012) and Hindi-Punjabi (Kumar and Goyal, 2010) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-97",
"text": "This research used the method proposed by Gale and Church (1993) citing the close linguistic similarities between languages of these pairs, causing parallel sentences to be of similar lengths."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-99",
"text": "**METHODOLOGY**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-101",
"text": "**DATA SOURCE**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-102",
"text": "The parallel corpus used in aligning sentences is from annual reports published by different government departments in Sri Lanka."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-103",
"text": "These government reports have been manually translated from Sinhala to Tamil by translators with different levels of experience in translation and Sinhala-Tamil competency."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-104",
"text": "Thus the quality of the translations compared to other sources such as those from the Parliament of Sri Lanka is comparatively low with a considerable number of omissions and mistranslations."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-105",
"text": "These annual reports are in pdf format."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-106",
"text": "Text was automatically extracted from the pdf documents, and converted to Unicode to ensure uniformity."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-107",
"text": "The text thus obtained was segmented into sentences using a custom tokenization algorithm implemented specifically for Tamil and Sinhala."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-108",
"text": "Although there are some tokenizers for Sinhala 1 and Tamil, they could not be used for this purpose, since the abbreviations used in our input text are different from those in the existing tokenizers."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-109",
"text": "Therefore we created a list of manually extracted abbreviations."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-110",
"text": "Splitting documents into sentences was done by using delimiters such as \" ., ? , ! \"."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-111",
"text": "Splitting into sentences using full stops is misleading at abbreviations, decimal digits, e-mails, URLs etc., because full stops at these places are not actual sentence boundaries."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-112",
"text": "Therefore splitting into sentences at these points was avoided by means of regular expression checks."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-113",
"text": "However issues such as omissions of punctuation marks result in the need for complex alignments (one to many, many to many)."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-114",
"text": "For example 2 the following sentences in Sinhala specify five cities (Kuruwita, Rathnapura, Balangoda, Godakawela, Opanayake) followed by the sentence \"The Active Committee representing the Operations Co-ordination Centers for Language Associations in Vavuniya was established\"."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-115",
"text": "(\u0d9a\u0dd4\u0dbb\u0dd4\u0dc0\u0dd2\u0da7,\u0dbb\u0dad\u0dca\u0db1\u0db4\u0dd4\u0dbb,\u0db6\u0dbd\u0d82\u0d9c \u0ddc\u0da9, \u0d9c \u0ddc\u0da9\u0d9a\u0d9c\u0dd9\u0dbd, \u0d95\u0db4\u0db1\u0dcf\u0dba\u0d9a)."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-117",
"text": "**\u0dd9\u0dc0\u0dd4\u0db1\u0dd2\u0dba\u0dcf\u0dd9 \u0db7\u0dcf\u0dc2\u0dcf \u0dc3\u0d82 \u0db8\u0dca \u0d9c\u0dd9\u0d9c\u0dd9\u0dba\u0dd4\u0db8\u0dca \u0dd9\u0db0 \u0dba\u0dc3\u0dca \u0dae\u0dcf\u0db1 \u0d9a \u0dbb\u0dd2\u0dba\u0dcf\u0d9a\u0dcf\u0dbb\u0dd3 \u0d9a\u0db8\u0dd2\u0da7\u0dd4\u0dd9 \u0dc3\u0dca \u0dae\u0dcf\u0db4\u0dd2\u0dad \u0d9a\u0dbb\u0db1 \u0dbd\u0daf\u0dd3.**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-118",
"text": "However due to the omission of the period in the corresponding Tamil text, the above is identified as one single sentence in Tamil requiring the alignment to map one Tamil sentence to many Sinhala sentences."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-119",
"text": "The bilingual dictionary used for alignment was obtained from the trilingual dictionary 3 combined with the glossaries obtained from the Department of Official languages 4 , Sri Lanka."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-120",
"text": "The number of words in the lexicon obtained has around 90000 words, but it does not have all the commonly used words in the languages and mostly has the spoken forms of words in Sinhala, which are not used in the written official documents."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-122",
"text": "**SENTENCE ALIGNMENT**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-123",
"text": "Depending on the similarities and dissimilarities between the languages and the quality of the data source, different techniques discussed in section 2 have given different results for the alignment for different language pairs."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-124",
"text": "For example, a method like that of Chuang and Yeh (2005) would work well for parallel text where punctuations are consistent, while that of Varga et al. (2005) would work better for languages that lack etymological relations."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-125",
"text": "Thus the objective of this research is to experiment with these techniques for Sinhala-Tamil, and identify the best technique."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-126",
"text": "However, not all methods described in section 2 can be used in the context of Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-127",
"text": "For example, methods by T\u00f3th et al. (2008) and M\u00fajdricza-Maydt et al. (2013) cannot be used because NER systems and comprehensive POS taggers are not fully developed for Sinhala (Dahanayaka and Weerasinghe, 2014; Manamini et al., 2016) and Tamil (Pandian et al., 2008; Vijayakrishna and Devi, 2008) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-128",
"text": "Also methods that align using the punctuations in the two languages similar to that of Chuang and Yeh (2005) cannot be used in this case because when extracting text from pdf, some punctuations are lost, and also the translators of the original text have not been consistent with the use of punctuations."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-129",
"text": "Constrained by the available resources, we compared methods by Gale and Church (1993) , Moore (2002) , Varga et al. (2005) , Braune and Fraser (2010) , Lamraoui and Langlais (2013) , and Sennrich and Volk (2011) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-130",
"text": "These methods have shown promising results for languages that show close linguistic relationships, which is also the case with Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-131",
"text": "These close linguistic relationships include similarities in word or sentence length, similarities in sentence structure and in languages that use the character set, similarities between words."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-132",
"text": "Linguistic similarities between Sinhala and Tamil include word and sentence length similarities and sentence structure similarity with both Sinhala and Tamil following a Subject-Object-Verb structure."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-133",
"text": "The mean and variance for the number of Tamil characters per Sinhala was found and these values were used for the Gale and Church's (1993) method."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-134",
"text": "Default values were used for the other methods during the evaluation."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-135",
"text": "For Moore's (2002) method, a bilingual word dictionary is built using the IBM Model 1."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-136",
"text": "However, this dictionary may lack significant vocabulary when the input corpus contains sparse data, as pointed out by Trieu and Nguyen (2015) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-137",
"text": "The output files from this method contain all the sentences from the input files that align 1-to-1 with probability greater than the \"threshold\" according to the statistical model computed by the aligner."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-138",
"text": "For evaluation using this method we used a threshold of 0.8 instead of the default value of 0.5."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-139",
"text": "Around 1300 sentences were extracted from pdf files and were aligned using these methods."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-140",
"text": "This corpus is publicly available 3 for the benefit of Sinhala and Tamil language computing."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-141",
"text": "The same sentences were manually aligned with the help of a human translator."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-142",
"text": "Then the automatically aligned sentences were compared with the manually aligned sentences to obtain the precision and recall values."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-144",
"text": "**EVALUATION**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-145",
"text": "The evaluation for sentence alignment was done by using data that was manually aligned."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-146",
"text": "The reason for this approach instead of getting the human translator to evaluate the automatically aligned sentences was to ensure that the manual evaluation was independent from the automatically produced output, as the automated alignments may influence the human aligner."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-147",
"text": "Furthermore this approach also facilitated the comparison of the performance of multiple methods."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-148",
"text": "Table 1 shows the precision, recall, and F-measure obtained for the six methods."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-150",
"text": "**DISCUSSION**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-151",
"text": "Most of the above methods (Gale and Church, 1993; Brown et al., 1991; Chen and S.F, 1993; Braune and Fraser, 2010) have been first used for English and French sentence alignment."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-152",
"text": "Both these languages have many similarities, which include the sentence structure and the sentence length."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-153",
"text": "The sentence structure of these languages is of the form subject-verb-object and the sentence length is quite close."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-154",
"text": "The same similarities can also be found in Sinhala and Tamil languages."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-155",
"text": "Sinhala and Tamil languages have the same sentence structure, Subject-Object-Verb."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-156",
"text": "Also the average sentence lengths of the two languages are quite close."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-157",
"text": "Considering 700 sentences, average length of Sinhala is 113.76 and for Tamil it is 130.53."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-158",
"text": "Therefore statistical methods have given good results in our case."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-159",
"text": "The lexical components used in the hybrid methods suggested above are also language independent."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-160",
"text": "Thus the hybrid methods are also applicable for Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-161",
"text": "We used Gale and Church (1993) method even though we could not align the paragraphs before aligning the sentences, due the dissimilarities among the text converted from pdfs."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-162",
"text": "The length of Tamil sentences was comparatively higher than Sinhala sentences and the correlation between Sinhala and Tamil was comparatively low, hence we cannot consider mean and variance as language independent as suggested by Gale and Church (1993) ."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-163",
"text": "Therefore we calculated the mean and variance for Sinhala and Tamil using 700 sentences."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-164",
"text": "Gale and Church (1993) introduced 1 as mean and 6.8 as variance for English and French Languages."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-165",
"text": "For Sinhala and Tamil, we figured out mean is 1.152 and variance is 1.860."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-166",
"text": "Even after changing the parameters for Sinhala and Tamil in the Gale and Church (1993) method, we obtained a comparatively low precision because this method does not only look at one to one alignments but also one to zero, many to one, one to many or many to many alignments."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-167",
"text": "Also according to Gale and Church (1993) , in this method one to zero alignment is never handled correctly."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-168",
"text": "Most misalignments arise due to one to zero, many to one to many or many to many alignments, resulting in methods that consider only one to one alignments to have better precision values."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-169",
"text": "Given the nature of the source documents used in this research, there were a significant non one-to-one alignments and incorrect translations, which affected the precision value."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-170",
"text": "However, as this method omits only a few sentences, it obtains high recall and F-Score than some of the other methods."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-171",
"text": "Since the text used for alignment in our case has considerably sparse data, the dictionary built in the Moore's (2002) method lacks significant vocabulary."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-172",
"text": "Furthermore because of the fact that Moore's (2002) method only considers one to one alignment, the recall obtained by this method is very low while the precision is very high."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-173",
"text": "In our case, even though there are alignments that are not one to one, the high precision of Moore's method has shown that it is possible to align a considerable number of sentences only by using one to one alignments."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-174",
"text": "According to Moore (2002) , in practice one to one alignments are the only alignments that are currently used for training machine translation systems."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-175",
"text": "The YASA aligner by Lamraoui and Langlais (2013) has proven to be robust to noise by having a good precision and recall for the parallel corpus of Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-176",
"text": "Also the Braune and Fraser's (2010) method is known to work better especially for corpora where the sentences do not align one to one that often."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-177",
"text": "However, our source text has a number of one to one alignments (as was proved by the alignment in Moore's (2002) method) along with other forms of alignments, which could be the reason for the low recall of this method."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-178",
"text": "Even though the method by Varga et al. (2005) has given the highest F-score, the results for this method could be improved using a better dictionary that includes all or most of the words that are used in the annual reports."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-179",
"text": "A factor significantly affecting the results of the alignment process was the quality of the source documents."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-180",
"text": "Compared to other documents such as parliamentary documents, news articles and subtitles commonly used in evaluating alignment, the annual reports we considered were of comparatively less quality including significant omissions and inconsistencies and high complexity with significant many to one, one to many, and many to many alignments."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-181",
"text": "The data set considered comprised of nearly 7% many to one, one to many or many to many alignments and nearly 15% one to zero or zero to one alignments indicating improper or incomplete translations."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-183",
"text": "**CONCLUSION**"
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-184",
"text": "We have addressed the problem of the lack of sentence aligned Sinhala-Tamil parallel corpus large enough to be useful in a multitude of natural language processing tasks."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-185",
"text": "We have experimented with a number of alignment techniques developed for other language pairs, introducing necessary modifications for Sinhala and Tamil, where applicable."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-186",
"text": "The results generated have been satisfactory, indicating that better results could be obtained with more language resources such as morphological analyzers, POS taggers and named entity recognizers, which are currently not fully developed for Sinhala."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-187",
"text": "This research is carried out as part of a major project to build a machine translation system between Sinhala and Tamil."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-188",
"text": "POS taggers and named entity recognizers are being developed as part of this larger project."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-189",
"text": "With the availability of these resources, methods utilizing these resources could also be introduced for Sinhala and Tamil in the near future, to obtain improved results."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-190",
"text": "Future work in improving the automatic generation of the Sinhala-Tamil parallel corpus includes experimenting with more techniques that have worked for other language pairs."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-191",
"text": "The suitability of techniques that specifically use language resources such as POS taggers and morphological analysers could also be evaluated with the availability of such resources of better quality."
},
{
"sent_id": "c327812b2369a1dfc8e2ce4077b997-C001-192",
"text": "Additionally the identified techniques could be evaluated with documents from different domains, whereas in this research evaluation has been done only with annual reports."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"c327812b2369a1dfc8e2ce4077b997-C001-19"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-43"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-56"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-60"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-88"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-151"
]
],
"cite_sentences": [
"c327812b2369a1dfc8e2ce4077b997-C001-19",
"c327812b2369a1dfc8e2ce4077b997-C001-43",
"c327812b2369a1dfc8e2ce4077b997-C001-56",
"c327812b2369a1dfc8e2ce4077b997-C001-60",
"c327812b2369a1dfc8e2ce4077b997-C001-88",
"c327812b2369a1dfc8e2ce4077b997-C001-151"
]
},
"@DIF@": {
"gold_contexts": [
[
"c327812b2369a1dfc8e2ce4077b997-C001-19",
"c327812b2369a1dfc8e2ce4077b997-C001-20",
"c327812b2369a1dfc8e2ce4077b997-C001-21"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-166"
]
],
"cite_sentences": [
"c327812b2369a1dfc8e2ce4077b997-C001-19",
"c327812b2369a1dfc8e2ce4077b997-C001-166"
]
},
"@USE@": {
"gold_contexts": [
[
"c327812b2369a1dfc8e2ce4077b997-C001-26"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-97"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-129"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-133"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-161"
]
],
"cite_sentences": [
"c327812b2369a1dfc8e2ce4077b997-C001-26",
"c327812b2369a1dfc8e2ce4077b997-C001-97",
"c327812b2369a1dfc8e2ce4077b997-C001-129",
"c327812b2369a1dfc8e2ce4077b997-C001-133",
"c327812b2369a1dfc8e2ce4077b997-C001-161"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"c327812b2369a1dfc8e2ce4077b997-C001-162"
],
[
"c327812b2369a1dfc8e2ce4077b997-C001-167"
]
],
"cite_sentences": [
"c327812b2369a1dfc8e2ce4077b997-C001-162",
"c327812b2369a1dfc8e2ce4077b997-C001-167"
]
},
"@EXT@": {
"gold_contexts": [
[
"c327812b2369a1dfc8e2ce4077b997-C001-166"
]
],
"cite_sentences": [
"c327812b2369a1dfc8e2ce4077b997-C001-166"
]
}
}
},
"ABC_46faad9d86cda118df5eb9c1e7df65_3": {
"x": [
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-131",
"text": "In addition, GEN uses the Training set very differently from classical, supervised models."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-133",
"text": "**DISCRIMINATIVE MODEL**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-154",
"text": "The result is a tree structure with one incoming relation per node."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-155",
"text": "In cases of nodes with multiple equiprobable incoming relations, the algorithm takes whichever relation it sees first."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-2",
"text": "This paper provides a detailed comparison of a data programming approach with (i) off-the-shelf, state-of-the-art deep learning architectures that optimize their representations (BERT) and (ii) handcrafted-feature approaches previously used in the discourse analysis literature."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-3",
"text": "We compare these approaches on the task of learning discourse structure for multi-party dialogue."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-4",
"text": "The data programming paradigm offered by the Snorkel framework allows a user to label training data using expert-composed heuristics, which are then transformed via the \"generative step\" into probability distributions of the class labels given the data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-5",
"text": "We show that on our task the generative model outperforms both deep learning architectures as well as more traditional ML approaches when learning discourse structure-it even outperforms the combination of deep learning methods and handcrafted features."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-6",
"text": "We also implement several strategies for \"decoding\" our generative model output in order to improve our results."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-7",
"text": "We conclude that weak supervision methods hold great promise as a means for creating and improving data sets for discourse structure."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-10",
"text": "In this paper, we investigate and demonstrate the potential of a weak supervision, data programming approach (Ratner et al., 2016) to the task of learning discourse structure for multi-party dialogue."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-11",
"text": "We offer a detailed comparison of our data programming approach with (i) off-the-shelf, state-of-the-art deep learning architectures that optimize their representations (BERT) and (ii) handcrafted-feature approaches previously used in the discourse analysis literature."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-12",
"text": "Our data programming paradigm exploits the Snorkel framework that allows a user to label training data using expert-composed heuristics, which are then transformed via the \"generative step\" into probability distributions of the class labels given the data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-13",
"text": "We show that the generative model produced from these heuristics outperforms both deep learning architectures as well as more traditional ML approaches when learning discourse structure by up to 20 points of F1 score; it even outperforms the combination of generative and discriminative approaches that are the foundation of the Snorkel framework."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-14",
"text": "We also implement several strategies for \"decoding\" our generative model output that improve our results."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-15",
"text": "We assume discourse structures are dependency structures (Muller et al., 2012; Li et al., 2014) and restrict the structure learning problem to predicting edges or attachments between discourse unit (DU) pairs in the dependency graph."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-16",
"text": "Although the problem of attachment is only a part of the overall task of discourse interpretation, it is a difficult problem that serves as a useful benchmark for various approaches to discourse parsing."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-17",
"text": "After training a supervised deep learning algorithm to predict attachments on the STAC annotated corpus 1 , we then constructed a weakly supervised learning system in which we used 10% of the corpus as a development set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-18",
"text": "Experts on discourse structure wrote a set of attachment rules, or labeling functions (LFs), and tested them against this development set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-19",
"text": "We treated the remainder of the corpus as raw/unannotated data to be automatically annotated using the data programming framework."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-21",
"text": "**STATE OF THE ART**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-22",
"text": "Discourse structures for texts represent causal, topical, argumentative information through what are called coherence relations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-23",
"text": "For dialogues with multiple interlocutors, extraction of their discourse structures provides useful semantic infor-mation to the \"downstream\" models used, for example, in the production of intelligent meeting managers or the analysis of user interactions in online fora."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-24",
"text": "However, despite considerable efforts to retrieve discourse structures automatically (Fisher and Roark, 2007; Duverle and Prendinger, 2009; Li et al., 2014; Joty et al., 2013; Ji and Eisenstein, 2014; Yoshida et al., 2014; Li et al., 2014; Surdeanu et al., 2015) , we are still a long way from usable discourse models, especially for dialogue."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-25",
"text": "Standard supervised models struggle to capture the sparse attachments, even when relatively large annotated corpora are available."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-26",
"text": "In addition, the annotation process is time consuming and often fraught with errors and disagreements, even among expert annotators."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-27",
"text": "This motivated us to explore the data programming approach that exploits expert linguistic knowledge in a more compact and consistent rule based form."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-28",
"text": "Given our interest in the analysis of multi-party dialogues, we used the STAC corpus of multiparty chats, an initial version of which is described in (Afantenos et al., 2015; Perret et al., 2016) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-29",
"text": "In all versions of this corpus, dialogue structures are directed acyclical graphs (DAGs) formed according to SDRT 2 (Asher and Lascarides, 2003; ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-30",
"text": "An SDRT discourse structure is a graph, V, E 1 , E 2 , , Last , where: V is a set of nodes or Discourse Units (DUs); E 1 \u2286 V 2 is a set of edges between DUs representing coherence relations; E 2 \u2286 V 2 represents a dependency relation between DUs; : E 1 \u2192 R is a labeling function that assigns a semantic type to an edge in E 1 from a set R of discourse relation types, and Last is a designated element of V giving the last DU relative to textual or temporal order."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-31",
"text": "E 2 is used to represent Complex Discourse Units (CDUs), which are clusters of two or more DUs connected as an ensemble to other DUs in the graph."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-32",
"text": "As learning this type of recursive structure presents difficulties beyond the scope of this paper, we followed a \"flattening\" strategy similar to (Muller et al., 2012) to remove CDUs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-33",
"text": "This process yields a set V * , which is V without CDUs, and a set E * 1 , a flattened version of E 1 ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-34",
"text": "Building these structures typically requires three steps: (i) segmenting the text into the basic units of the discourse, typically clauses -these are EDUs or Elementary Discourse Units; these, together with CDUs, form the set of nodes V in 2 Segmented Discourse Representation Theory the graph; (ii) predicting the attachments between DUs, i.e. to identify the elements in E 1 ; (iii) predicting the semantic type of the edge in E 1 ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-35",
"text": "This paper focuses on step (ii)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-36",
"text": "Our dialogue structures are thus of the form V * , E * 1 , Last ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-37",
"text": "Step (ii) is a difficult problem for automatic processing because attachments are theoretically possible between any two DUs in a dialogue or text, and often graphs include long-distance relations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-38",
"text": "Muller et al. (2012) is the first paper we know of that focuses on the discourse parsing attachment problem, albeit for monologue."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-39",
"text": "It targeted a restricted version of an SDRT graph and trains a simple MaxEnt algorithm to produce probability distributions over pairs of EDUs, what we call a \"local model\" with a positive F1 attachment score of 0.635."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-40",
"text": "They further applied global decoding constraints to produce a slight improvement in attachment scores (these are discussed in more detail in Section 5)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-41",
"text": "Afantenos et al. (2015) used a similar strategy for dialogue on an early version of the STAC corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-42",
"text": "Perret et al. (2016) targeted a more elaborate approximation of SDRT graphs on the same version of the STAC corpus and reported a local model F1 attachment of 0.483."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-43",
"text": "They then used Integer Linear Programming (ILP) to encode global decoding constraints particular to SDRT to improve the F1 attachment score to 0.689."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-44",
"text": "Having sketched recent progress in discourse parsing, we briefly turn to the state of the art concerning data programming."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-45",
"text": "Ratner et al. (2016) introduced the data programming paradigm, along with a framework, Snorkel , which uses a weak supervision method (Zhou, 2017) , to apply labels to large data sets by way of heuristic labeling functions that can access distant, disparate knowledge sources."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-46",
"text": "These labels are then used to train classic data-hungry machine learning (ML) algorithms."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-47",
"text": "The crucial step in the data programming process uses a generative model to unify the noisy labels by generating a probability distribution for all labels for each data point."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-48",
"text": "This set of probabilities replaces the ground-truth labels in a standard discriminative model outfitted with a noise-aware loss function and trained on a sufficiently large data set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-50",
"text": "**THE STAC ANNOTATED CORPUS**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-52",
"text": "**OVERVIEW**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-53",
"text": "While earlier versions only included linguistic moves by players, STAC now contains in addi-tion a multimodal corpus of multi-party chats between players of an online game Hunter et al., 2018) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-54",
"text": "It includes 2,593 dialogues (each with a weakly connected DAG discourse structure), 12,588 \"linguistic\" DUs, 31,811 \"non-linguistic\" DUs and 31,251 semantic relations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-55",
"text": "A dialogue begins at the beginning of a player's turn, and ends at the end of that player's turn."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-56",
"text": "In the interim, players can bargain with each other or make spontaneous conversation."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-57",
"text": "These player utterances are the \"linguistic\" turns."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-58",
"text": "In addition the corpus contains information given visually in the game interface but transcribed in the corpus into Server or interface messages, \"nonlinguistic\" turns (Hunter et al., 2018) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-59",
"text": "All turns are segmented into DUs, and these units are then connected by semantic relations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-60",
"text": "Each dialogue represents a complete conversation."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-61",
"text": "There are typically many such conversations, each beginning with a non-linguistic turn in which a player is designated to begin negotiations (see Figure 1 )."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-62",
"text": "The dialogues end when this player performs a non-linguistic action that signals the end of their turn."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-63",
"text": "The dialogues are the units on which we build a complete discourse structure."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-64",
"text": "The STAC multimodal corpus is divided into a development, train and test set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-65",
"text": "The development and test sets are each 10% of the total size of the corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-66",
"text": "To compare our approach to earlier efforts, we also used the corpus from (Perret et al., 2016) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-67",
"text": "This corpus was also useful to check for over fitting of our Generative model developed on the multi-modal data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-68",
"text": "The corpus from (Perret et al., 2016) is an early version of a \"linguistic only\" version of the STAC corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-69",
"text": "It contains no nonlinguistic DUs, unlike the STAC multimodal corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-70",
"text": "3 It also contains quite a few errors; for example, about 60 stories in the (Perret et al., 2016) dataset have no discourse structure in them at all and consist of only one DU."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-71",
"text": "We eliminated these from the Perret 2016 data set that we used in our comparative experiments below, as these sto-3 There is also on the STAC website an updated linguistic only version of the STAC corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-72",
"text": "It has 1,091 dialogues, 11,961 linguistic only DUs and 10,191 semantic relations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-73",
"text": "We have not reported results on that data set here."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-74",
"text": "The dataset from (Perret et al., 2016) is similar to our linguistic only STAC corpus but is still substantially different and degraded in quality."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-75",
"text": "report significant error rates in annotation on the earlier versions of the STAC corpus and that the current linguistic only corpus of STAC offers an improvement over the (Perret et al., 2016) corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-76",
"text": "ries were obviously not a correct representation of what was going on in the game at the relevant point."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-78",
"text": "**DATA PREPARATION**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-79",
"text": "To concentrate on the attachment task, we implemented the following simplifying measures on the STAC corpus:"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-80",
"text": "1. Roughly 56% of the dialogues in the corpus contain only non-linguistic DUs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-81",
"text": "The discourse structure of these dialogues is more regular and thus less challenging; so we ignore these dialogues for our prediction task."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-82",
"text": "2. 98% of the discourse relations in our development corpus span 10 DUs or less."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-83",
"text": "To reduce class imbalance, we restricted the relations we consider to a distance of \u2264 10."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-84",
"text": "3. Following (Muller et al., 2012; Perret et al., 2016) we \"flatten\" CDUs by connecting all relations incoming or outgoing from a CDU to the \"head\" of the CDU, or its first DU."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-85",
"text": "The STAC corpus as we use it in our learning experiments thus includes 1,130 dialogues, 13,734 linguistic DUs, 18,767 non-linguistic DUs and 22,098 semantic relations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-86",
"text": "We also performed these operations on our version of the linguistic only corpus used by (Perret et al., 2016) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-88",
"text": "**DATA PROGRAMMING EXPERIMENTS**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-90",
"text": "**CANDIDATES AND LABELING FUNCTIONS**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-91",
"text": "The Snorkel implementation of the data programming paradigm inspired our weak supervision approach."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-92",
"text": "We first identified and extracted candidates from the data, and then wrote a set of labeling functions (LFs) to apply to the candidates while only consulting the development set, the 10% of the STAC corpus set aside to develop and test our LFs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-93",
"text": "We treated the training set (80% of the corpus) as unseen/unlabeled data to which we applied the finished LFs and the models."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-94",
"text": "The last 10% of the STAC corpus was reserved as a final test set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-95",
"text": "Candidates are the units of data for which labels are predicted."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-96",
"text": "For this study, the candidates are all DU pairs which could possibly be connected by a semantic relation."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-97",
"text": "We use our own method to create candidates from the DUs culled from the texts of the dialogues, making sure to limit the pairs to Although the rules are written to capture specific relation types between segments, they return 1/0 for attached/not attached those that occur in the same dialogues."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-98",
"text": "We also ruled out the possibility of backwards relations between two DUs which have different speakers: it is linguistically impossible for a speaker of, say, an assertion d 1 at time t 1 to answer a question d 2 , asked by a different speaker at time t 2 > t 1 , i.e. before d 2 was asked."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-99",
"text": "That is, we include (d 1 , d 2 ) in our candidates but rule out (d 2 , d 1 )."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-100",
"text": "LFs are expert-composed functions that make an attachment prediction for a given candidate: each LF returns a 1, a 0 or a -1 (\"attached\"/\"do not know\"/\"not attached\") for each candidate."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-101",
"text": "However, each of our LFs is written and evaluated with a specific relation type Result, Question-answerpair (QAP), Continuation, Sequence, Acknowledgement, Conditional, Contrast, Elaboration and Comment in mind."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-102",
"text": "In this way, LFs leverage a kind of type-related information, which makes sense from an empirical perspective as well as an epistemological one."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-103",
"text": "An attachment decision concerning two DUs is tightly linked to the type of relation relating the DUs: when an annotator decides that two DUs are attached, he or she does so with some knowledge of what type of relation attaches them."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-104",
"text": "Figure 2 shows a sample LFs used for attachment prediction with the Result relation in mind."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-105",
"text": "LFs also exploit information about the DUs' linguistic or non-linguistic status, the dialogue acts they express, their lexical content, grammatical category and speaker, and the distance between them-features also used in supervised learning methods (Perret et al., 2016; Afantenos et al., 2015; Muller et al., 2012) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-106",
"text": "Finally, we fix the order in which each LF \"sees\" the candidates such that it considers adjacent DUs before distant DUs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-107",
"text": "This allows LFs to exploit information about previously predicted attachments and dialogue history in new predictions."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-108",
"text": "Our rule set and their description are available here: https://tizirinagh.github.io/acl2019/. Figure 2 gives an example of a labeling function that we used."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-110",
"text": "**THE GENERATIVE MODEL**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-111",
"text": "Once the LFs are applied to all the candidates, we have a matrix of labels \u039b, containing the label given by each LF for each candidate."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-112",
"text": "The generative model, GEN, as specified in (1), provides a distribution of marginal probabilities in terms of n accuracy dependencies \u03c6_j(\u039b_i, y_i), one for each LF \u03bb_j, defined over the inputs x_i, the LF's outputs \u039b_ij on x_i, and the true labels y_i, and parameterized by \u03b8_j, where:"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-113",
"text": "The parameters are estimated by minimizing the negative log marginal likelihood of an observed label matrix \u039b, as in (2)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-114",
"text": "GEN does not have access to the gold labels on the Training set but uses the Training set as unlabelled data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-115",
"text": "So in this model, the true class labels y_i are latent variables that generate the labeling function outputs; they are estimated via Gibbs sampling over the Training set (80% of the STAC corpus), after it has been labeled by the LFs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-116",
"text": "The objective in (2) is then optimized by interleaving stochastic gradient descent steps with Gibbs sampling ones."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-117",
"text": "For each candidate, GEN thus uses the accuracy measures for the LFs in (1) to assign marginal probabilities that two DUs are attached."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-118",
"text": "GEN estimates the accuracy of each LF, a marginal probability for each label, and consequently a probability for positive attachment."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-119",
"text": "In this model, the true class labels y i are latent variables that generate the labeling function outputs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-120",
"text": "The model in (1) presupposes that the LFs are independent, but this assumption does not always hold: one LF might be a variation of another, or they might depend on a common source of information (Mintz et al., 2009)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-121",
"text": "If we don't take these dependencies into account, we risk assigning incorrect accuracies to the LFs."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-122",
"text": "Snorkel provides a more complex model that automatically calculates the dependencies between LFs and the marginal probabilities, which we use for the generative step."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-123",
"text": "The higher order dependencies significantly improved the generative model's results on the full STAC corpus (see Table 1)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-124",
"text": "When we obtain the results from the generative model GEN, we choose a threshold to apply to these marginals by finding, on the development corpus, the threshold that gives us the best F1 score."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-125",
"text": "The best threshold is 0.85 (p > .85 for positive attachment) in the STAC corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-126",
"text": "Figure 3 shows the probability distribution, on which even taking 0.8 as a threshold gives a lower F1 score because of false positive attachments."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-127",
"text": "Binarizing these marginals allows us to pass these binarized probabilities to the discriminative model."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-128",
"text": "This also allows us to evaluate GEN with respect to the gold \"attachment\"/\"non-attachment\" labels on the STAC test data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-129",
"text": "The generative model GEN shares with other \"local\" models the feature that it considers pairs of DUs in isolation from the whole structure."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-130",
"text": "However, unlike other local models, our LFs enable GEN to exploit prior decisions on pairs of DUs, and thus we exploit more contextual information about discourse structure in GEN than in our classical, supervised local models."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-134",
"text": "The standard Snorkel approach inputs the marginal probabilities from the generative step directly into a discriminative model, which is trained on those probabilities using a noise-aware loss function (Ratner et al., 2016) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-135",
"text": "Ideally, this step generalizes the LFs by augmenting the feature representation (from, say, dozens of LFs to a high-dimensional feature space) and allows the model to predict labels for more new data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-136",
"text": "Thus the precision potentially lost in the generalization is offset by a larger increase in recall."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-137",
"text": "We tested three discriminative models in our study."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-138",
"text": "Each one was trained on the gold labeled data in the Training set of the STAC corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-139",
"text": "First we tried a single-layer BI-LSTM with 300 neurons, which takes as input 100-dimensional embeddings for the text of each DU in the candidate pair."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-140",
"text": "We concatenated the outputs of the BI-LSTM and fed them to a simple perceptron with one hidden layer and Rectified Linear Unit (ReLU) activation (Hahnloser et al., 2000; Jarrett et al., 2009; Nair and Hinton, 2010) and optimized with Adam (Kingma and Ba, 2014)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-141",
"text": "Given that our data is extremely unbalanced in favor of the \"unattached\" class (\"attached\" candidates make up roughly 13% of the candidates on the development set), we also implemented a class-balancing method inspired by (King and Zeng, 2001), which maps class indices to weight values used for weighting the loss function during training."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-142",
"text": "We also implemented the sequence classification model of BERT (Devlin et al., 2018) (source code at the link below 4 ), with 10 training epochs and all default parameters otherwise."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-143",
"text": "BERT, the Bidirectional Encoder Representations from Transformers, is a text encoder pre-trained using language modeling, where the system has to guess a missing word or word piece removed at random from the text."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-144",
"text": "Built on the Transformer architecture originally designed for machine translation tasks, BERT uses bidirectional self-attention to produce the encodings and performs at the state of the art on many textual classification tasks."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-145",
"text": "In order to use these methods, we had to binarize the marginal probabilities before moving to the discriminative step, using a threshold of p > .85 as explained in Section 4.2."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-146",
"text": "Though this marks a departure from the standard Snorkel approach, we found that our discriminative model results were higher when the marginals were binarized and when the class re-balancing was used, albeit much lower than expected overall."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-147",
"text": "Finally, to facilitate comparison with earlier work, we also implemented a local model, LogReg* (as it is referred to in the remainder of the paper), that used marginal probabilities together with handcrafted features (the feature set used in (Afantenos et al., 2015), listed in their Table 2) and a Logistic Regression classifier."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-149",
"text": "**DECODING**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-150",
"text": "A set of highly accurate predictions for individual candidates does not necessarily lead to accurate discourse structures; for instance, without global structural constraints, GEN and local models may not yield the directed acyclic graphs (DAGs) required by SDRT."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-151",
"text": "As in previous work (Muller et al., 2012; Afantenos et al., 2015; Perret et al., 2016) , we use the Maximum Spanning Tree (MST) algorithm, and a variation thereof, to ensure that the dialogue structures predicted conform to some more general structural principle."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-152",
"text": "We implemented the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967) , an efficient method of finding the highest-scoring non-projective tree in a directed graph, as described in Jurafsky and Martin 5 ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-153",
"text": "The algorithm greedily selects the relations with the highest probabilities from the dependency graphs produced by the local model, then removes any cycles."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-156",
"text": "Since SDRT structures can contain nodes with multiple incoming relations, i.e. are not always tree-like, we altered the MST algorithm in the manner of (Muller et al., 2012; Afantenos et al., 2015; Perret et al., 2016) , forcing the MST to include all high-probability incoming relations which do not create cycles."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-157",
"text": "This produces MS-DAG structures which are in principle more faithful to SDRT."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-158",
"text": "In addition, since discourse attachments in general follow an inverse power law (many short-distance attachments and fewer long-distance attachments), we implemented two MST/MS-DAG variants that always choose the shortest relation among multiple high-probability relations (MST/short and MS-DAG/short)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-160",
"text": "**RESULTS AND ANALYSIS**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-161",
"text": "We set out to test the performance of combinations of generative and discriminative models along the lines of the data programming paradigm on the task of dialogue structure prediction in order to automatically generate SDRT corpus data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-162",
"text": "While our results were consistent with what data programming promises (more data, with accuracy comparable to, if not slightly below, that of hand-labeled data), our most surprising and interesting result was the performance of the generative model on its own."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-163",
"text": "As seen in Table 2 on STAC test data, GEN dramatically outperformed our deep learning baselines (BiLSTM, BERT, and BERT+LogReg* architectures trained on gold labels), as well as the LAST baseline, which attaches every DU in a dialogue to the DU directly preceding it."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-164",
"text": "In addition, standalone GEN also outperformed all the coupled Snorkel models, in which GEN is combined with an added discriminative step, by up to a 30-point improvement in F1 score (GEN vs. GEN+BiLSTM)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-165",
"text": "We did not expect this, given that adding a discriminative model in Snorkel is meant to generalize, and hence improve, what GEN learns."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-166",
"text": "Critical to this success was the inclusion of higher order dependencies in GEN and the fact that our LFs exploited contextual information about the DUs that was unavailable to the deep learning models or even the handcrafted feature model."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-167",
"text": "GEN beats all competitors in terms of F1 score while taking a fraction of the annotated data to develop and train the model, showing the power and promise of the generative model."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-168",
"text": "One might wonder whether GEN and discriminative models are directly comparable."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-169",
"text": "Generative machine learning algorithms learn the joint probability of X and Y, whereas discriminative algorithms learn the conditional probability of Y given X. Nevertheless, when we exploit the generative model we are trying to find the Y for which P(X \u2227 Y) is maximized."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-170",
"text": "In effect we are producing, though not learning, a conditional probability."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-171",
"text": "So it makes sense to compare our generative model's output with that of other, discriminative machine learning approaches."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-172",
"text": "We also got surprising results concerning the supervised model benchmarks."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-173",
"text": "Table 2 shows that LogReg* was the best supervised learning method on the STAC data in terms of producing local models."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-174",
"text": "This is evidence that handcrafted features capturing non-local information about a DU's context do better than all-purpose contextual encodings from neural nets, at least on this task."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-175",
"text": "We also implemented BERT+LogReg*, a learning algorithm that uses BERT's encodings together with a Logistic Regression classifier trained on STAC's gold data with handcrafted features from (Afantenos et al., 2015) and used in (Perret et al., 2016) ."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-176",
"text": "BERT+LogReg* outputs a local model that improves upon BERT's local model, but it did not do as well as LogReg* on its own (let alone GEN), suggesting that BERT's encodings actually interfered with the correct predictions."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-177",
"text": "We also investigated GEN coupled with various discriminative models to test the standard Snorkel architecture."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-178",
"text": "As we remarked above, we found that binarizing GEN's output improved the performance of the coupled discriminative model, so Table 2 only reports scores for various GEN-coupled discriminative models that take the binarized GEN predictions as input."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-179",
"text": "In keeping with our analysis of the supervised benchmarks, we found that the best discriminative model to couple with GEN in the Snorkel architecture was LogReg*, far outperforming GEN with either a BiLSTM or BERT on the STAC test set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-180",
"text": "Its results were only slightly worse than those of standalone GEN."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-181",
"text": "To further investigate comparisons between different architectures for solving the attachment problem, we compared various local models extended with the MS-DAG decoding algorithm discussed in Section 5, giving the global results shown in the righthand columns of Tables 3 and 4."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-182",
"text": "With MS-DAG added, GEN continued to outperform all other approaches on STAC data."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-183",
"text": "In Table 5 , we experimented with adding all decoding algorithms to the local GEN result."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-184",
"text": "This gave a boost in F1 score: 2 points with classic MST and MS-DAG, and 4 points with the variants favoring relation instances with shorter attachments."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-185",
"text": "It is not surprising that MST improves the GEN results, since it eliminates some of the false positive relations that pass the generative threshold and includes some of the false negative relations that fall below the threshold."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-186",
"text": "The general inverse power law distribution of discourse attachments explains the good performance of the MST shortest link variant."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-187",
"text": "GEN + \"MST-short\" has the highest attachment score of all approaches to the problem of attachment in the literature (Morey et al., 2018) , though we are cautious in comparing scores for systems applied to different corpora."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-188",
"text": "Finally, we wanted to see how GEN and our other models fared on our version of the (Perret et al., 2016) data set. GEN has higher scores than LogReg*'s local model; but with a decoding mechanism similar to that reported in (Perret et al., 2016), LogReg*'s global model significantly improves over GEN's."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-189",
"text": "We see a 6 point loss in F1 score on GEN's global model relative to LogReg*'s, even though both used identical MST decoding mechanisms."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-190",
"text": "This is what one would expect from a Snorkel-based architecture, although it is not the pattern we observed for GEN on the STAC corpus."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-191",
"text": "The only reason GEN did not beat LogReg* is that it did not get a sufficient boost from decoding."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-192",
"text": "We think that this happened because our LFs already contain a lot of global information about the discourse structure, which meant that MST had less of an effect."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-193",
"text": "Note, however, that even on the (Perret et al., 2016) data set, the MST decoding mechanism provided LogReg* only a boost of 12 F1 points, as seen in Table 4, which is significantly lower than what is reported in (Perret et al., 2016)."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-194",
"text": "This 12-point boost is the upper limit for boosts with MST that we were able to reproduce."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-195",
"text": "This could be a result of our eliminating the degraded one-EDU stories from the data set."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-197",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-198",
"text": "We have compared a weak supervision approach, inspired by Snorkel, with a standard supervised model on the difficult task of discourse attachment."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-199",
"text": "The results of the model from Snorkel's generative step surpass those of a standard supervised learning approach, proving it more than competitive with standard approaches."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-200",
"text": "It can also generate a lot of annotated data in a very short time relative to what is needed for a traditional approach: the STAC corpus took at least 4 years to build, while we created and refined our labeling functions in two months."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-201",
"text": "In addition, a big advantage of the generative model from the learning point of view is that we don't have the class balancing or class drowning problems that plague tasks like discourse attachment."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-202",
"text": "This is because the generative model's predictions are generated in a very different way from those of a discriminative model, which are based on inductive generalizations."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-203",
"text": "Still it is clear that we must further investigate the interaction of the generative and discriminative models in order to eventually leverage the power of generalization that a discriminative model is supposed to afford."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-204",
"text": "In future work, we will enrich our weak supervision system by giving the LFs access to more sophisticated contexts that take into account global structuring constraints, in order to see how they compare to simple, exogenous decoding constraints like MST."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-205",
"text": "We think we can impose much more sophisticated decoding constraints that work not just off the local probabilities of arcs but off the full information about each arc in its discourse context."
},
{
"sent_id": "46faad9d86cda118df5eb9c1e7df65-C001-206",
"text": "This will lead us to understand how weakly supervised methods can effectively capture the global structural constraints on discourse structures directly without decoding or elaborate learning architectures."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"46faad9d86cda118df5eb9c1e7df65-C001-28"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-66",
"46faad9d86cda118df5eb9c1e7df65-C001-67",
"46faad9d86cda118df5eb9c1e7df65-C001-68",
"46faad9d86cda118df5eb9c1e7df65-C001-69",
"46faad9d86cda118df5eb9c1e7df65-C001-70"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-84"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-86"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-151"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-175"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-188"
]
],
"cite_sentences": [
"46faad9d86cda118df5eb9c1e7df65-C001-28",
"46faad9d86cda118df5eb9c1e7df65-C001-66",
"46faad9d86cda118df5eb9c1e7df65-C001-68",
"46faad9d86cda118df5eb9c1e7df65-C001-70",
"46faad9d86cda118df5eb9c1e7df65-C001-84",
"46faad9d86cda118df5eb9c1e7df65-C001-86",
"46faad9d86cda118df5eb9c1e7df65-C001-151",
"46faad9d86cda118df5eb9c1e7df65-C001-175",
"46faad9d86cda118df5eb9c1e7df65-C001-188"
]
},
"@DIF@": {
"gold_contexts": [
[
"46faad9d86cda118df5eb9c1e7df65-C001-70",
"46faad9d86cda118df5eb9c1e7df65-C001-71"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-75"
],
[
"46faad9d86cda118df5eb9c1e7df65-C001-193"
]
],
"cite_sentences": [
"46faad9d86cda118df5eb9c1e7df65-C001-70",
"46faad9d86cda118df5eb9c1e7df65-C001-75",
"46faad9d86cda118df5eb9c1e7df65-C001-193"
]
},
"@SIM@": {
"gold_contexts": [
[
"46faad9d86cda118df5eb9c1e7df65-C001-74"
]
],
"cite_sentences": [
"46faad9d86cda118df5eb9c1e7df65-C001-74"
]
},
"@BACK@": {
"gold_contexts": [
[
"46faad9d86cda118df5eb9c1e7df65-C001-105"
]
],
"cite_sentences": [
"46faad9d86cda118df5eb9c1e7df65-C001-105"
]
},
"@EXT@": {
"gold_contexts": [
[
"46faad9d86cda118df5eb9c1e7df65-C001-156"
]
],
"cite_sentences": [
"46faad9d86cda118df5eb9c1e7df65-C001-156"
]
}
}
},
"ABC_d6d8f08147e45acc0a61692abb37a9_3": {
"x": [
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-2",
"text": "Abstract."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-3",
"text": "This paper describes a novel method for word sense disambiguation that utilizes relatives (i.e. synonyms, hypernyms, meronyms, etc. in WordNet) of a target word and raw corpora."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-4",
"text": "The method disambiguates senses of a target word by selecting a relative that most probably occurs in a new sentence including the target word."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-5",
"text": "Only one co-occurrence frequency matrix is utilized to efficiently disambiguate senses of many target words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-6",
"text": "Experiments on several English datasets show that our proposed method achieves good performance."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-9",
"text": "Word sense disambiguation (WSD) has long been known as a very important field of natural language processing (NLP) and has been studied steadily since the advent of NLP in the 1950s."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-10",
"text": "In spite of the long study, few WSD systems are used for practical NLP applications unlike part-of-speech (POS) taggers and syntactic parsers."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-11",
"text": "The reason is that most WSD studies have focused on only a small number of ambiguous words covered by a sense tagged corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-12",
"text": "In other words, the previous WSD systems disambiguate senses of just a few words, and hence are not helpful for other NLP applications because of their low coverage."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-13",
"text": "Why have the studies about WSD stayed on the small number of ambiguous words?"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-14",
"text": "The answer lies in the sense tagged corpora, where only a few words are assigned correct senses."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-15",
"text": "Since the construction of a sense tagged corpus needs a great amount of time and cost, most current sense tagged corpora contain a small number of words, fewer than 100, and the corresponding senses of those words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-16",
"text": "Corpora that have sense information for all words have been built recently, but they are not large enough to provide sufficient disambiguation information for all the words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-17",
"text": "Therefore, the methods based on the sense tagged corpora have difficulties in disambiguating senses of all words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-18",
"text": "In this paper, we proposed a novel WSD method that requires no sense tagged corpus 1 and that identifies senses of all words in sentences or documents, not a small number of words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-19",
"text": "Our proposed method depends on raw corpus, which is relatively very large, and on WordNet [1] , which is a lexical database in a hierarchical structure."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-21",
"text": "**RELATED WORKS**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-22",
"text": "There are several works for WSD that do not depend on a sense tagged corpus, and they can be classified into three approaches according to main resources used: raw corpus based approach [2] , dictionary based approach [3, 4] and hierarchical lexical database approach."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-23",
"text": "The hierarchical lexical database approach can be reclassified into three groups according to usages of the database: gloss based method [5] , conceptual density based method [6, 7] and relative based method [8, 9, 10] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-24",
"text": "Since our method is a kind of the relative based method, this section describes the related works of the relative based method."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-25",
"text": "[8] introduced the relative based method using International Roget's Thesaurus as a hierarchical lexical database."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-26",
"text": "His method is conducted as follows: 1) Get relatives of each sense of a target word from the Roget's Thesaurus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-27",
"text": "2) Collect example sentences of the relatives, which are representative of each sense."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-28",
"text": "3) Identify salient words in the collective context and determine weights for each word."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-29",
"text": "4) Use the resulting weights to predict the appropriate sense for the target word occurring in a novel text."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-30",
"text": "He evaluated the method on 12 English nouns, and showed over 90% precision."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-31",
"text": "However, the evaluation was conducted on just a small part of senses of the words, not on all senses of them."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-32",
"text": "He indicated that a drawback of his method lies in ambiguous relatives: usually just one sense of an ambiguous relative is related to the target word, while the other senses are not."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-33",
"text": "Hence, a collection of example sentences of an ambiguous relative includes example sentences irrelevant to the target word, which prevents WSD systems from collecting correct WSD information."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-34",
"text": "For example, the ambiguous word rail is a relative of the meaning bird of the target word crane in WordNet, but the word rail means railway for the most part, not the meaning related to bird."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-35",
"text": "Therefore, most of the example sentences of rail are not helpful for WSD of crane."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-36",
"text": "His method has another problem in disambiguating senses of a large number of target words because it requires a great amount of time and storage space to collect example sentences of relatives of the target words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-37",
"text": "[9] followed the method of [8] , but tried to resolve the ambiguous relative problem by using just unambiguous relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-38",
"text": "That is, the ambiguous relative rail is not utilized to build a training data of the word crane because the word rail is ambiguous."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-39",
"text": "Another difference from [8] is in the lexical database: they utilized WordNet as a lexical database for acquiring relatives of target words instead of International Roget's Thesaurus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-40",
"text": "Since WordNet is freely available for research, various kinds of WSD studies based on WordNet can be compared with the method of [9] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-41",
"text": "They evaluated their method on 14 ambiguous nouns and achieved a good performance comparable to the methods based on the sense tagged corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-42",
"text": "However, the evaluation was conducted on a small part of senses of the target words like [8] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-43",
"text": "However, many senses in WordNet do not have unambiguous relatives through relationships such as synonyms, direct hypernyms, and direct hyponyms."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-44",
"text": "2 A possible alternative is to use the unambiguous relatives in the long distance from a target word, but the way is still problematic because the longer the distance of two senses is, the weaker the relationship between them is."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-45",
"text": "In other words, the unambiguous relatives in the long distance may provide irrelevant examples for WSD like ambiguous relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-46",
"text": "Hence, the method has difficulties in disambiguating senses of words that do not have unambiguous relatives near the target words in the WordNet."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-47",
"text": "The problem becomes more serious when verbs, which most of the relatives are ambiguous, are disambiguated."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-48",
"text": "Like [8] , the method also has a difficulty in disambiguating senses of many words because the method collects the example sentences of relatives of many words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-49",
"text": "[10] reimplemented the method of [9] using a web, which may be a very large corpus, in order to collect example sentences."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-50",
"text": "They built training datum of all noun words in WordNet whose size is larger than 7GB, but evaluated their method on a small number of nouns of lexical sample task of SENSEVAL-2 as [8] and [9] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-52",
"text": "**WORD SENSE DISAMBIGUATION BY RELATIVE SELECTION**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-53",
"text": "Our method disambiguates senses of a target word in a sentence by selecting only a relative among the relatives of the target word that most probably occurs in the sentence."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-54",
"text": "A flowchart of our method is presented in Figure 1 with an example 3 : 1) Given a new sentence including a target word, a set of relatives of the target word is created by looking up in WordNet."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-55",
"text": "2) Next, the relative that most probably occurs in the sentence is chosen from the set."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-56",
"text": "In this step, cooccurrence frequencies between relatives and words in the sentence are used in order to calculate the probabilities of relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-57",
"text": "Our method does not depend on the training data, but on co-occurrence frequency matrix."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-58",
"text": "Hence in our method, it is not necessary to build the training data, which requires too much time and space."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-59",
"text": "3) Finally, a sense of the target word is determined as the sense that is related to the selected relative."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-60",
"text": "In this example, the relative stork is selected with the highest probability and the proper sense is determined as crane#1, which is related to the selected relative stork."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-61",
"text": "Our method makes use of ambiguous relatives as well as unambiguous relatives unlike [9] and hence overcomes the shortage problem of relatives and also reduces the problem of ambiguous relatives in [8] by handling relatives separately instead of putting example sentences of the relatives together into a pool."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-63",
"text": "**RELATIVE SELECTION**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-64",
"text": "The selected relative of the i-th target word tw i in a sentence C is defined to be the relative of tw i that has the largest co-occurrence probability with the words in the sentence:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-65",
"text": "where SR is the selected relative, r ij is the j-th relative of tw i , S rij is a sense of tw i that is related to the relative r ij , and W is a weight of r ij ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-66",
"text": "The right hand side of Eq. 1 is logarithmically calculated by Bayesian rule:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-67",
"text": "The first probability in Eq. 2 is computed under the assumption that words in C occur independently as follows:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-68",
"text": "where w k is the k-th word in C and n is the number of words in C. The probability of w k given r ij is calculated:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-69",
"text": "where P (r ij , w k ) is a joint probability of r ij and w k , and P (r ij ) is a probability of r ij ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-70",
"text": "Other probabilities in Eq. 2 and 4 are computed as follows:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-71",
"text": "where f req(r ij , w k ) is the frequency that r ij and w k co-occur in a raw corpus, f req(r ij ) is the frequency of r ij in the corpus, and CS is a corpus size, which is the sum of frequencies of all words in the raw corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-72",
"text": "W Nf(S rij ) and W Nf(tw i ) is the frequency of a sense related to r ij and tw i in WordNet."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-73",
"text": "4 In Eq. 7, 0.5 is a smoothing factor and n is the number of senses of tw i ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-74",
"text": "Finally, in Eq. 2, the weights of relatives, W (r ij , tw i ), are described in following Section 3.1."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-75",
"text": "Relative Weight."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-76",
"text": "WordNet provides relatives of words, but all of them are not useful for WSD."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-77",
"text": "That is to say, it is clear that most of ambiguous relatives may bring about a problem by providing example sentences irrelevant to the target word to WSD system as described in the previous section."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-78",
"text": "However, WordNet as a lexical database is classified as a fine-grained dictionary, and consequently some words are classified into ambiguous words though the words represent just one sense in the most occurrences."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-79",
"text": "Such ambiguous relatives may be useful for WSD of target words that are related to the most frequent senses of the ambiguous relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-80",
"text": "For example, a relative bird of a word crane is an ambiguous word, but it usually represents one meaning, \"warm-blooded egglaying vertebrates characterized by feathers and forelimbs modified as wings\", which is closely related to crane."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-81",
"text": "Hence, the word bird can be a useful relative of the word crane though the word bird is ambiguous. But the ambiguous relative is not useful for other target words that are related to the least frequent senses of the relatives: that is, a relative bird is never helpful to disambiguate the senses of a word birdie, which is related to the least frequent sense of the relative bird."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-82",
"text": "We employ a weighting scheme for relatives in order to identify useful relatives for WSD."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-83",
"text": "In terms of weights of relatives, our intent is to provide the useful relative with high weights, but the useless relatives with low weights."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-84",
"text": "For instance, a relative bird of a word crane has a high weight whereas a relative bird of a word birdie get a low weight."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-85",
"text": "For the sake of the weights, we calculate similarities between a target word and its relatives and determine the weight of each relative based on the degree of the similarity."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-86",
"text": "Among similarity measures between words, the total divergence to the mean (TDM) is adopted, which is known as one of the best similarity measures for word similarity [11] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-87",
"text": "Since TDM estimates a divergence between vectors, not between words, words have to be represented by vectors in order to calculate the similarity between the words based on the TDM."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-88",
"text": "We define vector elements as words that occur more than 10 in a raw corpus, and build vectors of words by counting co-occurrence frequencies of the words and vector elements."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-89",
"text": "TDM does measure the divergence between words, and hence a reciprocal of the TDM measure is utilized as the similarity measure:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-90",
"text": "where Sim("
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-92",
"text": "**CO-OCCURRENCE FREQUENCY MATRIX**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-93",
"text": "In order to select a relative for a target word in a given sentence, we must calculate probabilities of relatives given the sentence, as described in previous section."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-94",
"text": "These probabilities as Eq. 5 and 6 can be estimated based on frequencies of relatives and co-occurrence frequencies between each relative and each word in the sentence."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-95",
"text": "In order to acquire the frequency information for calculating the probabilities, the previous relative based methods constructed a training data by collecting example sentences of relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-96",
"text": "However, to construct the training data requires a great amount of time and storage space."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-97",
"text": "What is worse, it is an awful work to construct training datum of all ambiguous words, whose number is over than 20,000 in WordNet."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-98",
"text": "Instead, we build a co-occurrence frequency matrix (CFM) from a raw corpus that contains frequencies of words and word pairs."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-99",
"text": "A value in the i-th row and j-th column in the CFM represents the co-occurrence frequency of the i-th word and j-th word in a vocabulary, and a value in the i-th row and the i-th column represents the frequency of the i-th word."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-100",
"text": "The CFM is easily built by counting words and word pairs in a raw corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-101",
"text": "Furthermore, it is not necessary to make a CFM per each ambiguous word since a CFM contains frequencies of all words including relatives and word pairs."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-102",
"text": "Therefore, our proposed method disambiguates senses of all ambiguous words efficiently by referring to only one CFM."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-103",
"text": "The frequencies in Eq. 5 and 6 can be obtained through a CFM as follows:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-104",
"text": "where w i is a word, and cf m(i, j) represents the value in the i-th row and j-th column of the CFM, in other word, the frequency that the i-th word and j-th word co-occur in a raw corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-106",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-108",
"text": "**EXPERIMENTAL ENVIRONMENT**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-109",
"text": "Experiments were carried out on several English sense tagged corpora: SemCor and corpora for both lexical sample task and all words task of both SENSEVAL-2 & -3."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-110",
"text": "5 SemCor [12] 6 is a semantic concordance, where all content words (i.e. noun, verb, adjective, and adverb) are assigned to WordNet senses."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-111",
"text": "SemCor consists of three parts: brown1, brown2 and brownv."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-112",
"text": "We used all of the three parts of the SemCor for evaluation."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-113",
"text": "In our method, raw corpora are utilized in order to build a CFM and to calculate similarities between words for the sake of the weights of relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-114",
"text": "We adopted Wall Street Journal corpus in Penn Treebank II [13] and LATIMES corpus in TREC as raw corpora, which contain about 37 million word occurrences."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-115",
"text": "Our CFM contains frequencies of content words and content word pairs."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-116",
"text": "In order to identify the content words from the raw corpus, Tree-Tagger [14] , which is a kind of automatic POS taggers, is employed."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-117",
"text": "WordNet provides various kinds of relationships between words or synsets."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-118",
"text": "In our experiments, the relatives in Table 1 are utilized according to POSs of target words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-119",
"text": "In the table, hyper3 means 1 to 3 hypernyms (i.e. parents, grandparents and great-grandparent) and hypo3 is 1 to 3 hyponyms (i.e. children, grandchildren and great-grandchildren)."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-121",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-122",
"text": "Comparison with Other Relative Based Methods."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-123",
"text": "We tried to compare our proposed method with the previous relative based methods."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-124",
"text": "However, both of [8] and [9] did not evaluate their methods on a publicly available data."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-125",
"text": "We implemented their methods and compared our method with them on the same evaluation data."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-126",
"text": "When both of the methods are implemented, it is practically difficult to collect example sentences of all target words in the evaluation data."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-127",
"text": "Instead, we implemented the previous methods to work with our CFM."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-128",
"text": "WordNet was utilized as a lexical database to acquire relatives of target words and the sense disambiguation modules were implemented by using on Na\u00efve Bayesian classifier, which [9] adopted though [8] utilized International Roget's Thesaurus and other classifier similar to decision lists."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-129",
"text": "Also the bias of word senses, which is presented at WordNet, is reflected on the implementation in order to be in a same condition with our method."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-130",
"text": "Hence, the reimplemented methods in this paper are not exactly same with the previous methods, but the main ideas of the methods are not corrupted."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-131",
"text": "A correct sense of a target word tw i in a sentence C is determined as follows:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-132",
"text": "where Sense(tw i , C) is a sense of tw i in C, s ij is the j-th sense of tw i ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-133",
"text": "P wn (s ij ) is the WordNet probability of s ij ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-134",
"text": "The right hand side of Eq. 10 is calculated logarithmically under the assumption that words in C occur independently:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-135",
"text": "where w k is the k-th word in C and n is the number of words in C. In Eq. 11, we assume independence among the words in C. Probabilities in Eq. 11 are calculated as follows:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-136",
"text": "where f req(s ij , w k ) is the frequency that s ij and w k co-occur in a corpus, f req(s ij ) is the frequency of s ij in a corpus, which is the sum of frequencies of all relatives related to s ij ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-137",
"text": "CS means corpus size, which is the sum of frequencies of all words in a corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-138",
"text": "W Nf(s ij ) and W Nf(tw i ) are the frequencies of a s ij and tw i in WordNet, respectively, which represent bias of word senses."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-139",
"text": "Eq. 14 is the same with Eq. 7 in Section 3."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-140",
"text": "Since the training data are built by collecting example sentences of relatives in the previous works, the frequencies in Eq. 12 and 13 are calculated with our matrix as follows:"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-141",
"text": "where r l is a relative related to the sense s ij ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-142",
"text": "f req(r l , w k ) and f req(r l ) are the co-occurrence frequency between r l and w k and the frequency of r l , respectively, and both frequencies can be obtained by looking up the matrix since the matrix contains the frequencies of words and word pairs."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-143",
"text": "The main difference between [8] and [9] is whether ambiguous relatives are utilized or not."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-144",
"text": "Considering the difference, we implemented the method of [8] to include the ambiguous relatives into relatives, but the method of [9] to exclude the ambiguous relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-145",
"text": "Table 3 ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-146",
"text": "Comparison results with top 3 systems at SENSEVAL S2 LS S2 ALL S3 ALL [15] 40.2% 56.9% ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-147",
"text": "[16] 29.3% 45.1% ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-148",
"text": "[5] 24.4% 32.8% ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-149",
"text": "[17] . . 58.3% [18] . . 54.8% [19] . . 48.1% Our method 40.94% 45.12% 51.35% Table 2 shows the comparison results."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-150",
"text": "7 In the table, All Relatives and Unambiguous Relatives represent the results of the reimplemented methods of [8] and [9] , respectively."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-151",
"text": "It is observed in the table that our proposed method achieves better performance on all evaluation data than the previous methods though the improvement is not large."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-152",
"text": "Hence, we may have an idea that our method handles relatives and in particular ambiguous relatives more effectively than [8] and [9] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-153",
"text": "Compared with [9] , [8] obtains a better performance, and the difference between the performance of them are totally more than 15 % on all of the evaluation data."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-154",
"text": "From the comparison results, it is desirable to utilize ambiguous relatives as well as unambiguous relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-155",
"text": "[10] evaluated their method on nouns of lexical sample task of SENSEVAL-2."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-156",
"text": "Their method achieved 49.8% recall."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-157",
"text": "When evaluated on the same nouns of the lexical sample task, our proposed method achieved 47.26%, and the method of [8] 45.61%, and the method of [9] 38.03%."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-158",
"text": "Compared with our implementations, [10] utilized a web as a raw corpus that is much larger than our raw corpus, and employed various kinds of features such as bigram, trigram, part-of-speeches, etc."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-159",
"text": "8 Therefore, it can be conjectured that a size of a raw corpus and features play an important role in the performance."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-160",
"text": "We can observe that in our implementation of the method of [9] , the data sparseness problem is very serious since unambiguous relatives are usually not frequent in the raw corpus."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-161",
"text": "In the web, the problem seems to be alleviated."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-162",
"text": "Further studies are required for the effects of various features."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-163",
"text": "Comparison with Systems Participated in SENSEVAL."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-164",
"text": "We also compared our method with the top systems at SENSEVAL that did not use sense tagged corpora."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-165",
"text": "9 Table 3 shows the official results of the top 3 participating systems at SENSEVAL-2 & 3 and experimental performance of our method."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-166",
"text": "In the table, it is observed that our method is ranked in top 3 systems."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-168",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-169",
"text": "We have proposed a simple and novel method that determines senses of all contents words in sentences by selecting a relative of the target words in WordNet."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-170",
"text": "The relative is selected by using a co-occurrence frequency between the relative and the words surrounding the target word in a given sentence."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-171",
"text": "The cooccurrence frequencies are obtained from a raw corpus, not from a sense tagged corpus that is often required by other approaches."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-172",
"text": "We tested the proposed method on SemCor data and SENSEVAL data, which are publicly available."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-173",
"text": "The experimental results show that the proposed method effectively disambiguates many ambiguous words in SemCor and in test data for SENSEVAL all words task, as well as a small number of ambiguous words in test data for SENSEVAL lexical sample task."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-174",
"text": "Also our method more correctly disambiguates senses than [8] and [9] ."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-175",
"text": "Furthermore, the proposed method achieved comparable performance with the top 3 ranked systems at SENSEVAL-2 & 3."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-176",
"text": "In consequence, our method has two advantages over the previous methods ( [8] and [9] ): our method 1) handles the ambiguous relatives and unambiguous relatives more effectively, and 2) utilizes only one co-occurrence matrix for disambiguating all contents words instead of collecting training data of the content words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-177",
"text": "However, our method did not achieve good performances."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-178",
"text": "One reason of the low performance is on the relatives irrelevant to the target words."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-179",
"text": "That is, investigation of several instances which assign to incorrect senses shows that relatives irrelevant to the target words are often selected as the most probable relatives."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-180",
"text": "Hence, we will try to devise a filtering method that filters out the useless relatives before the relative selection phase."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-181",
"text": "Also we will plan to investigate a large number of tagged instances in order to find out why our method did not achieve much better performance than the previous works and to detect how our method selects the correct relatives more precisely."
},
{
"sent_id": "d6d8f08147e45acc0a61692abb37a9-C001-182",
"text": "Finally, we will conduct experiments with various features such as bigrams, trigrams, POSs, etc, which [10] considered and examine a relationship of a size of a raw corpus and a system performance."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d6d8f08147e45acc0a61692abb37a9-C001-23"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-37"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-39"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-42"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-48"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-50"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-124"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-143"
]
],
"cite_sentences": [
"d6d8f08147e45acc0a61692abb37a9-C001-23",
"d6d8f08147e45acc0a61692abb37a9-C001-37",
"d6d8f08147e45acc0a61692abb37a9-C001-39",
"d6d8f08147e45acc0a61692abb37a9-C001-42",
"d6d8f08147e45acc0a61692abb37a9-C001-48",
"d6d8f08147e45acc0a61692abb37a9-C001-50",
"d6d8f08147e45acc0a61692abb37a9-C001-124",
"d6d8f08147e45acc0a61692abb37a9-C001-143"
]
},
"@SIM@": {
"gold_contexts": [
[
"d6d8f08147e45acc0a61692abb37a9-C001-23",
"d6d8f08147e45acc0a61692abb37a9-C001-24",
"d6d8f08147e45acc0a61692abb37a9-C001-25"
]
],
"cite_sentences": [
"d6d8f08147e45acc0a61692abb37a9-C001-23",
"d6d8f08147e45acc0a61692abb37a9-C001-25"
]
},
"@DIF@": {
"gold_contexts": [
[
"d6d8f08147e45acc0a61692abb37a9-C001-61"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-150",
"d6d8f08147e45acc0a61692abb37a9-C001-151"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-152",
"d6d8f08147e45acc0a61692abb37a9-C001-153"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-157"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-173",
"d6d8f08147e45acc0a61692abb37a9-C001-174"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-176"
]
],
"cite_sentences": [
"d6d8f08147e45acc0a61692abb37a9-C001-61",
"d6d8f08147e45acc0a61692abb37a9-C001-150",
"d6d8f08147e45acc0a61692abb37a9-C001-152",
"d6d8f08147e45acc0a61692abb37a9-C001-153",
"d6d8f08147e45acc0a61692abb37a9-C001-157",
"d6d8f08147e45acc0a61692abb37a9-C001-174",
"d6d8f08147e45acc0a61692abb37a9-C001-176"
]
},
"@USE@": {
"gold_contexts": [
[
"d6d8f08147e45acc0a61692abb37a9-C001-124",
"d6d8f08147e45acc0a61692abb37a9-C001-125"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-128"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-143",
"d6d8f08147e45acc0a61692abb37a9-C001-144"
],
[
"d6d8f08147e45acc0a61692abb37a9-C001-150"
]
],
"cite_sentences": [
"d6d8f08147e45acc0a61692abb37a9-C001-124",
"d6d8f08147e45acc0a61692abb37a9-C001-128",
"d6d8f08147e45acc0a61692abb37a9-C001-143",
"d6d8f08147e45acc0a61692abb37a9-C001-144",
"d6d8f08147e45acc0a61692abb37a9-C001-150"
]
}
}
},
"ABC_6b11cfba6ee73c1f67941cf73506be_3": {
"x": [
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-2",
"text": "We describe DCU's LFG dependencybased metric submitted to the shared evaluation task of WMT-MetricsMATR 2010."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-3",
"text": "The metric is built on the LFG F-structurebased approach presented in (Owczarzak et al., 2007) ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-4",
"text": "We explore the following improvements on the original metric: 1) we replace the in-house LFG parser with an open source dependency parser that directly parses strings into LFG dependencies; 2) we add a stemming module and unigram paraphrases to strengthen the aligner; 3) we introduce a chunk penalty following the practice of METEOR to reward continuous matches; and 4) we introduce and tune parameters to maximize the correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-5",
"text": "Experiments show that these enhancements improve the dependency-based metric's correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-8",
"text": "String-based automatic evaluation metrics such as BLEU (Papineni et al., 2002) have led directly to quality improvements in machine translation (MT)."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-9",
"text": "These metrics provide an alternative to expensive human evaluations, and enable tuning of MT systems based on automatic evaluation results."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-10",
"text": "However, there is widespread recognition in the MT community that string-based metrics are not discriminative enough to reflect the translation quality of today's MT systems, many of which have gone beyond pure string-based approaches (cf. (Callison-Burch et al., 2006) )."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-11",
"text": "With that in mind, a number of researchers have come up with metrics which incorporate more sophisticated and linguistically motivated resources."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-12",
"text": "Examples include METEOR (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009 ) and TERP , both of which now utilize stemming, WordNet and paraphrase information."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-13",
"text": "Experimental and evaluation campaign results have shown that these metrics can obtain better correlation with human judgements than metrics that only use surface-level information."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-14",
"text": "Given that many of today's MT systems incorporate some kind of syntactic information, it was perhaps natural to use syntax in automatic MT evaluation as well."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-15",
"text": "This direction was first explored by (Liu and Gildea, 2005) , who used syntactic structure and dependency information to go beyond the surface level matching."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-16",
"text": "Owczarzak et al. (2007) extended this line of research with the use of a term-based encoding of Lexical Functional Grammar (LFG: (Kaplan and Bresnan, 1982) ) labelled dependency graphs into unordered sets of dependency triples, and calculating precision, recall, and F-score on the triple sets corresponding to the translation and reference sentences."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-17",
"text": "With the addition of partial matching and n-best parses, Owczarzak et al. (2007) 's method considerably outperforms Liu and Gildea's (2005) w.r.t."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-18",
"text": "correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-19",
"text": "The EDPM metric (Kahn et al., 2010) improves this line of research by using arc labels derived from a Probabilistic Context-Free Grammar (PCFG) parse to replace the LFG labels, showing that a PCFG parser is sufficient for preprocessing, compared to a dependency parser in (Liu and Gildea, 2005) and (Owczarzak et al., 2007) ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-20",
"text": "EDPM also incorporates more information sources: e.g. the parser confidence, the Porter stemmer, WordNet synonyms and paraphrases."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-21",
"text": "Besides the metrics that rely solely on the dependency structures, information from the dependency parser is a component of some other metrics that use more diverse resources, such as the textual entailment-based metric of (Pado et al., 2009) ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-22",
"text": "In this paper we extend the work of (Owczarzak et al., 2007) in a different manner: we use an adapted version of the Malt parser (Nivre et al., 2006) to produce 1-best LFG dependencies and allow triple matches where the dependency labels are different."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-23",
"text": "We incorporate stemming, synonym and paraphrase information as in (Kahn et al., 2010) , and at the same time introduce a chunk penalty in the spirit of METEOR to penalize discontinuous matches."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-24",
"text": "We sort the matches according to the match level and the dependency type, and weight the matches to maximize correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-25",
"text": "The remainder of the paper is organized as follows."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-26",
"text": "Section 2 reviews the dependency-based metric."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-27",
"text": "Sections 3, 4, 5 and 6 introduce our improvements on this metric."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-28",
"text": "We report experimental results in Section 7 and conclude in Section 8."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-30",
"text": "**THE DEPENDENCY-BASED METRIC**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-31",
"text": "In this section, we briefly review the metric presented in (Owczarzak et al., 2007) ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-33",
"text": "**C-STRUCTURE AND F-STRUCTURE IN LFG**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-34",
"text": "In Lexical Functional Grammar (Kaplan and Bresnan, 1982 ), a sentence is represented as both a hierarchical c-(onstituent) structure which captures the phrasal organization of a sentence, and a f-(unctional) structure which captures the functional relations between different parts of the sentence."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-35",
"text": "Our metric currently only relies on the f-structure, which is encoded as labeled dependencies in our metric."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-37",
"text": "**MT EVALUATION AS DEPENDENCY TRIPLE MATCHING**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-38",
"text": "The basic method of (Owczarzak et al., 2007) can be illustrated by the example in Table 1 ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-39",
"text": "The metric in (Owczarzak et al., 2007) performs triple matching over the Hyp-and Ref-Triples and calculates the metric score using the F-score of matching precision and recall."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-40",
"text": "Let m be the number of matches, h be the number of triples in the hypothesis and e be the number of triples in the reference."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-41",
"text": "Then we have the matching precision P = m/h and recall R = m/e."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-42",
"text": "The score of the hypothesis in (Owczarzak et al., 2007) is the Fscore based on the precision and recall of matching as in (1): Owczarzak et al., 2007) uses several techniques to facilitate triple matching."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-43",
"text": "First of all, considering that the MT-generated hypotheses have variable quality and are sometimes ungrammatical, the metric will search the 50-best parses of both the hypothesis and reference and use the pair that has the highest F-score to compensate for parser noise."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-44",
"text": "Secondly, the metric performs complete or partial matching according to the dependency labels, so the metric will find more matches on dependency structures that are presumably more informative."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-45",
"text": "More specifically, for all except the LFG Predicate-Only labeled triples of the form dep(head, modifier), the method does not allow a match if the dependency labels (deps) are different, thus enforcing a complete match."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-46",
"text": "For the Predicate-Only dependencies, partial matching is allowed: i.e. two triples are considered identical even if only the head or the modifier are the same."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-47",
"text": "Finally, the metric also uses linguistic resources for better coverage."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-48",
"text": "Besides using WordNet synonyms, the method also uses the lemmatized output of the LFG parser, which is equivalent to using an English lemmatizer."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-49",
"text": "If we do not consider these additional linguistic resources, the metric would find the following matches in the example in Table 1 : adjunct(talks, in), obj(in, egypt) and adjunct(week, next), as these three triples appear both in the reference and in the hypothesis."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-51",
"text": "**POINTS FOR IMPROVEMENT**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-52",
"text": "We see several points for improvement from Table 1 and the analysis above."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-53",
"text": "\u2022 More linguistic resources: we can use more linguistic resources than WordNet in pursuit of better coverage."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-54",
"text": "\u2022 Using the 1-best parse instead of 50-best parses: the parsing model we currently use does not produce k-best parses and using only the 1-best parse significantly improves the speed of triple matching."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-55",
"text": "We allow 'soft' triple matches to capture the triple matches which we might otherwise miss using the 1-best parse."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-56",
"text": "\u2022 Rewarding continuous matches: it would be more desirable to reflect the fact that the 3 matching triples adjunct(talks, in), obj(in, egypt) and adjunct(week, next) are continuous in Table 1 ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-57",
"text": "We introduce our improvements to the metric in response to these observations in the following sections."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-58",
"text": "3 Producing and Matching LFG Dependency Triples"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-60",
"text": "**THE LFG PARSER**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-61",
"text": "The metric described in (Owczarzak et al., 2007) uses the DCU LFG parser (Cahill et al., 2004) to produce LFG dependency triples."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-62",
"text": "The parser uses a Penn treebank-trained parser to produce c-structures (constituency trees) and an LFG fstructure annotation algorithm on the c-structure to obtain f-structures."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-63",
"text": "In (Owczarzak et al., 2007) , triple matching on f-structures produced by this paradigm correlates well with human judgement, but this paradigm is not adequate for the WMTMetricsMatr evaluation in two respects: 1) the inhouse LFG annotation algorithm is not publicly available and 2) the speed of this paradigm is not satisfactory."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-64",
"text": "We instead use the Malt Parser 1 (Nivre et al., 2006 ) with a parsing model trained on LFG dependencies to produce the f-structure triples."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-65",
"text": "Our collaborators 2 first apply the LFG annotation algorithm to the Penn Treebank training data to obtain f-structures, and then the f-structures are converted into dependency trees in CoNLL format to train the parsing model."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-66",
"text": "We use the liblinear (Fan et al., 2008) classification module to for fast parsing speed."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-68",
"text": "**HARD AND SOFT DEPENDENCY MATCHING**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-69",
"text": "Currently our parser produces only the 1-best outputs."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-70",
"text": "Compared to the 50-best parses in (Owczarzak et al., 2007) , the 1-best parse limits the number of triple matches that can be found."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-71",
"text": "To compensate for this, we allow triple matches that have the same Head and Modifier to constitute a match, even if their dependency labels are different."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-72",
"text": "Therefore for triples Dep1(Head1, Mod1) and Dep2(Head2, Mod2), we allow three types of match: a complete match if the two triples are identical, a partial match if Dep1=Dep2 and Head1=Head2, and a soft match if Head1=Head2 and Mod1=Mod2."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-74",
"text": "**CAPTURING VARIATIONS IN LANGUAGE**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-75",
"text": "In (Owczarzak et al., 2007) , lexical variations at the word-level are captured by WordNet."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-76",
"text": "We use a Porter stemmer and a unigram paraphrase database to allow more lexical variations."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-77",
"text": "With these two resources combined, there are four stages of word level matching in our system: exact match, stem match, WordNet match and unigram paraphrase match."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-78",
"text": "The stemming module uses Porter's stemmer implementation 3 and the WordNet module uses the JAWS WordNet interface."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-79",
"text": "4 Our metric only considers unigram paraphrases, which are extracted from the paraphrase database in TERP 5 using the script in the ME-TEOR 6 metric."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-80",
"text": "The metric described in (Owczarzak et al., 2007) does not explicitly consider word order and fluency."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-81",
"text": "METEOR, on the other hand, utilizes this information through a chunk penalty."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-82",
"text": "We introduce a chunk penalty to our dependency-based metric following METEOR's string-based approach."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-83",
"text": "Given a reference r = w r1 ...w rn , we denote w ri as 'covered' if it is the head or modifier of a matched triple."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-84",
"text": "We only consider the w ri s that appear as head or modifier in the reference triples."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-85",
"text": "After this notation, we follow METEOR's approach by counting the number of chunks in the reference string, where a chunk w rj ...w rk is a sequence of adjacent covered words in the reference."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-86",
"text": "Using the hypothesis and reference in Table 1 as an example, the three matched triples adjunct(talks, in), obj(in, egypt) and adjunct(week, next) will cover a continuous word sequence in the reference (underlined), constituting one single chunk:"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-87",
"text": "rice to hold talks (in) egypt next week Based on this observation, we introduce a similar chunk penalty P en as in METEOR in our metric, as in 2:"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-88",
"text": "where \u03b2 and \u03b3 are free parameters, which we tune in Section 6.2."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-89",
"text": "We add this penalty to the dependency based metric (cf."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-90",
"text": "Eq. (1)), as in Eq. (3)."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-91",
"text": "6 Parameter Tuning"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-93",
"text": "**PARAMETERS OF THE METRIC**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-94",
"text": "In our metric, dependency triple matches can be categorized according to many criteria."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-95",
"text": "We assume that some matches are more critical than others and encode the importance of matches by weighting them differently."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-96",
"text": "The final match will be the sum of weighted matches, as in (4):"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-97",
"text": "where \u03bb t and m t are the weight and number of match category t. We categorize a triple match according to three perspectives: 1) the level of match L={complete, partial}; 2) the linguistic resource used in matching R={exact, stem, WordNet, paraphrase}; and 3) the type of dependency D. To avoid too large a number of parameters, we only allow a set of frequent dependency types, along with the type other, which represents all the other types and the type soft for soft matches."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-98",
"text": "We have D={app, subj, obj, poss, adjunct, topicrel, other, soft}. Therefore for each triple match m, we can have the type of the match t \u2208 L \u00d7 R \u00d7 D."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-100",
"text": "**TUNING**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-101",
"text": "In sum, we have the following parameters to tune in our metric: precision weight \u03b1, chunk penalty parameters \u03b2, \u03b3, and the match type weights \u03bb 1 ...\u03bb n ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-102",
"text": "We perform Powell's line search (Press et al., 2007) on the sufficient statistics of our metric to find the set of parameters that maximizes Pearson's \u03c1 on the segment level."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-103",
"text": "We perform the optimization on the MT06 portion of the NIST MetricsMATR 2010 development set with 2-fold cross validation."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-105",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-106",
"text": "We experiment with four settings of the metric: HARD, SOFT, SOFTALL and WEIGHTED in order to validate our enhancements."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-107",
"text": "The first two settings compare the effect of allowing/not allowing soft matches, but only uses WordNet as in (Owczarzak et al., 2007) ."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-108",
"text": "The third setting applies our additional linguistic features and the final setting tunes parameter weights for higher correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-109",
"text": "We report Pearson's r, Spearman's \u03c1 and Kendall's \u03c4 on segment and system levels on the NIST MetricsMATR 2010 development set using Snover's scoring tool."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-110",
"text": "7 Table 2 shows that allowing soft triple matches and using more linguistic features all lead to higher correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-111",
"text": "Though the parameters might somehow overfit on the data set even if we apply cross validation, this certainly confirms the necessity of weighing dependency matches according to their types."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-112",
"text": "Table 3 , the trend is very similar to that of the segment level."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-113",
"text": "The improvements we introduce all lead to improvements in correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-115",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-116",
"text": "In this paper we describe DCU's dependencybased MT evaluation metric submitted to WMTMetricsMATR 2010."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-117",
"text": "Building upon the LFGbased metric described in (Owczarzak et al., 2007) , we use a publicly available parser instead of an in-house parser to produce dependency labels, so that the metric can run on a third party machine."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-118",
"text": "We improve the metric by allowing more lexical variations and weighting dependency triple matches depending on their importance according to correlation with human judgement."
},
{
"sent_id": "6b11cfba6ee73c1f67941cf73506be-C001-119",
"text": "For future work, we hope to apply this method to languages other than English, and perform more refinement on dependency type labels and linguistic resources."
}
],
"y": {
"@EXT@": {
"gold_contexts": [
[
"6b11cfba6ee73c1f67941cf73506be-C001-3"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-22"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-75",
"6b11cfba6ee73c1f67941cf73506be-C001-76"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-117"
]
],
"cite_sentences": [
"6b11cfba6ee73c1f67941cf73506be-C001-3",
"6b11cfba6ee73c1f67941cf73506be-C001-22",
"6b11cfba6ee73c1f67941cf73506be-C001-75",
"6b11cfba6ee73c1f67941cf73506be-C001-117"
]
},
"@BACK@": {
"gold_contexts": [
[
"6b11cfba6ee73c1f67941cf73506be-C001-17"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-19"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-39"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-42"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-61"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-63"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-75"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-80"
]
],
"cite_sentences": [
"6b11cfba6ee73c1f67941cf73506be-C001-17",
"6b11cfba6ee73c1f67941cf73506be-C001-19",
"6b11cfba6ee73c1f67941cf73506be-C001-39",
"6b11cfba6ee73c1f67941cf73506be-C001-42",
"6b11cfba6ee73c1f67941cf73506be-C001-61",
"6b11cfba6ee73c1f67941cf73506be-C001-63",
"6b11cfba6ee73c1f67941cf73506be-C001-75",
"6b11cfba6ee73c1f67941cf73506be-C001-80"
]
},
"@USE@": {
"gold_contexts": [
[
"6b11cfba6ee73c1f67941cf73506be-C001-31"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-38"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-107"
]
],
"cite_sentences": [
"6b11cfba6ee73c1f67941cf73506be-C001-31",
"6b11cfba6ee73c1f67941cf73506be-C001-38",
"6b11cfba6ee73c1f67941cf73506be-C001-107"
]
},
"@DIF@": {
"gold_contexts": [
[
"6b11cfba6ee73c1f67941cf73506be-C001-63",
"6b11cfba6ee73c1f67941cf73506be-C001-64"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-69",
"6b11cfba6ee73c1f67941cf73506be-C001-70"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-80",
"6b11cfba6ee73c1f67941cf73506be-C001-81",
"6b11cfba6ee73c1f67941cf73506be-C001-82"
],
[
"6b11cfba6ee73c1f67941cf73506be-C001-117",
"6b11cfba6ee73c1f67941cf73506be-C001-118"
]
],
"cite_sentences": [
"6b11cfba6ee73c1f67941cf73506be-C001-63",
"6b11cfba6ee73c1f67941cf73506be-C001-70",
"6b11cfba6ee73c1f67941cf73506be-C001-80",
"6b11cfba6ee73c1f67941cf73506be-C001-117"
]
}
}
},
"ABC_48add0c1226863808c3c3a8c29a12e_3": {
"x": [
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-2",
"text": "ABSTRACT"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-3",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-4",
"text": "**INTRODUCTION**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-5",
"text": "Automatic language identification of an input text is an important task in Natural Language Processing (NLP), especially when processing speech or social media messages."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-6",
"text": "Besides, it constitutes the first stage of many NLP pipelines."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-7",
"text": "Before applying tools trained on specific languages, one must determine the language of the text."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-8",
"text": "It has attracted considerable attention in recent years [1, 2, 3, 4, 5, 6, 7, 8] ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-9",
"text": "Most of the existing approaches take words as features, and then adopt effective supervised classification algorithms to solve the problem."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-10",
"text": "Generally speaking, language identification between different languages is a task that can be solved at a high accuracy."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-11",
"text": "For example, Simoes et al. [9] achieved 97% accuracy for discriminating among 25 unrelated languages."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-12",
"text": "However, it is generally difficult to distinguish between related languages or variations of a specific language (see [9] and [10] for example)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-13",
"text": "To deal with this problem, Huang and Lee [3] proposed a contrastive approach based on documentlevel top-bag-of-word similarity to reflect distances among the three varieties of Mandarin in China, Taiwan and Singapore, which is a kind of word-level uni-gram feature."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-14",
"text": "The word unigram feature is sufficient for document-level identification of language variants."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-15",
"text": "More recent studies focus on sentence-level languages identification, such as the Discriminating between Similar Languages (DSL) shared task 2014 and 2015 [7, 8] ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-16",
"text": "The best system of these shard tasks shows that the uni-gram is an effective feature."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-17",
"text": "For the sentence-level language identification, you are given a single sentence, and you need to identify the language."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-18",
"text": "Chinese is spoken in different regions, with noticeable differences between regions."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-19",
"text": "The first difference is the character set used."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-20",
"text": "For example, Mainland China and Singapore adopt simplified character form, while Taiwan and Hong Kong use complex/traditional character form, as shown in the following two examples."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-21",
"text": "We observe furthermore that the same meaning can be expressed using different linguistic expressions in the Mainland China, Hong Kong and Taiwan variety of Mandarin Chinese."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-22",
"text": "Table 1 lists some examples."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-23",
"text": "As a result, the words distribution should be different in the Chinese variants spoken in the Mainland China, Hong Kong, Taiwan, Macao, Malaysia and Singapore variety or dialect 1 of Mandarin Chinese, a.k.a., the Greater China Region (GCR)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-24",
"text": "Therefore, we can extract the different fine-grained representative words (n-gram with n \u2264 3; lines 2-5 in Table 1 ) for the GCR respectively using Pointwise Mutual Information (PMI) in order to reflect the correlation between the words and their ascribed language varieties."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-25",
"text": "Compared with English, no space exists between words in Chinese sentence."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-26",
"text": "Due to the Chinese word segmentation issue, some representative words for the GCR cannot be extracted using PMI (lines 6-8 in Table 1 )."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-27",
"text": "To expand these representative words for each dialect, we extract more coarse-grained (n-gram with n \u2265 4) words using a word alignment technology, and then propose word alignment-based feature (dictionary for each dialect with n-gram under n \u2265 4)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-28",
"text": "In fact, the word alignment-based dictionary can extract both fine-grained representative words and coarse-grained words simultaneously."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-29",
"text": "The above observation indicates that character form, PMI-based and word alignment-based information are useful information to discriminate dialects in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-30",
"text": "In order to investigate the detailed characteristics of different dialects of Mandarin Chinese, we extend 3 dialects in Huang and Lee [3] to 6 dialects."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-31",
"text": "In fact, the more dialects there are, the more difficult the dialects discrimination becomes."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-32",
"text": "It also has been verified through our experiments."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-33",
"text": "Very often, texts written in a character set are converted to another character set, in particular on the Web."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-34",
"text": "This makes the character form feature unusable."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-35",
"text": "In order to detect dialects for those texts, we convert texts in traditional characters to simplified characters in order to investigate the effectiveness of linguistic and textual features alone."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-36",
"text": "Due to these characteristic of Chinese, current methods do not work for the specific GCR dialects."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-37",
"text": "Evaluation results on our two different 15,000 sentence-level news dataset and 18,000 sentencelevel open-domain dataset from Wikipedia show that bi-gram, character form, PMI-based and word alignment-based features significantly outperform the traditional baseline systems using character and word uni-grams."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-38",
"text": "The main contributions of this paper are as follows:"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-39",
"text": "(1) We find character-level bi-gram and word segmentation based features work better than traditional character-level uni-gram feature in the dialects discrimination for the GCR; (2) Some features such as character form, PMI-based and word alignment-based features can improve the dialects identification performance for the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-40",
"text": "The remainder of the paper is organized as follows."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-41",
"text": "Section 2 presents the state-of-the-art approaches in the language identification field."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-42",
"text": "Section 3 describes the main features used in this paper."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-43",
"text": "The dataset collection and experiment results are shown in Section 4, and we conclude our paper in Section 5."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-45",
"text": "**RELATED WORK**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-46",
"text": "A number of studies on identification of similar languages and language varieties have been carride out."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-47",
"text": "For example, Murthy and Kumar [1] focused on Indian languages identification."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-48",
"text": "Meanwhile, Ranaivo-Malancon [2] proposed features based on frequencies of character n-grams to identify Malay and Indonesian."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-49",
"text": "Huang and Lee [3] presented the top-bag-of-word similarity based contrastive approach to reflect distances among the three varieties of Mandarin in Mainland China, Taiwan and Singapore."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-50",
"text": "Zampieri and Gebre [4] found that word uni-grams gave very similar performance to character n-gram features in the framework of the probabilistic language model for the Brazilian and European Portuguese language discrimination."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-51",
"text": "Tiedemann and Ljubesic [5] ; Ljubesic and Kranjcic [6] showed that the Na\u00efve Bayes classifier with uni-grams achieved high accuracy for the South Slavic languages identification."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-52",
"text": "Grefenstette [11] ; Lui and Cook [12] found that bag-of-words features outperformed the syntax or character sequencesbased features for the English varieties."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-53",
"text": "Besides these works, other recent studies include: Spanish varieties identification [13] , Arabic varieties discrimination [14, 15, 16, 17] , and Persian and Dari identification [18] ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-54",
"text": "Among the above related works, study [3] is the most related work to ours."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-55",
"text": "The differences between study [3] and our work are two-fold:"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-56",
"text": "(1)They focus on document-level varieties of Mandarin in China, Taiwan and Singapore, while we deal with sentence-level varieties of Mandarin in China, Hong Kong, Taiwan, Macao, Malaysia and Singapore."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-57",
"text": "In order to investigate the detailed characteristic of different dialects of Mandarin Chinese, we extend dialects in Huang and Lee [3] to 6 dialects."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-58",
"text": "Also, the more dialects there are, the more difficult the dialects discrimination becomes."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-59",
"text": "(2)The top-bag-of-word they proposed in Huang and Lee [3] is word uni-gram feature essentially."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-60",
"text": "While in this paper, besides the traditional uni-gram feature, we propose some novel features, such as character form, PMI-based and word alignment-based features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-61",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-62",
"text": "**DIALECTS CLASSIFICATION MODELS**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-63",
"text": "In this section, we recast the sentence-level dialects identification in the GCR as a multiclass classification problem."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-64",
"text": "Below we will describe some common features in the general language (unrelated languages or different languages) identification as well as some novel features such as character form, PMI-based and word alignment-based features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-65",
"text": "These features are fed into a classifier to determine the dialect of a sentence."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-67",
"text": "**CHARACTER-LEVEL FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-68",
"text": "In this section, we represent the N-gram features and character form features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-70",
"text": "**N-GRAM FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-71",
"text": "According to the related works [4, 5, 6] , word uni-grams are effective features for discriminating general languages."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-72",
"text": "Compared with English, no space exists between words in Chinese sentence."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-73",
"text": "Therefore, we use character uni-grams, bi-grams and tri-grams as features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-74",
"text": "However, Huang and Lee [3] did not use character-level n-grams."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-76",
"text": "**CHARACTER FORM FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-77",
"text": "Due to various historical reasons, there are many different linguistic phenomena and expression variances among the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-78",
"text": "As mentioned earlier, Mainland China, Malaysia and Singapore adopt the simplified character form, while Hong Kong, Taiwan and Macao use the complex/traditional character form."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-79",
"text": "This kind of information is very helpful to identify sentence-level dialects in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-80",
"text": "Motivated by the above observations, we first construct a complex/traditional Chinese dictionary with 626 characters crawled from the URL."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-81",
"text": "2 These characters have been simplified."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-82",
"text": "Thus they make the strongest differences between two character sets."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-83",
"text": "Then we generate the character form based feature as a Boolean variable to detect whether the Chinese sentence contain any word in the traditional dictionary."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-84",
"text": "Table 2 lists some complex/traditional character examples."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-85",
"text": "Generally, we can know that the complex character form occurs mostly in the strict genre i.e. news text."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-86",
"text": "Thus, this kind of information is useful to discriminate dialects in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-87",
"text": "In this section, we represent the word segmentation, PMI and word alignment features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-89",
"text": "**WORD SEGMENTATION FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-90",
"text": "Chinese word segmentation [19] is a vital pre-processing step before Chinese information processing."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-91",
"text": "As we mentioned earlier, different words may be used to express the same meaning in different dialects."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-92",
"text": "Therefore, words are also useful features for dialect detection."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-93",
"text": "These features have been successfully used in Huang and Lee."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-94",
"text": "Thus, we firstly conduct word segmentation using ICTCLAS 3 Chinese word segmentation package which can handle both simplified and traditional/complex characters for the Chinese sentence, and then extract each word uni-gram to generate word segmentation feature vector."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-96",
"text": "**PMI FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-97",
"text": "Once a sentence is segmented into words, we adopt Pointwise Mutual Information (PMI) to determine the relationship between the words and theirs ascribed language varieties."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-98",
"text": "PMI for a word only used in a dialect will be high, while the one used in all the dialects will be low."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-99",
"text": "Specifically, we calculate the relationship between words and theirs dialect type by Equation (1) as follows:"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-100",
"text": "where w i indicates any word in the corpus and I j a dialect."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-101",
"text": "p(w i ) stands for the ratio of the frequency of a word in the corpus to the total number of words, p(I j ) means the ratio of the frequency of words in the documents using dialect j to the total number of words in the corpus, p(w i ,I j ) indicates for the account of the frequency of the word i occurs in the documents using dialect j and the total number of words in the corpus."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-102",
"text": "Then, we can generate different word set for each dialect of the GCR, and yield PMI-based feature according to the word set."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-103",
"text": "For example, if a word in a sentence occurs among in the word set of Mainland China, Taiwan and Singapore, thus the value of PMI-based feature is MC_TW_SGP (MC stands for Mainland China, TW refers to Taiwan, and SGP means Singapore)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-104",
"text": "In fact, we can take the PMI-based feature as a way of weighting the word segmentation-based features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-106",
"text": "**WORD ALIGNMENT FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-107",
"text": "As mentioned earlier, for a single semantic meaning, various linguistic expressions exist in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-108",
"text": "Then, how to align different coarse-grained expressions (n-gram with n \u2265 4; lines 6-8 in Table1) of the same meaning for each dialect is a vital problem."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-109",
"text": "We can generate dictionary for each dialect as an expansion of PMI-based word set."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-110",
"text": "Intuitively, according to word alignment problem in the machine translation, given a source sentence e consisting of words e 1 , e 2 ,\u2026, e l and a target sentence f consisting of words f 1 , f 2 ,\u2026, f m , we need to infer an alignment a, a sequence of indices a 1 , a 2 ,\u2026, a m which indicates the corresponding source word e ai or a null word."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-111",
"text": "Therefore, we can recast the coarse-grained expressions extraction in the GCR as a word alignment problem in the statistic machine translation."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-112",
"text": "Specifically, we firstly crawl about 30 million parallel sentence pairs between Mainland China and each other dialect in the GCR from parallel texts in the news and Wikipedia website."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-113",
"text": "The corpus collected for the word alignment features different from the test data as described in the subsection 4.1, which are not just the same texts converted from traditional characters to simplified characters, or vice versa, and then extract the word alignment using GIZA++ [20] ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-114",
"text": "After removing the Longest Common Subsequence (LCS) ( [21] ) of words, we can extract the different linguistic expressions mapping between Mainland China and each other dialect in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-115",
"text": "Then, we generate about 12,374 parallel word set for each dialect of the GCR, and yield word alignment-based feature according to the word set."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-116",
"text": "For example, if a word occurs in the word set of Mainland China, Singapore and Malaysia, then we set the value of word alignmentbased feature as MC_SGP_MAL (MC stands for Mainland China, SGP means Singapore, and MAL refers to Malaysia)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-117",
"text": "To be specific, Figure 1 shows an example to extract word alignment from two parallel sentences."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-118",
"text": "As shown, we can extract \""
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-119",
"text": "/ren ji jie mian/human interface\" and \"\u4eba\u673a\u4ecb\u9762/ren ji jie mian/human interface\" to generate dictionary for each dialect."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-121",
"text": "**CLASSIFIER**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-122",
"text": "After extracting the above proposed features, in order to do fair comparison with the baseline systems (Section 4.1), we train a single or combined multiclass linear kernel support vector machine using LIBLINEAR [22] with default parameters such as verbosity level with 1, trade-off between training error and margin with 0.01, slack rescaling, zero/one loss."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-123",
"text": "Also, due to large number of the parameters of SVM, we do not tune them on the development sets."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-124",
"text": "According to previous studies, SVM were well-suited to high-dimensional feature spaces; SVM has shown good performance in many other language identification work [10, 23] ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-125",
"text": "Therefore, we adopt SVM to discriminate sentence-level dialects for the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-126",
"text": "Besides, we trained maximum entropy and na\u00efve Bayes classifiers, but the results are much worse than SVM."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-127",
"text": "We also trained some other kernel function of SVM with polynomial, radial basis function, and sigmoid, but the linear kernel gets the best results."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-128",
"text": "Consequently, we only report the results with SVM using linear kernel function in the Section 4."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-130",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-131",
"text": "In this section, we first introduce the experimental settings, and then evaluate the performance of our proposed approach for identifying dialects in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-133",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-134",
"text": "Dataset: We crawl our sentence-level dialect data set from news websites 4 and Wikipedia using the jsoup 5 utility."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-135",
"text": "After removing the useless sentence, such as the English sentences, the English words account 50% in total and the sentences including less than 15 words, we obtained 27,679 news sentences in total (3,452 sentences for Macao, 5,437 sentences for Mainland China, 5,816 sentences for Hong Kong, 5,711 sentences for Malaysia, 4,672 sentences for Taiwan, and 2,591 sentences for Singapore) ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-136",
"text": "In order to balance these sentences for each dialect, we random select 2,500 sentences for each dialect in the GCR, thus we generate 15,000 sentences for the GCR in total."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-137",
"text": "Similarly, we also extract 18,000 Wikipedia sentences (6,000 sentences for each dialect, including Mainland China, Hong Kong and Taiwan in the GCR)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-138",
"text": "All sentences from the same website can be automatically annotated with a specific dialect type."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-139",
"text": "There are no duplicates in the dataset."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-140",
"text": "For evaluation, we adopt 5-cross validation for the two datasets."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-141",
"text": "For the news dataset, we generate three scenarios:"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-142",
"text": "(1) 6-way detection: The dialects of Mainland China, Hong Kong, Taiwan, Macao, Malaysia and Singapore are all considered; (2) 3-way detection: We detect dialects of Mainland China, Taiwan and Singapore as in Huang and Lee [3] ; (3) 2-way detection: We try to distinguish between two groups of dialects, the ones used in Mainland China, Malaysia and Singapore using simplified characters, and the ones used in Hong Kong, Taiwan and Macao using traditional characters."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-143",
"text": "For the Wikipedia dataset, we also generate two similar scenarios:"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-144",
"text": "(1) 3-way detection: We detect dialects of Mainland China, Hong Kong and Taiwan; (2) 2-way detection: We try to distinguish between two groups of dialects, the ones used in Mainland China using simplified characters, and the ones used in Hong Kong and Taiwan using traditional characters."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-145",
"text": "Baseline system 1: As mentioned in Section 2, we take the Huang and Lee [3] 's top-bag-of-word similarity-based approach as one of our baseline system."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-146",
"text": "We re-implement their method in this paper using the similar 3-way news dataset."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-148",
"text": "**BASELINE SYSTEM 2:**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-149",
"text": "Another baseline, word uni-gram based feature for English using SVM classifier, was proposed by Purver [10] , which have been verified effective in the DSL shared task 2015."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-151",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-152",
"text": "In this section, we report the experiment results for the dialects identification for the GCR on both news and Wikipedia dataset."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-153",
"text": "Table 3 shows the experimental results for the dialect identification in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-155",
"text": "**RESULTS ON NEWS DATASET**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-157",
"text": "**(1) SINGLE FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-158",
"text": "If we use a single type of feature, we can see that the uni-gram feature (baseline system 2) is not the best one for Chinese dialect detection in the GCR, although it has been found effective for English detection in previous studies in the DSL shared task."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-159",
"text": "Instead, bi-gram and word segmentation based features are better than uni-gram one."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-160",
"text": "Both of the proposed bi-gram and word segmentation based features significantly outperforms the baseline systems with p<0.01 using paired t-test for significance."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-161",
"text": "Also the bi-gram and word segmentation based features are better than the Huang and Lee [3] 's method (baseline system 1) for 6-way, 3-way and 2-way dialect identification in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-162",
"text": "Obviously, the random method does not work for the GCR dialect identification."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-164",
"text": "**(2)BI-GRAM VS TRI-GRAMS AND UNI-GRAM**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-165",
"text": "Besides, the performance of bi-gram outperforms tri-gram and uni-gram."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-166",
"text": "We explain this by the fact that there are much sparser in tri-grams than in bi-grams."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-167",
"text": "Another explanation is that most Chinese words are formed of two characters."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-168",
"text": "Bi-grams can better capture the meaningful words than tri-grams."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-169",
"text": "This observation is consistent with the observation in Chinese information retrieval: Nie et al. [24] found that character bi-grams work equally well to words for Chinese information retrieval."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-170",
"text": "We have the same observation in Table 3 (bi-gram vs. word segmentation)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-172",
"text": "**(3)LINGUISTIC AND ALIGNMENT FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-173",
"text": "According to Table 3 , we also observe that the character form features are useful for 2-way GCR dialect classification, which verifies our observation."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-174",
"text": "Also, the proposed word segment-based feature is effective for the 2-way dialect identification, which yields 98.42% accuracy."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-176",
"text": "**(4)COMBINED FEATURES**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-177",
"text": "For the combined features, we can know that the character form, PMI and word alignment based features can improve the language identification in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-178",
"text": "They can be successfully integrated into the effective bi-gram features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-179",
"text": "As shown in Table 3 , PMI-based feature can bring performance improvement by 1.6% for the 6-way dialect identification."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-180",
"text": "After integrating the novel 3 features together, we get the final best performance with 90.91% for 3-way dialects identification in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-181",
"text": "The combined features significantly outperform the bi-gram with p<0.01 using paired t-test for significance, which shows the effectiveness of our novel features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-182",
"text": "Table 3 ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-183",
"text": "Accuracy using different features on the news dataset."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-184",
"text": "Performance that is significantly superior to baseline systems (p<0.01, using paired t-test for significance) is denoted by *."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-185",
"text": "Features/Systems 6-way 3-way 2-way Baseline systems random 16.67 33.33 50.00 baseline system 1; Huang and Lee [3] 66.67 66.67 83.33 uni-gram (baseline system 2; Matthew Purver [10] 74 More specifically, the accuracy of our best feature (bi-gram + character form + PMI + word alignment) for each dialect identification in the GCR for the 6-way classification is reported in Table 4 ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-186",
"text": "As shown, we gain the best identification performance for Macao, while the accuracy of Malaysia is the worst one."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-187",
"text": "We observe much noise among the texts in Malaysia."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-188",
"text": "Most sentences in Malaysia has the English words account 10% in total."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-189",
"text": "Thus, how to crawl large numbers of sentences with high quality for a dialect will be one of our future works."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-190",
"text": "To be more specific, we list the confusion matrix for each dialect in the GCR in"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-192",
"text": "**RESULTS ON WIKIPEDIA DATASET**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-193",
"text": "As shown in Table 3 , character form based features are very effective (94.36% for 2-way dialects classification)."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-194",
"text": "Similar to Huang and Lee [3] 's work, in order to eliminate the trivial issue of character encoding (simplified and traditional character), we convert Taiwan and Hong Kong texts to the same simplified character set using Zhconvertor 6 utility to focus on actual linguistic and textual features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-195",
"text": "Table 6 shows the experimental results for the dialect identification in the GCR."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-196",
"text": "As shown, again, the bi-gram features work better than both uni-gram and tri-gram features on Wikipedia dataset."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-197",
"text": "Also, the word alignment-based features can contribute about 3.32% performance improvement."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-198",
"text": "This also confirms our intuition that the word alignment-based information is helpful to discriminate dialects in the GCR, which shows the effectiveness of both fine-grained and coarsegrained characteristic of word alignment based features."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-199",
"text": "In order to generate word alignment word sets, as mentioned in the Section 1, the sentences are parallel for Mainland China, Hong Kong and Taiwan."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-200",
"text": "After converting them to the same character set, the difference among them is subtle."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-201",
"text": "Therefore, both the PMI-based feature and character form feature will be invalid in this situation."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-202",
"text": "PMI depend so much on the use of a particular character set shows that it is correlated with other knowledge sources and it has been well defined."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-203",
"text": "It also shows that the dialect identification on parallel sentences with same character set for the GCR is a challenging task."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-204",
"text": "The reason why the performance on Wikipedia is much lower than on the news dataset is listed as follows."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-205",
"text": "The texts in the Wikipedia are not written in pure dialects (maybe as mostly translated from English) or topic information biased good results achieved for the news data set, i.e. topics discussed in different news make them different."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-206",
"text": "The word alignment based feature improve the performance about 3.32%, Also, the 2-way classification results shows the proposed bi-gram and word alignment based features are quite promising."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-207",
"text": "Table 6 ."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-208",
"text": "Accuracy using different features on the Wikipedia dataset."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-209",
"text": "Performance that is significantly superior to baseline systems (p<0.01, using paired t-test for significance) is denoted by *."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-210",
"text": "Features/Systems 3-way 2-way Baseline systems baseline system 1; Huang and Lee [3] 66.67 66.67 uni-gram (baseline system 2; Matthew Purver [10]"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-211",
"text": "----------------------------------"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-212",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-213",
"text": "In this paper, we study the problem of dialect identification for Chinese."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-214",
"text": "We found that the unigram is commonly used in the previous work, showing very good results for European languages."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-215",
"text": "Unlike the European languages, words in Chinese are not separated by spaces."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-216",
"text": "Therefore, a naive adaptation of the uni-gram features to Chinese character uni-gram, does not work well."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-217",
"text": "However, longer elements such as character bi-grams and segmented words work much better."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-218",
"text": "This indicates that such longer units are more meaningful in Chinese and can better reflect the characteristics of a dialect."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-219",
"text": "In addition, we also proposed new features based on PMI and word alignment."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-220",
"text": "These features are also shown useful for Chinese dialect identification."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-221",
"text": "In future work, we would like to explore more features, and test other classifiers."
},
{
"sent_id": "48add0c1226863808c3c3a8c29a12e-C001-222",
"text": "Furthermore, we will finally investigate how dialect identification can help other NLP tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"48add0c1226863808c3c3a8c29a12e-C001-8"
],
[
"48add0c1226863808c3c3a8c29a12e-C001-13"
],
[
"48add0c1226863808c3c3a8c29a12e-C001-49"
]
],
"cite_sentences": [
"48add0c1226863808c3c3a8c29a12e-C001-8",
"48add0c1226863808c3c3a8c29a12e-C001-13",
"48add0c1226863808c3c3a8c29a12e-C001-49"
]
},
"@EXT@": {
"gold_contexts": [
[
"48add0c1226863808c3c3a8c29a12e-C001-30"
],
[
"48add0c1226863808c3c3a8c29a12e-C001-57"
]
],
"cite_sentences": [
"48add0c1226863808c3c3a8c29a12e-C001-30",
"48add0c1226863808c3c3a8c29a12e-C001-57"
]
},
"@SIM@": {
"gold_contexts": [
[
"48add0c1226863808c3c3a8c29a12e-C001-54"
],
[
"48add0c1226863808c3c3a8c29a12e-C001-194"
]
],
"cite_sentences": [
"48add0c1226863808c3c3a8c29a12e-C001-54",
"48add0c1226863808c3c3a8c29a12e-C001-194"
]
},
"@DIF@": {
"gold_contexts": [
[
"48add0c1226863808c3c3a8c29a12e-C001-55"
],
[
"48add0c1226863808c3c3a8c29a12e-C001-73",
"48add0c1226863808c3c3a8c29a12e-C001-74"
],
[
"48add0c1226863808c3c3a8c29a12e-C001-161"
]
],
"cite_sentences": [
"48add0c1226863808c3c3a8c29a12e-C001-55",
"48add0c1226863808c3c3a8c29a12e-C001-74",
"48add0c1226863808c3c3a8c29a12e-C001-161"
]
},
"@USE@": {
"gold_contexts": [
[
"48add0c1226863808c3c3a8c29a12e-C001-145"
]
],
"cite_sentences": [
"48add0c1226863808c3c3a8c29a12e-C001-145"
]
}
}
},
"ABC_d53d1b53168041baea5b5002b46627_3": {
"x": [
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-2",
"text": "End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-3",
"text": "To further improve the quality, a two-pass model has been proposed to rescore streamed hypotheses using the nonstreaming Listen, Attend and Spell (LAS) model while maintaining a reasonable latency."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-4",
"text": "The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-5",
"text": "In this work, we propose to attend to both acoustics and first-pass hypotheses using a deliberation network."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-6",
"text": "A bidirectional encoder is used to extract context information from first-pass hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-7",
"text": "The proposed deliberation model achieves 12% relative WER reduction compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction on a proper noun test set."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-8",
"text": "Compared to a large conventional model, our best model performs 21% relatively better for VS."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-9",
"text": "In terms of computational complexity, the deliberation decoder has a larger size than the LAS decoder, and hence requires more computations in second-pass decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-12",
"text": "E2E ASR has gained a lot of popularity due to its simplicity in training and decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-13",
"text": "An all-neural E2E model eliminates the need to individually train components of a conventional model (i.e., acoustic, pronunciation, and language models), and directly outputs subword (or word) symbols [1, 2, 3, 4, 5] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-14",
"text": "In large scale training, E2E models perform competitively compared to more sophisticated conventional systems on Google traffic [6, 7] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-15",
"text": "Given its all-neural nature, an E2E model can be reasonably downsized to fit on mobile devices [6] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-16",
"text": "Despite the rapid progress made by E2E models, they still face challenges compared to state-of-the-art conventional models [8, 9] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-17",
"text": "To bridge the quality gap between a streaming recurrent neural network transducer (RNN-T) [6] and a large conventional model [8] , a two-pass framework has been proposed in [10] , which uses a non-streaming LAS decoder to rescore the RNN-T hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-18",
"text": "The rescorer attends to audio encoding from the encoder, and computes sequence-level log-likelihoods of first-pass hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-19",
"text": "The two-pass model achieves 17%-22% relative WER reduction (WERR) compared to RNN-T [6] and has a similar WER to a large conventional model [8] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-20",
"text": "A class of neural correction models post-process hypotheses using only the text information, and can be considered as second-pass models [11, 12, 13] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-21",
"text": "The models typically use beam search to generate new hypotheses, compared to rescoring where one leverages external language models trained with large text corpora [14] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-22",
"text": "For example, a neural correction model in [11] takes first-pass text hypotheses and generates new sequences to improve numeric utterance recognition [15] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-23",
"text": "A transformer-based spelling correction model is proposed in [12] to correct the outputs of a connectionist temporal classification model in Mandarin ASR."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-24",
"text": "In addition, [13] leverages text-to-speech (TTS) audio to train an attention-based neural spelling corrector to improve LAS decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-25",
"text": "These neural correction models typically use only text as inputs, while the aforementioned two-pass model attends to acoustics alone for second-pass processing."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-26",
"text": "In this work, we propose to combine acoustics and first-pass text hypotheses for second-pass decoding based on the deliberation network [16] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-27",
"text": "The deliberation model has been used in state-of-the-art machine translation [17] , or generating intermediate representation in speech-to-text translation [18] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-28",
"text": "Our deliberation model has a similar structure as [16] : An RNN-T model generates the first-pass hypotheses, and deliberation attends to both acoustics and first-pass hypotheses for a second-pass decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-29",
"text": "We encode first-pass hypotheses bidirectionally to leverage context information for decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-30",
"text": "Note that the first-pass hypotheses are sequences of wordpieces [19] and are usually short in VS, and thus the encoding should have limited impact on latency."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-31",
"text": "Our experiments are conducted using the same training data as in [20, 21] , which is from multiple domains such as Voice Search, YouTube, Farfield and Telephony."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-32",
"text": "We first analyze the behavior of the deliberation model, including performance when attending to multiple RNN-T hypotheses, contribution of different attention, and rescoring vs. beam search."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-33",
"text": "We apply additional encoder (AE) layers and minimum WER (MWER) training [22] to further improve quality."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-34",
"text": "The results show that our MWER trained 8-hypothesis deliberation model performs 11% relatively better than LAS rescoring [10] in VS WER, and up to 15% for proper noun recognition."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-35",
"text": "Joint training further improves VS slightly (2%) but significantly for a proper noun test set: 9%."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-36",
"text": "As a result, our best deliberation model achieves a WER of 5.0% on VS, which is 21% relatively better than the large conventional model [8] (6.3% VS WER)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-37",
"text": "Lastly, we analyze the computational complexity of the deliberation model, and show some decoding examples to understand its strength."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-39",
"text": "**DELIBERATION BASED TWO-PASS E2E ASR**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-41",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-42",
"text": "As shown in Fig. 1 , our deliberation network consists of three major components: A shared encoder, an RNN-T decoder [1] , and a deliberation decoder, similar to [10, 16] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-43",
"text": "The shared encoder takes log-mel filterbank energies, x = (x1, ..., xT ), where T denotes the number of frames, and generates an encoding e. The encoder output e is then fed to an RNN-T decoder to produce first-pass decoding results yr in a streaming fashion."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-44",
"text": "Then the deliberation decoder attends to both e and yr to predict a new sequence y d ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-45",
"text": "We use a bidirectional encoder to further encode yr for useful context information, and the output is denoted as h b ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-46",
"text": "Note that we could use mul-"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-48",
"text": "**Z 5 T M S B Y D C C K 8 I A W K D Z K L Y C Z F P U W B E 4 A G Z I A 5 / 4 N B 7 T S V P + P M H R F B L A C C C J V 8 + 7 P X J 2 K A M Q L U E K 2 7 R P O A L X E F N A Q W L 3 U P Z G M H I Z J G X Y O S R E X 7 2 F T 8 H J 8 Y P 4 / D W J K N A U / D 3 X M Z I B Q E R 4 H P J A G M 9 X X T Y V 5 X 6 6 Y Q X N O Z L 0 K K T N L Z O J A V G G I 8 Y Q L 3 U W I U X N G A O Y Q B W Z E D E K U O M M T K J G R 3 / S U L 0 K P V 3 B N Q 7 F A 8 U R 8 Q 4 I I H I 3 S M T P G L L L A D 3 A A G A I K K M V S E X T C R 9 W G 9 W 2 / W + 6 X 1 Y S P M D T A F W R / F A V A W B A = = < / L A T E X I T > Y D < L A T E X I T S H A 1 _ B A S E 6 4 = \" X 7 P C Q 9 N D + 9 5 Q H D F U Q 6 G X G I 3 L Y L E = \" > A A A B / N I C B Z D L S S N A F I Y N X M U 9 R C W V M 8 E I U C P J F X R Z D O O Y G R 1 A G 8 J K M M M H T I 7 M N I G H B H W V N Y 4 U C E T Z U P N T N L Y R T P W H G Y / / N M M 5 8 3 U J 4 A O S 6 8 T Y W L 5 Z X V U V B F Q 3 T 7 Z 3 D S 2 9 / Y 6 K U 0 L Z M 8 Y I L J 2 P K C Z 4 X N R A Q B B E I H K J P C G 6 3 V H 6 U U / E M 6 L 4 H N 1 B L J A N J M O I B 5 W S 0 J Z R H G 6 A P Y A X 5 F N H / Q B F U G B N Q L T T 4 U W W S 6 I H U I 3 X / B Z 4 M U 1 D F G E V R K M + B S X G 5 E Q C P 4 I V 1 U G Q W E L O M A X Z X 2 N E Q Q A C F H P + G U + 0 4 + M G L V P F G K F U 7 4 M C H E P L O A C 7 Q W I J N V + B M P / V + I K E L 0 7 O O Y Q F F T H Z O I A V G G I 8 Y Q L 7 X D I K I T N A Q O T 6 V K X H R B I K O R G Q D S G E / / I I D B P 1 + 6 Z E U D 2 V N A / K O C R O C B 2 J U 2 S J C 9 R E N 6 I F 2 O I I H D 2 H F / R Q P B R P X P V X P M T D M S Q Z A / R H X S C 3 I J C W F W = = < / L A T E X I T >**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-50",
"text": "**Z 5 T M S B Y D C C K 8 I A W K D Z K L Y C Z F P U W B E 4 A G Z G J U / 2 C Q + 3 B F Q T P T 4 U V W C 6 I G Q G 3 F / U Z 1 Y 5 P G T A I V R O U U 6 Y T G Z U Q B P 4 L L 5 V 6 Q W U L O I A X Y 1 6 A K E D N E N J 0 / X Y F G 6 E M W V U Z J W F P 3 9 0 R G I Q 3 H U W A 6 I W J D P V + B M P / V U I M E L 1 7 G Z Z I C K 3 S 2 K E W F H H H P S S B 9 R H G F M T Z A Q O L M V K Y H R B E K J R G Y C C G D / / I I T G P V 9 6 X A U Z 2 V 1 K + K O E R O C B 2 J U + S I C 1 R H N 6 I B M O I I D D 2 H F / R Q P V R P 1 P V 1 P M T D S O Q Z A / R H 1 S C 3 Y Y A W Z W = = < / L A T E X I T >**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-52",
"text": "**DELIBERATION DECODER**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-53",
"text": "Additional Encoder Fig. 1 ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-54",
"text": "Diagram of the deliberation model with an optional additional encoder (dashed box)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-55",
"text": "tiple hypotheses {y i r }, where i = 1, ..., H and H is the number of hypotheses, and in this scenario we encode each hypothesis y i r separately using the same bidirectional encoder, and then concatenate their outputs in time to form h b ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-56",
"text": "We keep the audio encoder unidirectional due to latency considerations."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-57",
"text": "Then, two attention layers are followed to attend to acoustic encoding and first-pass hypothesis encoding separately."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-58",
"text": "The two context vectors, c b and ce, are concatenated as inputs to a LAS decoder."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-59",
"text": "There are two major differences between our model and the LAS rescoring [10] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-60",
"text": "First, the deliberation model attends to both e and yr, while [10] only attends to the acoustic embedding, e. Second, our deliberation model encodes yr bidirectionally, while [10] only relies on unidirectional encoding e for decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-61",
"text": "[10] shows that the incompatibility between an RNN-T encoder and a LAS decoder leads to a gap between the rescoring model and LASonly model."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-62",
"text": "To help adaptation, we introduce a 2-layer LSTM as an additional encoder (dashed box in Fig. 1 to indicate optional) to further encode e."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-63",
"text": "We show in Sect."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-64",
"text": "4 that additional encoder layers improve both deliberation and LAS rescoring models."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-66",
"text": "**ADDITIONAL ENCODER LAYERS**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-68",
"text": "**TRAINING**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-69",
"text": "A deliberation model is typically trained from scratch by jointly optimizing all components [16] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-70",
"text": "However, we find training a two-pass model from scratch tends to be unstable in practice [10] , and thus use a two-step training process: Train the RNN-T as in [6] , and then fix the RNN-T parameters and only train the deliberation decoder and additional encoder layers as in [7, 10] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-72",
"text": "**MWER LOSS**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-73",
"text": "We apply the MWER loss [22] in training which optimizes the expected word error rate by using n-best hypotheses:"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-74",
"text": "where y i d is the ith hypothesis from the deliberation decoder, and W (y i d , y * ) the number of word errors for y i d w.r.t the ground truth target y * .P (y i d |x) is the probability of the ith hypothesis normalized over all other hypotheses to sum to 1."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-75",
"text": "B is the beam size."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-76",
"text": "In practice, we combine the MWER loss with cross-entropy (CE) loss to stabilize training: L MWER (x, y * ) = LMWER(x, y * ) + \u03b1LCE(x, y * ), where \u03b1 = 0.01 as in [22] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-78",
"text": "**JOINT TRAINING**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-79",
"text": "Training the deliberation decoder while fixing RNN-T parameters is not optimal since the model components are not jointly updated."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-80",
"text": "We propose to use a combined loss to train all modules jointly:"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-81",
"text": "where LRNNT(\u00b7) is the RNN-T loss, and LCE(\u00b7) the CE loss for the deliberation decoder."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-82",
"text": "\u03b8e, \u03b81, and \u03b82 denote the parameters of shared encoder, RNN-T decoder, and deliberation decoder, respectively."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-83",
"text": "Note that a jointly trained model can be further trained with MWER loss."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-84",
"text": "The joint training is similar to \"deep finetuning\" in [10] but without a pre-trained decoder."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-86",
"text": "**DECODING**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-87",
"text": "Our decoding consists of two passes: 1) Decode using the RNN-T model to obtain the first-pass sequence yr, and 2) Attend to both yr and e, and perform the second beam search to generate y d ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-88",
"text": "We are also curious how rescoring performs given bidirectional encoding from yr."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-89",
"text": "In rescoring, we run the deliberation decoder on yr in a teacher-forcing mode [10] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-90",
"text": "Note the difference from [10] when rescoring a hypothesis is that the deliberation network sees all candidate hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-91",
"text": "We compare rescoring and beam search in Sect."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-92",
"text": "4."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-94",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-96",
"text": "**DATASETS**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-97",
"text": "For training, we use the same multidomain datasets as in [20, 21] which include anonymized and hand-transcribed English utterances from general Google traffic, far-field environments, telephony conversations, and YouTube."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-98",
"text": "We augment the clean training utterances by artificially corrupting them by using a room simulator, varying degrees of noise, and reverberation such that the signal-to-noise ratio (SNR) is between 0dB and 30dB [23] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-99",
"text": "We also use mixed-bandwidth utterances at 8kHz or 16 kHz for training [24] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-100",
"text": "Our main test set includes~14K anonymized hand-transcribed VS utterances sampled from Google traffic."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-101",
"text": "To evaluate the performance of proper noun recognition, we report performance on a side-by-side (SxS) test set, and 4 voice command test sets [6] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-102",
"text": "The SxS set contains utterances where the LAS rescoring model [10] performs inferior to a state-of-the-art conventional model [8] , and one reason is due to proper nouns."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-103",
"text": "The voice command test sets include 3 TTS test sets created using parallel-wavenet [25] : Songs, Contacts-TTS, and Apps, where the commands include song, contact, and app names, respectively."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-104",
"text": "The Contacts-Real set contains anonymized and hand-transcribed utterances from Google traffic to communicate with a contact, for example, \"Call Jon Snow\"."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-106",
"text": "**ARCHITECTURE DETAILS AND TRAINING**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-107",
"text": "Our first-pass RNN-T model has the same architecture as [6] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-108",
"text": "The encoder of the RNN-T consists of an 8-layer Long Short-Term Memory (LSTM) [26] and the prediction network contains 2 layers."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-109",
"text": "Each LSTM layer has 2,048 hidden units followed by 640-dimensional projection."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-110",
"text": "A time-reduction layer is added after the second layer to improve the inference speed without accuracy loss."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-111",
"text": "Outputs of the encoder and prediction network are fed to a joint-network with 640 hidden units, which is followed by a softmax layer predicting 4,096 mixed-case wordpieces."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-112",
"text": "The deliberation decoder can attend to multiple hypotheses, and RNN-T hypotheses with different lengths are thus padded with end-of-sentence label \\s to a length of 120."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-113",
"text": "Each subword unit in a hypothesis is then mapped to a vector by a 96-dimensional embedding layer, and then encoded by a 2-layer bidirectional LSTM encoder, where each layer has 2,048 hidden units followed by 320-dimensional projection."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-114",
"text": "Each of the two attention models is a multi-headed attention [27] with four attention heads."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-115",
"text": "The two output context vectors are concatenated and fed to a 2-layer LAS decoder (2,048 hidden units followed by 640-dimensional projection per layer)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-116",
"text": "The LAS decoder has a 4,096-dimensional softmax layer to predict the same mixed-case wordpieces [19] as the RNN-T."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-117",
"text": "For feature extraction, we use 128-dimensional log-Mel features from 32-ms windows at a rate of 10 ms."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-118",
"text": "Each feature is stacked with three previous frames to form a 512-dimensional vector, and then downsampled to a 30-ms frame rate."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-119",
"text": "Our models are trained in Tensorflow [28] using the Lingvo framework [29] on 8\u00d78 Tensor Processing Units (TPU) slices with a global batch size of 4,096."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-121",
"text": "**COMPUTATIONAL COMPLEXITY**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-122",
"text": "We estimate the computational complexity of the deliberation decoder using the number of floating-point operations (FLOPS) required:"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-123",
"text": "where MB is the size of the bidirectional encoder, N the number of decoded tokens, and H the number of first-pass hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-124",
"text": "MD denotes the size of the LAS decoder, and B the second beam search size."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-125",
"text": "FLOPSatten is the FLOPS required for two attention layers, and we compute it as the sum of multiplying the sizes of source and query matrices with the number of time frames and N , respectively."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-126",
"text": "Our deliberation decoder contains roughly 66M parameters, where the size of the bidirectional encoder is MB = 22M, LAS decoder is MD = 42M, and attention layers have 2M parameters."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-128",
"text": "**RESULTS**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-129",
"text": "In this section we analyze the importance of certain components of the deliberation model by ablation studies, improve the model by MWER and AE layers, and select one of our best deliberation models for comparison."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-131",
"text": "**NUMBER OF RNN-T HYPOTHESES**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-132",
"text": "The deliberation decoder may attend to multiple first-pass hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-133",
"text": "We encode the hypotheses separately, and then concatenate them as the input to the attention layer."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-134",
"text": "We use a beam size of 8 for RNN-T decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-135",
"text": "Unless stated otherwise, the WER we report is for VS test set."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-136",
"text": "The third row in Table 1 shows that the WER improves slightly when increasing the number of RNN-T hypotheses from 1 to 8."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-137",
"text": "However, after applying MWER training, the WER improves continuously: 5.4% to 5.1%."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-138",
"text": "We suspect that MWER training specifically helps deliberation attend to relevant parts of first-pass hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-139",
"text": "Since 8-hypothesis model gives the best performance, we use that for subsequent experiments."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-140",
"text": "MWER training is not used for simplicity."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-142",
"text": "**ACOUSTICS VS. TEXT**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-143",
"text": "We are curious about how different attention (c b vs ce) contribute to deliberation, and thus train separate models where we attend to"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-145",
"text": "**ADDITIONAL ENCODER LAYERS**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-146",
"text": "To help the deliberation decoder better adapt to the shared encoder, we add AE layers for dedicated encoding for the deliberation decoder."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-147",
"text": "The AE consists of a 2-layer LSTM with 2,048 hidden units followed by 640-dimensional projection per layer."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-148",
"text": "Beam search is used for decoding."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-149",
"text": "In Table 3 Table 3 ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-150",
"text": "WERs (%) with or without AE layers."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-151",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-152",
"text": "**RESCORING**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-153",
"text": "We propose to use the deliberation decoder to rescore first-pass RNN-T results, and expect bidirectional encoding to help compared to LAS rescoring [10] ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-154",
"text": "Table 5 shows that the deliberation rescoring (E8) performs 5% relatively better than LAS rescoring (B3)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-155",
"text": "AE layers are added to both models."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-157",
"text": "**COMPARISONS**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-158",
"text": "From the above analysis, an MWER trained 8-hypothesis deliberation model with AE layers performs the best, and thus we use that for comparison below."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-159",
"text": "In Table 4 , we compare deliberation models with an RNN-T [6] and LAS rescoring model [10] To understand where the improvement comes from, in Fig. 2 we show an example of deliberation attention distribution on the RNN-T hypotheses (x-axis) at every step of the second-pass decoding (yaxis)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-160",
"text": "We can see the attention selects mainly one wordpiece when the first-pass result is correct (e.g. \" weather\", \" in\", etc)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-161",
"text": "However, when the first-pass output is wrong (e.g. \"ond\" and \"on\"), the attention looks ahead at \" Nevada\" for context information for correction."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-162",
"text": "We speculate that the attention functions similarly as a context-aware language model on the first-pass sequence."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-163",
"text": "In Table 4 , we also report gigaFLOPS (GFLOPS) estimated using Eq. (3) on the 90%-tile VS set, where an utterance has roughly 109 audio frames and a decoded sequence of 14 tokens."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-164",
"text": "Since the deliberation decoder has a larger size than LAS decoder (67MB vs. 33MB), it requires around 1.8 times GFLOPS as LAS rescoring."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-165",
"text": "The increase mainly comes from the bidirectional encoder for 8 first-pass hypotheses."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-166",
"text": "However, we note that the computation can be parallelized across hypotheses [10] and should have less impact on latency."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-167",
"text": "Latency estimation is complicated, and we will quantify that in future works."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-169",
"text": "**DECODING EXAMPLES**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-170",
"text": "Lastly, we compare some decoding examples between deliberation and LAS rescoring in Table 6 ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-171",
"text": "One type of wins for deliberation is URL, where the deliberation model corrects and concatenates string pieces to a single one since it sees the whole first-pass hypothesis."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-172",
"text": "Second type is proper noun."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-173",
"text": "Leveraging the context, deliberation re- Fig. 2 ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-174",
"text": "Example attention probabilities on a first-pass RNN-T hypothesis: \"Weather in London Nevada\", for generating the second-pass result \"Weather in Lund Nevada\"."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-175",
"text": "Brighter colors correspond to higher probabilities."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-176",
"text": "A beginning wordpiece starts with a space marker (i.e., \" \")."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-177",
"text": "s denotes start of sentence, and \\s the end of sentence."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-178",
"text": "alizes the previous word should be a proper noun (i.e. Walmart)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-179",
"text": "Third, the deliberation decoder corrects semantic errors (china \u2192 train)."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-180",
"text": "On the other hand, we also see some losses of deliberation due to over-correction of proper nouns or spelling difference."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-181",
"text": "The former is probably from knowledge in training, and the latter is benign and does not affect semantics."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-183",
"text": "**LAS RESCORING**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-184",
"text": "Deliberation Quality times.com quadcitytimes.com Where my job application Walmart job application china near me train near me bio of Chesty Fuller bio of Chester Fuller 2016 Kia Forte5 2016 Kia Forte 5 Table 6 ."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-185",
"text": "Decoding examples of deliberation and LAS rescoring."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-186",
"text": "Deliberation wins are in green and losses in red."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-187",
"text": "----------------------------------"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-188",
"text": "**CONCLUSION**"
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-189",
"text": "We presented a new two-pass E2E ASR based on the deliberation network, and our best model obtained significant improvements over LAS rescoring in both VS tasks and proper noun recognition: 12% and 23% WERR, respectively."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-190",
"text": "The model also performs 21% relatively better than a large conventional model for VS."
},
{
"sent_id": "d53d1b53168041baea5b5002b46627-C001-191",
"text": "Although the model requires more computation than LAS rescoring, batching across hypotheses can improve latency."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d53d1b53168041baea5b5002b46627-C001-17"
]
],
"cite_sentences": [
"d53d1b53168041baea5b5002b46627-C001-17"
]
},
"@DIF@": {
"gold_contexts": [
[
"d53d1b53168041baea5b5002b46627-C001-34"
],
[
"d53d1b53168041baea5b5002b46627-C001-59",
"d53d1b53168041baea5b5002b46627-C001-60"
],
[
"d53d1b53168041baea5b5002b46627-C001-90"
],
[
"d53d1b53168041baea5b5002b46627-C001-102"
],
[
"d53d1b53168041baea5b5002b46627-C001-153"
]
],
"cite_sentences": [
"d53d1b53168041baea5b5002b46627-C001-34",
"d53d1b53168041baea5b5002b46627-C001-59",
"d53d1b53168041baea5b5002b46627-C001-60",
"d53d1b53168041baea5b5002b46627-C001-90",
"d53d1b53168041baea5b5002b46627-C001-102",
"d53d1b53168041baea5b5002b46627-C001-153"
]
},
"@SIM@": {
"gold_contexts": [
[
"d53d1b53168041baea5b5002b46627-C001-42"
],
[
"d53d1b53168041baea5b5002b46627-C001-84"
]
],
"cite_sentences": [
"d53d1b53168041baea5b5002b46627-C001-42",
"d53d1b53168041baea5b5002b46627-C001-84"
]
},
"@USE@": {
"gold_contexts": [
[
"d53d1b53168041baea5b5002b46627-C001-70"
],
[
"d53d1b53168041baea5b5002b46627-C001-89"
],
[
"d53d1b53168041baea5b5002b46627-C001-159"
]
],
"cite_sentences": [
"d53d1b53168041baea5b5002b46627-C001-70",
"d53d1b53168041baea5b5002b46627-C001-89",
"d53d1b53168041baea5b5002b46627-C001-159"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"d53d1b53168041baea5b5002b46627-C001-166"
]
],
"cite_sentences": [
"d53d1b53168041baea5b5002b46627-C001-166"
]
}
}
},
"ABC_f8c992a887a7b7af8b3aa45f72dca7_4": {
"x": [
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-135",
"text": "The popular bag-of-words (BOW) assumption represents a text as a histogram of word occurrences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-160",
"text": "**EVALUATION**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-2",
"text": "To date, few attempts have been made to develop new methods and validate existing ones for automatic evaluation of discourse coherence in the noisy domain of learner texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-3",
"text": "We present the first systematic analysis of several methods for assessing coherence under the framework of automated assessment (AA) of learner free-text responses."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-4",
"text": "We examine the predictive power of different coherence models by measuring the effect on performance when combined with an AA system that achieves competitive results, but does not use discourse coherence features, which are also strong indicators of a learner's level of attainment."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-5",
"text": "Additionally, we identify new techniques that outperform previously developed ones and improve on the best published result for AA on a publically-available dataset of English learner free-text examination scripts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-8",
"text": "Automated assessment (hereafter AA) systems of English learner text assign grades based on textual features which attempt to balance evidence of writing competence against evidence of performance errors."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-9",
"text": "Previous work has mostly treated AA as a supervised text classification or regression task."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-10",
"text": "A number of techniques have been investigated, including cosine similarity of feature vectors (Attali and Burstein, 2006) , often combined with dimensionality reduction techniques such as Latent Semantic Analysis (LSA) (Landauer et al., 2003) , and generative machine learning models (Rudner and Liang, 2002) as well as discriminative ones (Yannakoudakis et al., 2011) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-11",
"text": "As multiple factors influence the linguistic quality of texts, such systems exploit features that correspond to different properties of texts, such as grammar, style, vocabulary usage, topic similarity, and discourse coherence and cohesion."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-12",
"text": "Cohesion refers to the use of explicit linguistic cohesive devices (e.g., anaphora, lexical semantic relatedness, discourse markers, etc.) within a text that can signal primarily suprasentential discourse relations between textual units (Halliday and Hasan, 1976) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-13",
"text": "Cohesion is not the only mechanism of discourse coherence, which may also be inferred from meaning without presence of explicit linguistic cues."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-14",
"text": "Coherence can be assessed locally in terms of transitions between adjacent clauses, parentheticals, and other textual units capable of standing in discourse relations, or more globally in terms of the overall topical coherence of text passages."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-15",
"text": "There is a large body of work that has investigated a number of different coherence models on news texts (e.g., Lin et al. (2011) , Elsner and Charniak (2008) , and Soricut and Marcu (2006) )."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-16",
"text": "Recently, Pitler et al. (2010) presented a detailed survey of current techniques in coherence analysis of extractive summaries."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-17",
"text": "To date, however, few attempts have been made to develop new methods and validate existing ones for automatic evaluation of discourse coherence and cohesion in the noisy domain of learner texts, where spelling and grammatical errors are common."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-18",
"text": "Coherence quality is typically present in marking criteria for evaluating learner texts, and it is iden-tified by examiners as a determinant of the overall score."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-19",
"text": "Thus we expect that adding a coherence metric to the feature set of an AA system would better reflect the evaluation performed by examiners and improve performance."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-20",
"text": "The goal of the experiments presented in this paper is to measure the effect a number of (previously-developed and new) coherence models have on performance when combined with an AA system that achieves competitive results, but does not use discourse coherence features."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-21",
"text": "Our contribution is threefold: 1) we present the first systematic analysis of several methods for assessing discourse coherence in the framework of AA of learner free-text responses, 2) we identify new discourse features that serve as proxies for the level of (in)coherence in texts and outperform previously developed techniques, and 3) we improve the best results reported by Yannakoudakis et al. (2011) on the publically available 'English as a Second or Other Language' (ESOL) corpus of learner texts (to date, this is the only public-domain corpus that contains grades)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-22",
"text": "Finally, we explore the utility of our best model for assessing the incoherent 'outlier' texts used in Yannakoudakis et al. (2011) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-23",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-24",
"text": "**EXPERIMENTAL DESIGN & BACKGROUND**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-25",
"text": "We examine the predictive power of a number of different coherence models by measuring the effect on performance when combined with an AA system that achieves state-of-the-art results, but does not use discourse coherence features."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-26",
"text": "Specifically, we describe a number of different experiments improving on the AA system presented in Yannakoudakis et al. (2011) ; AA is treated as a rank preference supervised learning problem and ranking Support Vector Machines (SVMs) (Joachims, 2002) are used to explicitly model the grade relationships between scripts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-27",
"text": "This system uses a number of different linguistic features that achieve good performance on the AA task."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-28",
"text": "However, these features only focus on lexical and grammatical properties, as well as errors within individual sentences, ignoring discourse coherence, which is also present in marking criteria for evaluating learner texts, as well as a strong indicator of a writer's understanding of a language."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-29",
"text": "Also, in Yannakoudakis et al. (2011) , experiments are presented that test the validity of the system using a number of automatically-created 'outlier' texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-30",
"text": "The results showed that the model is vulnerable to input where individually high-scoring sentences are randomly ordered within a text."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-31",
"text": "Failing to identify such pathological cases makes AA systems vulnerable to subversion by writers who understand something of its workings, thus posing a threat to their validity."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-32",
"text": "For example, an examinee might learn by rote a set of well-formed sentences and reproduce these in an exam in the knowledge that an AA system is not checking for prompt relevance or coherence 1 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-34",
"text": "**DATASET & EXPERIMENTAL SETUP**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-60",
"text": "Discourse connectives (such as but or because) relate propositions expressed by different clauses or sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-35",
"text": "We use the First Certificate in English (FCE) ESOL examination scripts 2 (upper-intermediate level assessment) described in detail in Yannakoudakis et al. (2011) , extracted from the Cambridge Learner Corpus 3 (CLC)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-36",
"text": "The dataset consists of 1,238 texts between 200 and 400 words produced by 1,238 distinct learners in response to two different prompts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-37",
"text": "An overall mark has been assigned in the range 1-40."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-38",
"text": "For all experiments, we use a series of 5-fold cross-validation runs on 1,141 texts from the examination year 2000 to evaluate performance as well as generalization of numerous models."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-39",
"text": "Moreover, we identify the best model on year 2000 and we also test it on 97 texts from the examination year 2001, previously used in Yannakoudakis et al. (2011) to report the best published results."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-40",
"text": "Validating the results on a different examination year tests generalization to some prompts not used in 2000, and also allows us to test correlation between examiners and the AA system."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-41",
"text": "Again, we treat AA as a rank preference learning problem and use SVMs, utilizing the SVM light package (Joachims, 2002) , to facilitate comparison with Yannakoudakis et al. (2011) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-43",
"text": "**DISCOURSE COHERENCE**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-44",
"text": "We focus on the development and evaluation of (automated) methods for assessing coherence in learner texts under the framework of AA."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-45",
"text": "Most of the methods we investigate require syntactic analysis."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-46",
"text": "As in Yannakoudakis et al. (2011) , we analyze all texts using the RASP toolkit (Briscoe et al., 2006) 4 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-48",
"text": "**'SUPERFICIAL' PROXIES**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-49",
"text": "In this section we introduce diverse classes of 'superficial' cohesive features that serve as proxies for coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-50",
"text": "Surface text properties have been assessed in the framework of automatic summary evaluation (Pitler et al., 2010) , and have been shown to significantly correlate with the fluency of machinetranslated sentences (Chae and Nenkova, 2009 )."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-52",
"text": "**PART-OF-SPEECH (POS) DISTRIBUTION**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-53",
"text": "The AA system described in Yannakoudakis et al. (2011) exploited features based on POS tag sequences, but did not consider the distribution of POS types across grades."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-54",
"text": "In coherent texts, constituent clauses and sentences are related and depend on each other for their interpretation."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-55",
"text": "Anaphors such as pronouns link the current sentence to those where the entities were previously mentioned."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-56",
"text": "Pronouns can be directly related to (lack of) coherence and make intuitive sense as cohesive devices."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-57",
"text": "We compute the number of pronouns in a text and use it as a shallow feature for capturing coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-59",
"text": "**DISCOURSE CONNECTIVES**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-61",
"text": "The presence of such items in a text should be indicative of (better) coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-62",
"text": "We thus compute a number of shallow cohesive features as proxies for coherence, based on fixed lists of words belonging to the following categories: (a) Addition (e.g., additionally), (b) Comparison (e.g., likewise), (c) Contrast (e.g., whereas) and (d) Conclusion (e.g., therefore), and use the frequencies of these four categories as features."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-64",
"text": "**WORD LENGTH**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-65",
"text": "The previous AA system treated script length as a normalizing feature, but otherwise avoided such 'superficial' proxies of text quality."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-66",
"text": "However, many cohesive words are longer than average, especially for the closed-class functional component of English vocabulary."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-67",
"text": "We thus assess the minimum, maximum and average word length as a superficial proxy for coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-69",
"text": "**SEMANTIC SIMILARITY**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-70",
"text": "We explore the utility of inter-sentential feature types for assessing discourse coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-71",
"text": "Among the features used in Yannakoudakis et al. (2011) , none explicitly captures coherence and none models intersentential relationships."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-72",
"text": "Incremental Semantic analysis (ISA) (Baroni et al., 2007) is a word-level distributional model that induces a semantic space from input texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-73",
"text": "ISA is a fully-incremental variation of Random Indexing (RI) (Sahlgren, 2005) , which can efficiently capture second-order effects in common with other dimensionality-reduction methods based on singular value decomposition, but does not rely on stoplists or global statistics for weighting purposes."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-74",
"text": "Utilizing the S-Space package (Jurgens and Stevens, 2010), we trained an ISA model 5 using a subset of ukWaC (Ferraresi et al., 2008) , a large corpus of English containing more than 2 billion tokens."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-75",
"text": "We used the POS tagger lexicon provided with the RASP system to discard documents whose proportion of valid English words to total words is less than 0.4; 78,000 documents were extracted in total and were then preprocessed replacing URLs, email addresses, IP addresses, numbers and emoticons with special markers."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-76",
"text": "To measure local coherence we define the similarity between two sentences s i and s i+1 as the maximum cosine similarity between the history vectors of the words they contain."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-77",
"text": "The overall coherence of a text T is then measured by taking the mean of all sentence-pair scores:"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-78",
"text": ") is the cosine similarity between the history vectors of the k th word in s i and the j th word in s i+1 , and n is the total number of sentences 6 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-79",
"text": "We investigate the efficacy of ISA by adding this coherence score, as well as the maximum sim value found over the entire text, to the vectors of features associated with a text."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-80",
"text": "The hypothesis is that the degree of semantic relatedness between adjoining sentences serves as a proxy for local discourse coherence; that is, coherent text units contain semantically-related words."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-81",
"text": "Higgins et al. (2004) and Higgins and Burstein (2007) use RI to determine the semantic similarity between sentences of same/different discourse segments (e.g., from the essay thesis and conclusion, or between sentences and the essay prompt), and assess the percentage of sentences that are correctly classified as related or unrelated."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-82",
"text": "The main differences from our approach are that we assess the utility of semantic space models for predicting the overall grade for a text, in contrast to binary classification at the sentence-level, and we use ISA rather than RI 7 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-84",
"text": "**ENTITY-BASED COHERENCE**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-85",
"text": "The entity-based coherence model, proposed by Barzilay and Lapata (2008) , is one of the most popular statistical models of inter-sentential coherence, and learns coherence properties similar to those employed by Centering Theory (Grosz et al., 1995) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-86",
"text": "Local coherence is modeled on the basis of sequences of entity mentions that are labeled with their syntactic roles (e.g., subject, object)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-87",
"text": "We construct the entity grids using the Brown Coherence Toolkit 8,9 (Elsner and Charniak, 2011b) , and use as features the probabilities of different entity transition types, defined in terms of their role in adjacent sentences 10 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-88",
"text": "Burstein et al. (2010) show how the entity-grid can be used to discriminate highcoherence from low-coherence learner texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-89",
"text": "The main difference with our approach is that we evaluate the entity-grid model in the context of AA text grading, rather than binary classification."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-90",
"text": "7 We also used RI in addition to ISA, and found that it did not yield significantly different results."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-91",
"text": "In particular, we trained a RI model with 2,000 dimensions and a context window of 3 on the same ukWaC data."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-92",
"text": "Below we only report results for the fully-incremental ISA model."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-93",
"text": "8 https://bitbucket.org/melsner/browncoherence 9 The tool does not perform full coreference resolution; instead, coreference is approximated by linking entities that share a head noun."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-94",
"text": "10 We represent entities with specified roles (Subject, Object, Neither, Absent), use transition probabilities of length 2, 3 and 4, and a salience option of 2."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-96",
"text": "**PRONOUN COREFERENCE MODEL**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-97",
"text": "Pronominal anaphora is another important aspect of coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-98",
"text": "Charniak and Elsner (2009) present an unsupervised generative model of pronominal anaphora for coherence modeling."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-99",
"text": "In their implementation, they model each pronoun as generated by an antecedent somewhere in the previous two sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-100",
"text": "If a 'good' antecedent is found, the probability of a pronoun will be high; otherwise, the probability will be low."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-101",
"text": "The overall probability of a text is then calculated as the probability of the resulting sequence of pronoun assignments."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-102",
"text": "In our experiments, we use the pre-trained model distributed by Charniak and Elsner (2009) for news text to estimate the probability of a text and include it as a feature."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-103",
"text": "However, this model is trained on high-quality texts, so performance may deteriorate when applied to learner texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-104",
"text": "It is not obvious how to train such a model on learner texts and we leave this for future research."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-106",
"text": "**DISCOURSE-NEW MODEL**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-107",
"text": "Elsner and Charniak (2008) apply a discourse-new classifier to model coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-108",
"text": "Their classifier distinguishes NPs whose referents have not been previously mentioned in the discourse from those that have been already introduced, using a number of syntactic and lexical features."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-109",
"text": "To model coherence, they assign each NP in a text a label L np \u2208 {new, old} 11 , and calculate the probability of a text as \u03a0 np:N P s P (L np |np)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-110",
"text": "Again, we use the pretrained model distributed by Charniak and Elsner (2009) for news text to find the probability of a text following Elsner and Charniak (2008) and include it as a feature."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-112",
"text": "**IBM COHERENCE MODEL**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-113",
"text": "Soricut and Marcu (2006) adapted the IBM model 1 (Brown et al., 1994) used in machine translation (MT) to model local discourse coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-114",
"text": "The intuition behind the IBM model in MT is that the use of certain words in a source language is likely to trigger the use of certain words in a target language."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-115",
"text": "Instead, they hypothesized that the use of certain words in a sentence tends to trigger the use of certain words in an adjoining sentence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-116",
"text": "In contrast to semantic space models such as ISA or RI (discussed above), this method models the intuition that local coherence is signaled by the identification of word co-occurrence patterns across adjacent sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-117",
"text": "We compute two features introduced by Soricut and Marcu (2006) : the forward likelihood and the backward likelihood."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-118",
"text": "The first refers to the likelihood of observing the words in sentence s i+1 conditioned on s i , and the latter to the likelihood of observing the words in s i conditioned on s i+1 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-119",
"text": "We extract 3 million adjacent sentences from ukWaC 12 , and use the GIZA++ (Och and Ney, 2000) implementation of IBM model 1 to obtain the probabilities of recurring patterns."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-120",
"text": "The forward and backward probabilities are calculated over the entire text, and their values are used as features in our feature vectors 13 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-121",
"text": "We further extend the above model and incorporate syntactic aspects of text coherence by training on POS tags instead of lexical items."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-122",
"text": "We try to model the intuition that local coherence is signaled by the identification of POS co-occurrence patterns across adjacent sentences, where the use of certain POS tags in a sentence tends to trigger the use of other POS tags in an adjacent sentence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-123",
"text": "We analyze 3 million adjacent sentences using the RASP POS tagger and train the same IBM model to obtain the probabilities of recurring POS patterns."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-125",
"text": "**LEMMA/POS COSINE SIMILARITY**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-126",
"text": "A simple method of incorporating (syntactic) aspects of text coherence is to use cosine similarity between vectors of lemma and/or POS-tag counts in adjacent sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-127",
"text": "We experiment with both: each sentence is represented by a vector whose dimension depends on the total number of lemmas/POStypes."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-128",
"text": "The sentence vectors are weighted using lemma/POS frequency, and the cosine similarity between adjacent sentences is calculated."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-129",
"text": "The coherence of a text T is then calculated as the average value of cosine similarity over the entire text 14 :"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-130",
"text": "12 We use the same subset of documents as the ones used to train our ISA model in Section 4.2."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-131",
"text": "13 Pitler et al. (2010) have also investigated the IBM model to measure text quality in automatically-generated texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-132",
"text": "14 Pitler et al. (2010) use POS cosine similarity to measure continuity in automatically-generated texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-134",
"text": "**LOCALLY-WEIGHTED BAG-OF-WORDS**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-136",
"text": "While computationally efficient, such a representation is unable to maintain any sequential information."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-137",
"text": "The locally-weighted bag-of-words (LOW-BOW) framework, introduced by Lebanon et al. (2007) , is a sequentially-sensitive alternative to BOW."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-138",
"text": "In BOW, we represent a text as a histogram over the vocabulary used to generate that text."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-139",
"text": "In LOWBOW, a text is represented by a set of local histograms computed across the whole text, but smoothed by kernels centered on different locations."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-140",
"text": "More specifically, a smoothed characterization of the local histogram is obtained by integrating a length-normalized document with respect to a nonuniform measure that is concentrated around a particular location \u00b5 \u2208 [0, 1]."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-141",
"text": "In accordance with the statistical literature on non-parametric smoothing, we refer to such a measure as a smoothing kernel."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-142",
"text": "The kernel parameters \u00b5 and \u03c3 specify the local histogram's position in the text (i.e., where it is centered) and its scale (i.e., to what extent it is smoothed over the surrounding region) respectively."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-143",
"text": "In contrast to BOW or n-grams, which keep track of frequently occurring patterns independent of their positions, this representation is able to robustly capture medium and long range sequential trends in a text by keeping track of changes in the histograms from its beginning to end."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-144",
"text": "Geometrically, LOWBOW uses local smoothing to embed texts as smooth curves in the multinomial simplex."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-145",
"text": "These curves summarize the progression of semantic and/or statistical trends through the text."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-146",
"text": "By varying the amount of smoothing we obtain a family of sequential representations possessing different sequential resolutions or scales."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-147",
"text": "Low resolution representations capture topic trends and shifts while ignoring finer details."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-148",
"text": "High resolution representations capture fine sequential details but make it difficult to grasp the general trends within the text 15 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-149",
"text": "Since coherence involves both cohesive lexical devices and sequential progression within a text, we believe that LOWBOW can be used to assess the sequential content and the global structure and coher-ence of texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-150",
"text": "We use a publically-available LOW-BOW implementation 16 to create local histograms over word unigrams."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-151",
"text": "For the LOWBOW kernel smoothing function (see above), we use the Gaussian probability density function restricted to [0, 1] and re-normalized, and a smoothing \u03c3 value of 0.02."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-152",
"text": "Additionally, we consider a total number of 9 local histograms (discourse segments)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-153",
"text": "We further extend the above model and incorporate syntactic aspects of text coherence by using local histograms over POS unigrams."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-154",
"text": "This representation is able to capture sequential trends abstracted into POS tags."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-155",
"text": "We try to model the hypothesis that coherence is signaled by sequential, mostly inter-sentential progression of POS types."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-156",
"text": "Since each text is represented by a set of local histrograms/vectors, and standard SVM kernels cannot work with such input spaces, we use instead a kernel defined over sets of vectors: the diffusion kernel (Lafferty and Lebanon, 2005) compares local histograms in a one-to-one fashion (i.e., histograms at the same locations are compared to each other), and has proven to be useful for related tasks (Lebanon et al., 2007; Escalante et al., 2011) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-157",
"text": "To the best of our knowledge, LOWBOW representations have not been investigated for coherence evaluation (under the AA framework)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-158",
"text": "So far, they have been applied to discourse segmentation (AMIDA, 2007), text categorization (Lebanon et al., 2007) , and authorship attribution (Escalante et al., 2011)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-161",
"text": "We examine the predictive power of each of the coherence models/features described in Section 4 by measuring the effect on performance when combined with an AA system that achieves state-of-theart results on the FCE dataset, but does not use discourse coherence features."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-162",
"text": "In particular, we use the system described in Yannakoudakis et al. (2011) as our baseline AA system."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-163",
"text": "Discourse coherence is a strong indicator of thorough knowledge of a second language and thus we expect coherence features to further improve performance of AA systems."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-164",
"text": "We evaluate the grade predictions of our models against the gold standard grades in the dataset using Pearson's product-moment correlation coeffi-16 http://goo.gl/yQ0Q0 cient (r) and Spearman's rank correlation coefficient (\u03c1) as is standard in AA research (Briscoe et al., 2010) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-165",
"text": "Table 1 gives results obtained by augmenting the baseline model with each of the coherence features described above."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-166",
"text": "In each of these experiments, we perform 5-fold cross-validation 17 using all 1,141 texts from the exam year 2000 (see Section 3)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-167",
"text": "Most of the resulting models have minimal effect on performance 18 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-168",
"text": "However, word length, ISA, LOWBOW lex , and the IBM model POS f derived models all improve performance, while larger differences are observed in r. The highest performance -0.675 and 0.678 -is obtained with ISA, while the second best feature is word length."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-169",
"text": "The entity-grid, the pronoun model and the discourse-new model do not improve on the baseline."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-170",
"text": "Although these models have been successfully used as components in state-of-the-art systems for discriminating coherent from incoherent news documents (Elsner and Charniak, 2011b) , and the entity-grid model has also been successfully applied to learner text (Burstein et al., 2010) , they seem to have minimal impact on performance, while the discourse-new model decreases \u03c1 by 0.01."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-171",
"text": "On the other hand, LOWBOW lex and LOWBOW POS give an increase in performance, which confirms our hypothesis that local histograms are useful."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-172",
"text": "Also, the former seems to perform slightly better than the latter."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-173",
"text": "Our adapted version of the IBM model -IBM model POS -performs better than its lexicalized version, which does not have an impact on performance, while larger differences are observed in r. Additionally, the increase in performance is larger than the one obtained with the entity-grid, pronoun or discourse-new model."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-174",
"text": "The forward version of IBM model POS seems to perform slightly better than the backward one, while the results are comparable to LOWBOW POS and outperformed by LOWBOW lex ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-175",
"text": "The rest of the models do not perform as well; the number of pronouns or discourse connectives gives low results, while lemma and POS cosine similarity between adjacent sentences are also 17 We compute mean values of correlation coefficients by first applying the r-to-Z Fisher transformation, and then using the Fisher weighted mean correlation coefficient (Faller, 1981) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-176",
"text": "18 Significance tests in averaged correlations are omitted as variable estimates are produced, whose variance is hard to be estimated unbiasedly."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-177",
"text": "among the weakest predictors."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-178",
"text": "Elsner and Charniak (2011b) have shown that combining the entity-grid with the pronoun, discourse-new and lexicalized IBM models gives state-of-the-art results for discriminating news documents and their random permutations."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-179",
"text": "We also combine these models and assess their performance under the AA framework."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-180",
"text": "Row 16 of Table 1 shows that the combination does not give an improvement over the individual models."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-181",
"text": "Moreover, combining all feature classes together in row 17 does not yield higher results than those obtained with ISA, while \u03c1 is no better than the baseline."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-182",
"text": "In the following experiments, we evaluate the best model identified on year 2000 on a set of 97 texts from the exam year 2001, previously used in Yannakoudakis et al. (2011) to report results of the final best system."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-183",
"text": "Validating the model on a different exam year also shows us the extent to which it generalizes between years."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-184",
"text": "Table 2 published results on the 2001 texts, getting closer to the upper-bound."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-185",
"text": "The upper-bound on this dataset 20 is 0.796 and 0.792 r and \u03c1 respectively, calculated by taking the average correlation between the FCE grades and the ones provided by 4 senior ESOL examiners 21 ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-186",
"text": "Table 3 also presents the average correlation between our extended AA system's predicted grades and the 4 examiners' grades, in addition to the original FCE grades from the dataset."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-187",
"text": "Again, our extended model improves over the baseline."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-188",
"text": "Finally, we explore the utility of our best model for assessing the publically available 'outlier' texts used in Yannakoudakis et al. (2011) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-189",
"text": "The previous AA system is unable to downgrade appropriately 'outlier' scripts containing individually high-scoring sentences with poor overall coherence, created by randomly ordering a set of highly-marked texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-190",
"text": "To test our best system, we train an SVM rank preference model with the ISA-derived coherence feature, which can explicitly capture such sequential trends."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-191",
"text": "A generic model for flagging putative 'outlier' texts -whose predicted score is lower than a predefined threshold -for manual checking might be used as the first stage of a deployed AA system."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-192",
"text": "The ISA model improves r and \u03c1 by 0.320 and 0.463 respectively for predicting a score on this type of 'outlier' texts and their original version (Table 4) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-193",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-194",
"text": "**ANALYSIS & DISCUSSION**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-195",
"text": "In the previous section, we evaluated various cohesion and coherence features on learner data, and found different patterns of performance compared to those previously reported on news texts (see Section 7 for more details)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-196",
"text": "Although most of the models examined gave a minimal effect on AA performance, ISA, LOWBOW lex , IBM model POS f and word length dependent correlations (Williams, 1959; Steiger, 1980) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-197",
"text": "20 See Yannakoudakis et al. (2011) for details."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-198",
"text": "21 The examiners' scores are also distributed with the FCE dataset."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-199",
"text": "gave a clear improvement in correlation, with larger differences in r. Our results indicate that coherence metrics further improve the performance of a competitive AA system."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-200",
"text": "More specifically, we found the ISA-derived feature to be the most effective contributor to the prediction of text quality."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-201",
"text": "This suggests that incoherence in FCE texts might be due to topic discontinuities."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-202",
"text": "Also, the improvement obtained by LOWBOW suggests that patterns of sequential progression within a text can be useful: coherent texts appear to use similar token distributions at similar locations across different documents."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-203",
"text": "The word length feature was successfully used as a proxy for coherence, perhaps because many cohesive words are longer than average."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-204",
"text": "However, such a feature can also capture further aspects of texts, such as lexical complexity, so further investigation is needed to identify the extent to which it measures different properties."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-205",
"text": "On the other hand, the minimal effect of the entity-grid, pronoun and discourse-new model suggests that infelicitous use of pronominal forms or sequences of entities may not be an issue in FCE texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-206",
"text": "Preliminary investigation of the scripts showed that learners tend to repeat the same entity names or descriptions rather than use pronouns or shorter descriptions."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-207",
"text": "A possible explanation for the difference in performance between the lexicalized and POS IBM model is that the latter abstracts away from lexical information and thus avoids misspellings and reduces sparsity."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-208",
"text": "Also, our discourse connective classes do not seem to have a predictive power."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-209",
"text": "This may be because our manually-built word lists do not have sufficient coverage."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-210",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-211",
"text": "**PREVIOUS WORK**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-212",
"text": "Comparatively few metrics have been investigated for evaluating coherence in (ESOL) learner texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-213",
"text": "Miltsakaki and Kukich (2004) employ e-Rater (Attali and Burstein, 2006) , an essay scoring system, and show that Centering Theory's Rough-Shift transitions (Grosz et al., 1995) contribute significantly to the assessment of learner texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-214",
"text": "Higgins et al. (2004) and Higgins and Burstein (2007) use RI to determine the semantic similarity between sentences of same/different discourse segments."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-215",
"text": "Their model is based on a number of different semantic similarity scores and assesses the percentage of sentences that are correctly classified as (un)related."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-216",
"text": "Among their results, they found that it is hard to beat the baseline (as 98.1% of the sentences were annotated as 'highly related') and identify sentences which are not related to other ones in the same discourse segment."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-217",
"text": "We demonstrate that the related fully-incremental ISA model can be used to improve AA grading accuracy on the FCE dataset, as opposed to classifying the (non-)relatedness of sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-218",
"text": "Burstein et al. (2010) show how the entity-grid can be used to discriminate high-coherence from low-coherence learner texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-219",
"text": "They augment this model with additional features related to writing quality and word usage, and show a positive effect in performance for automated coherence prediction of student essays of different populations."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-220",
"text": "On the FCE dataset used here, entity-grids do not improve AA grading accuracy."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-221",
"text": "This may be because the texts are shorter or because grading is a more difficult task than binary classification."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-222",
"text": "Application of their augmented entity-grid model to FCE texts would be an interesting avenue for future research."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-223",
"text": "Foltz et al. (1998) examine local coherence in textbooks and articles using Latent Semantic Analysis (LSA) (Landauer et al., 2003) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-224",
"text": "They assess semantic relatedness using vector-based similarity between adjacent sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-225",
"text": "They argue that LSA may be more appropriate for comparing the relative quality of texts; for determining the overall text coherence it may be difficult to set a criterion for the coherence value since it depends on a variety of different factors, such as the size of the text units to be compared."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-226",
"text": "Nevertheless, our results show that ISA, a similar distributional semantic model with dimen-sionality reduction, improves FCE grading accuracy."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-227",
"text": "Barzilay and Lee (2004) implement lexicalized content models that represent global text properties on news articles and narratives using Hidden Markov Models (HMMs)."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-228",
"text": "In the HMM, states correspond to distinct topics, and transitions between states represent the probability of moving from one topic to another."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-229",
"text": "This approach has the advantage of capturing the order in which different topics appear in texts; however, the HMMs are highly domain specific and would probably need retraining for each distinct essay prompt."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-230",
"text": "Soricut and Marcu (2006) use a log-linear model that combines local and global models of coherence and show that it outperforms each of the individual ones on news articles and accident reports."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-231",
"text": "Their global model is based on the document content model proposed by Barzilay and Lee (2004) ."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-232",
"text": "Their local model of discourse coherence is based on the entity-grid (Barzilay and Lapata, 2008) , as well as on the lexicalized IBM model (see Section 4.6 above); we have experimented with both, and showed that they have a minimal effect on grading performance with the FCE dataset."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-233",
"text": "Elsner and Charniak (2008; 2011a) apply a discourse-new classifier and a pronoun coreference system to model coherence (see Section 4) on dialogue and news texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-234",
"text": "They found that combining these models with the entity-grid achieves state-ofthe-art performance."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-235",
"text": "We found that such a combination, as well as the individual models do not perform well for grading the FCE texts."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-236",
"text": "Recently, Elsner and Charniak (2011a) proposed a variation of the entity-grid intended to integrate topical information."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-237",
"text": "They use Latent Dirichlet Allocation (Blei et al., 2003) to learn topic-to-word distributions, and model coherence by generalizing the binary history features of the entity-grid and computing a real-valued feature which represents the similarity between an entity and the subject(s) of the previous sentence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-238",
"text": "Also, Lin et al. (2011) proposed a model that assesses the coherence of a text based on discourse relation transitions."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-239",
"text": "The underlying idea is that coherent texts exhibit measurable preferences for specific intra-and inter-discourse relation ordering."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-240",
"text": "They found their model to be complementary to the entity-grid, as it encodes the notion of preferential ordering of discourse relations, and thus tackles local coherence from a different perspective."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-241",
"text": "Applying the above models to AA on learner texts would also be an interesting avenue for future work."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-242",
"text": "----------------------------------"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-243",
"text": "**CONCLUSION**"
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-244",
"text": "We presented the first systematic analysis of a wide variety of models for assessing discourse coherence on learner data, and evaluated their individual performance as well as their combinations for the AA grading task."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-245",
"text": "We adapted the LOWBOW model for assessing sequential content in texts, and showed evidence supporting our hypothesis that local histograms are useful."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-246",
"text": "We also successfully adapted ISA, an efficient and incremental variant distributional semantic model, to this task."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-247",
"text": "ISA, LOWBOW, the POS IBM model and word length are the best individual features for assessing coherence."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-248",
"text": "A significant improvement over the AA system presented in Yannakoudakis et al. (2011) and the best published result on the FCE dataset was obtained by augmenting the system with an ISA-based local coherence feature."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-249",
"text": "However, it is quite likely that further experimentation with LOWBOW features, given the large range of possible parameter settings, would yield better results too."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-250",
"text": "We also explored the robustness of the ISA model of local coherence on 'outlier' texts and achieved much better correlations with the examiner's grades for these texts in the FCE dataset."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-251",
"text": "This should facilitate development of an automated system to detect essays consisting of high-quality but incoherent sequences of sentences."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-252",
"text": "All our results are specific to ESOL FCE texts and may not generalize to other genres or ESOL attainment levels."
},
{
"sent_id": "f8c992a887a7b7af8b3aa45f72dca7-C001-253",
"text": "Future work should also investigate a wider range of (learner) texts and further coherence models, such as that of Elsner and Charniak (2011a) and Lin et al. (2011) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-10"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-53"
]
],
"cite_sentences": [
"f8c992a887a7b7af8b3aa45f72dca7-C001-10",
"f8c992a887a7b7af8b3aa45f72dca7-C001-53"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-21"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-29"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-41"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-197"
]
],
"cite_sentences": [
"f8c992a887a7b7af8b3aa45f72dca7-C001-21",
"f8c992a887a7b7af8b3aa45f72dca7-C001-29",
"f8c992a887a7b7af8b3aa45f72dca7-C001-41",
"f8c992a887a7b7af8b3aa45f72dca7-C001-197"
]
},
"@USE@": {
"gold_contexts": [
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-22"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-35"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-39"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-46"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-162"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-182"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-188"
]
],
"cite_sentences": [
"f8c992a887a7b7af8b3aa45f72dca7-C001-22",
"f8c992a887a7b7af8b3aa45f72dca7-C001-35",
"f8c992a887a7b7af8b3aa45f72dca7-C001-39",
"f8c992a887a7b7af8b3aa45f72dca7-C001-46",
"f8c992a887a7b7af8b3aa45f72dca7-C001-162",
"f8c992a887a7b7af8b3aa45f72dca7-C001-182",
"f8c992a887a7b7af8b3aa45f72dca7-C001-188"
]
},
"@EXT@": {
"gold_contexts": [
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-26"
]
],
"cite_sentences": [
"f8c992a887a7b7af8b3aa45f72dca7-C001-26"
]
},
"@SIM@": {
"gold_contexts": [
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-46"
]
],
"cite_sentences": [
"f8c992a887a7b7af8b3aa45f72dca7-C001-46"
]
},
"@DIF@": {
"gold_contexts": [
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-70",
"f8c992a887a7b7af8b3aa45f72dca7-C001-71"
],
[
"f8c992a887a7b7af8b3aa45f72dca7-C001-248"
]
],
"cite_sentences": [
"f8c992a887a7b7af8b3aa45f72dca7-C001-71",
"f8c992a887a7b7af8b3aa45f72dca7-C001-248"
]
}
}
},
"ABC_845c66e6dfafc21ab90e5aa5cbf947_4": {
"x": [
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-138",
"text": "First, we construct a distribution P y|label(xa) over word labels y and sample a different label from it."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-139",
"text": "Second, we sample an example uniformly from within the subset with the chosen label."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-140",
"text": "The goal of this method is to speed up training by targeting pairs that violate the margin constraint."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-158",
"text": "Final test set results are given in Table 1."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-141",
"text": "To construct the multinomial PMF P y|label(xa) , we maintain an n \u00d7 n matrix S, where n is the number of unique word labels in training."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-142",
"text": "Each word label corresponds to an integer i \u2208 [1, n] and therefore a row in S. The values in a row of S are considered similarity scores, and we can retrieve the desired PMF for each row by normalizing by its sum."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-143",
"text": "At the start of each epoch, we initialize S with 0's along the diagonal and 1's elsewhere (which reduces to uniform sampling)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-144",
"text": "For each training pair (d cos (x a , x s ), d cos (x a , x d ) ), we update S for both (i, j) = (label(x a ), label(x d )) and (i, j) = (label(x d ), label(x a )):"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-145",
"text": "The PMFs P y|label(xa) are updated after the forward pass of an entire mini-batch."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-146",
"text": "The constant m * enforces a potentially stronger constraint than is used in the l cos hinge loss, in order to promote diverse sampling."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-147",
"text": "In all experiments, we set m * = 0.6."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-148",
"text": "This is a heuristic approach, and it would be interesting to consider various alternatives."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-149",
"text": "Preliminary experiments showed that the non-uniform sampling method outperformed uniform sampling, and in the following we report results with non-uniform sampling."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-150",
"text": "We optimize the Siamese network model using SGD with Nesterov momentum for 15 epochs."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-151",
"text": "The learning rate is initialized to 0.001 and dropped every 3 epochs until no improvement is seen on the dev set."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-152",
"text": "The final model is taken from the epoch with the highest dev set AP."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-153",
"text": "All models were implemented in Torch [38] and used the rnn library of [39] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-154",
"text": "[14] 1061 0.532 \u00b1 0.014 Siamese CNN [14] 1024 0.549 \u00b1 0.011 Classifier LSTM 1061 0.616 \u00b1 0.009 Siamese LSTM 1024 0.671 \u00b1 0.011"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-156",
"text": "**RESULTS**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-2",
"text": "Acoustic word embeddings -fixed-dimensional vector representations of variable-length spoken word segmentshave begun to be considered for tasks such as speech recognition and query-by-example search."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-3",
"text": "Such embeddings can be learned discriminatively so that they are similar for speech segments corresponding to the same word, while being dissimilar for segments corresponding to different words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-4",
"text": "Recent work has found that acoustic word embeddings can outperform dynamic time warping on query-by-example search and related word discrimination tasks."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-5",
"text": "However, the space of embedding models and training approaches is still relatively unexplored."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-6",
"text": "In this paper we present new discriminative embedding models based on recurrent neural networks (RNNs)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-7",
"text": "We consider training losses that have been successful in prior work, in particular a cross entropy loss for word classification and a contrastive loss that explicitly aims to separate same-word and different-word pairs in a \"Siamese network\" training setting."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-8",
"text": "We find that both classifier-based and Siamese RNN embeddings improve over previously reported results on a word discrimination task, with Siamese RNNs outperforming classification models."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-9",
"text": "In addition, we present analyses of the learned embeddings and the effects of variables such as dimensionality and network structure."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-10",
"text": "Index Terms-acoustic word embeddings, recurrent neural networks, Siamese networks"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-13",
"text": "Many speech processing tasks -such as automatic speech recognition or spoken term detection -hinge on associating segments of speech signals with word labels."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-14",
"text": "In most systems developed for such tasks, words are broken down into subword units such as phones, and models are built for the individual units."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-15",
"text": "An alternative, which has been considered by some researchers, is to treat each entire word segment as a single unit, without assigning parts of it to sub-word units."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-16",
"text": "One motivation for the use of whole-word approaches is that they avoid the need for sub-word models."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-17",
"text": "(This research was supported by a Google faculty research award and NSF grant IIS-1321015. The opinions expressed in this work are those of the authors and do not necessarily reflect the views of the funding agency.)"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-19",
"text": "This is helpful since, despite decades of work on sub-word modeling [1, 2] , it still poses significant challenges."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-20",
"text": "For example, speech processing systems are still hampered by differences in conversational pronunciations [3] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-21",
"text": "A second motivation is that considering whole words at once allows us to consider a more flexible set of features and reason over longer time spans."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-22",
"text": "Whole-word approaches typically involve, at some level, template matching."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-23",
"text": "For example, in template-based speech recognition [4, 5] , word scores are computed from dynamic time warping (DTW) distances between an observed segment and training segments of the hypothesized word."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-24",
"text": "In query-by-example search, putative matches are typically found by measuring the DTW distance between the query and segments of the search database [6, 7, 8, 9] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-25",
"text": "In other words, whole-word approaches often boil down to making decisions about whether two segments are examples of the same word or not."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-26",
"text": "An alternative to DTW that has begun to be explored is the use of acoustic word embeddings (AWEs), or vector representations of spoken word segments."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-27",
"text": "AWEs are representations that can be learned from data, ideally such that the embeddings of two segments corresponding to the same word are close, while embeddings of segments corresponding to different words are far apart."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-28",
"text": "Once word segments are represented via fixed-dimensional embeddings, computing distances is as simple as measuring a cosine or Euclidean distance between two vectors."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-29",
"text": "There has been some, thus far limited, work on acoustic word embeddings, focused on a number of embedding models, training approaches, and tasks [10, 11, 12, 13, 14, 15, 16, 17] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-30",
"text": "In this paper we explore new embedding models based on recurrent neural networks (RNNs), applied to a word discrimination task related to query-by-example search."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-31",
"text": "RNNs are a natural model class for acoustic word embeddings, since they can handle arbitrary-length sequences."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-32",
"text": "We compare several types of RNN-based embeddings and analyze their properties."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-33",
"text": "Compared to prior embeddings tested on the same task, our best models achieve sizable improvements in average precision."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-35",
"text": "**RELATED WORK**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-36",
"text": "We next briefly describe the most closely related prior work."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-37",
"text": "Maas et al. [10] and Bengio and Heigold [11] used acoustic word embeddings, based on convolutional neural networks (CNNs), to generate scores for word segments in automatic speech recognition."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-38",
"text": "Maas et al. trained CNNs to predict (continuous-valued) embeddings of the word labels, and used the resulting embeddings to define feature functions in a segmental conditional random field [18] rescoring system."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-39",
"text": "Bengio and Heigold also developed CNN-based embeddings for lattice rescoring, but with a contrastive loss to separate embeddings of a given word from embeddings of other words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-40",
"text": "Levin et al. [12] developed unsupervised embeddings based on representing each word as a vector of DTW distances to a collection of reference word segments."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-41",
"text": "This representation was subsequently used in several applications: a segmental approach for query-by-example search [13] , lexical clustering [19] , and unsupervised speech recognition [20] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-42",
"text": "Voinea et al. [16] developed a representation also based on templates, in their case phone templates, designed to be invariant to specific transformations, and showed their robustness on digit classification."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-43",
"text": "Kamper et al. [14] compared several types of acoustic word embeddings for a word discrimination task related to query-by-example search, finding that embeddings based on convolutional neural networks (CNNs) trained with a contrastive loss outperformed the reference vector approach of Levin et al. [12] as well as several other CNN and DNN embeddings and DTW using several feature types."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-44",
"text": "There have now been a number of approaches compared on this same task and data [12, 21, 22, 23] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-45",
"text": "For a direct comparison with this prior work, in this paper we use the same task and some of the same training losses as Kamper et al., but develop new embedding models based on RNNs."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-46",
"text": "The only prior work of which we are aware using RNNs for acoustic word embeddings is that of Chen et al. [17] and Chung et al. [15] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-47",
"text": "Chen et al. learned a long short-term memory (LSTM) RNN for word classification and used the resulting hidden state vectors as a word embedding in a query-by-example task."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-48",
"text": "The setting was quite specific, however, with a small number of queries and speaker-dependent training."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-49",
"text": "Chung et al. [15] worked in an unsupervised setting and trained single-layer RNN autoencoders to produce embeddings for a word discrimination task."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-50",
"text": "In this paper we focus on the supervised setting, and compare a variety of RNNbased structures trained with different losses."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-52",
"text": "**APPROACH**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-53",
"text": "An acoustic word embedding is a function that takes as input a speech segment corresponding to a word, X = {x_t}_{t=1}^{T}, where each x_t is a vector of frame-level acoustic features, and outputs a fixed-dimensional vector representing the segment, g(X)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-54",
"text": "The basic embedding model structure we use is shown in Fig. 1 ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-55",
"text": "The model consists of a deep RNN with some number S of stacked layers, whose final hidden state vector is passed as input to a set of F fully connected layers; the output of the final fully connected layer is the embedding g(X)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-56",
"text": "The RNN hidden state at each time frame can be viewed as a representation of the input seen thus far, and its value in the last time frame T could itself serve as the final word embedding."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-57",
"text": "The fully connected layers are added to account for the fact that some additional transformation may improve the representation."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-58",
"text": "For example, the hidden state may need to be larger than the desired word embedding dimension, in order to be able to \"remember\" all of the needed intermediate information."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-59",
"text": "Some of that information may not be needed in the final embedding."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-60",
"text": "In addition, the information maintained in the hidden state may not necessarily be discriminative; some additional linear or non-linear transformation may help to learn a discriminative embedding."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-61",
"text": "Within this class of embedding models, we focus on Long Short-Term Memory (LSTM) networks [24] and Gated Recurrent Unit (GRU) networks [25] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-62",
"text": "These are both types of RNNs that include a mechanism for selectively retaining or discarding information at each time frame when updating the hidden state, in order to better utilize long-term context."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-63",
"text": "Both of these RNN variants have been used successfully in speech recognition [26, 27, 28, 29] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-64",
"text": "In an LSTM RNN, at each time frame both the hidden state h_t and an associated \"cell memory\" vector c_t are updated and passed on to the next time frame."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-65",
"text": "In other words, each forward edge in Figure 1 can be viewed as carrying both the cell memory and hidden state vectors."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-66",
"text": "The updates are modulated by the values of several gating vectors, which control the degree to which the cell memory and hidden state are updated in light of new information in the current frame."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-67",
"text": "For a single-layer LSTM network, the updates are as follows:"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-68",
"text": "i_t = \u03c3(W_i [x_t; h_{t-1}] + b_i) (input gate), f_t = \u03c3(W_f [x_t; h_{t-1}] + b_f) (forget gate), o_t = \u03c3(W_o [x_t; h_{t-1}] + b_o) (output gate), c\u0303_t = tanh(W_c [x_t; h_{t-1}] + b_c) (candidate cell memory), c_t = f_t \u2299 c_{t-1} + i_t \u2299 c\u0303_t (cell memory), h_t = o_t \u2299 tanh(c_t) (hidden state), where h_t, c_t, c\u0303_t, i_t, f_t, and o_t are all vectors of the same dimensionality, W_i, W_o, W_f, and W_c are learned weight matrices of the appropriate sizes, b_i, b_o, b_f, and b_c are learned bias vectors, \u03c3(\u00b7) is a componentwise logistic activation, and \u2299 refers to the Hadamard (componentwise) product."
},
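As a concrete sketch (our own illustration, not the authors' code), the single-layer LSTM update above can be written in NumPy; the toy dimensions and the name `lstm_step` are our own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; each weight matrix maps the concatenated
    [x_t; h_prev] to one gate's pre-activation."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell memory
    c_t = f_t * c_prev + i_t * c_tilde      # Hadamard products update the cell
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

# Toy setup: 3-dim acoustic frames, 4-dim hidden state.
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((4, 7)) * 0.1 for k in "ifoc"}
b = {k: np.zeros(4) for k in "ifoc"}
h, c = np.zeros(4), np.zeros(4)
for t in range(5):  # run the recurrence over a short segment
    h, c = lstm_step(rng.standard_normal(3), h, c, W, b)
print(h.shape)  # (4,)
```

The final h after the last frame is the vector that is fed to the fully connected layers in the embedding model.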
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-69",
"text": "Similarly, in a GRU network, at each time step a GRU cell determines what components of old information are retained, overwritten, or modified in light of the next step in the input sequence."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-70",
"text": "The output from a GRU cell is only the hidden state vector."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-71",
"text": "A GRU cell uses a reset gate r t and an update gate u t as described below for a single-layer network:"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-72",
"text": "r_t = \u03c3(W_r [x_t; h_{t-1}] + b_r) (reset gate), u_t = \u03c3(W_u [x_t; h_{t-1}] + b_u) (update gate), h\u0303_t = tanh(W_h [x_t; r_t \u2299 h_{t-1}] + b_h) (candidate hidden state), h_t = u_t \u2299 h\u0303_t + (1 - u_t) \u2299 h_{t-1} (hidden state)"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-73",
"text": "where r_t, u_t, h\u0303_t, and h_t are all of the same dimensionality, W_r, W_u, and W_h are learned weight matrices of the appropriate size, and b_r, b_u, and b_h are learned bias vectors."
},
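For comparison with the LSTM, here is a minimal NumPy sketch of one GRU step under the same toy dimensions (again our own illustration; `gru_step` is a name we introduce):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU time step with reset gate r_t and update gate u_t;
    the only output carried forward is the hidden state."""
    z = np.concatenate([x_t, h_prev])
    r_t = sigmoid(W["r"] @ z + b["r"])                 # reset gate
    u_t = sigmoid(W["u"] @ z + b["u"])                 # update gate
    z_cand = np.concatenate([x_t, r_t * h_prev])       # reset applied to old state
    h_tilde = np.tanh(W["h"] @ z_cand + b["h"])        # candidate hidden state
    return u_t * h_tilde + (1.0 - u_t) * h_prev        # interpolate old and new

rng = np.random.default_rng(1)
W = {k: rng.standard_normal((4, 7)) * 0.1 for k in "ruh"}
b = {k: np.zeros(4) for k in "ruh"}
h = np.zeros(4)
for t in range(5):
    h = gru_step(rng.standard_normal(3), h, W, b)
print(h.shape)  # (4,)
```

Note that, unlike the LSTM, no separate cell memory vector is passed between time frames.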
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-74",
"text": "All of the above equations refer to single-layer networks."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-157",
"text": "Based on development set results, our final embedding models are LSTM networks with 3 stacked layers and 3 fully connected layers, with output dimensionality of 1024 in the case of Siamese networks."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-75",
"text": "In a deep network, with multiple stacked layers, the same update equations are used in each layer, with the state, cell, and gate vectors replaced by layer-specific vectors h^l_t, c^l_t, and so on for layer l. For all but the first layer, the input x_t is replaced by the hidden state vector from the previous layer, h^{l-1}_t."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-76",
"text": "For the fully connected layers, we use rectified linear unit (ReLU) [30] activation, except for the final layer which depends on the form of supervision and loss used in training."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-78",
"text": "**TRAINING**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-79",
"text": "We train the RNN-based embedding models using a set of pre-segmented spoken words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-80",
"text": "We use two main training approaches, inspired by prior work but with some differences in the details."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-81",
"text": "As in [14, 11] , our first approach is to use the word labels of the training segments and train the networks to classify the word."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-82",
"text": "In this case, the final layer of g(X) is a log-softmax layer."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-83",
"text": "Here we are limited to the subset of the training set that has a sufficient number of segments per word to train a good classifier, and the output dimensionality is equal to the number of words (but see [14] for a study of varying the dimensionality in such a classifier-based embedding model by introducing a bottleneck layer)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-84",
"text": "This model is trained end-to-end and is optimized with a cross entropy loss."
},
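The classification loss described above is the standard negative log-probability of the true word under the log-softmax output layer; a small NumPy sketch (our own, with a toy 5-word vocabulary):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # shift for numerical stability
    return z - np.log(np.sum(np.exp(z)))

def cross_entropy(z, word_index):
    """Negative log-probability of the true word label, given the
    final-layer activations z of the embedding network g(X)."""
    return -log_softmax(z)[word_index]

# Toy final-layer activations for a 5-word vocabulary.
z = np.array([2.0, 0.1, -1.0, 0.5, 0.0])
loss_true = cross_entropy(z, 0)    # most-activated word: small loss
loss_wrong = cross_entropy(z, 2)   # low-activated word: large loss
print(loss_true < loss_wrong)  # True
```

In the paper's setting the output dimensionality equals the number of word types in the classifier training subset (1061).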
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-85",
"text": "Although labeled data is necessarily limited, the hope is that the learned models will be useful even when applied to spoken examples of words not previously seen in the training data."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-86",
"text": "For words not seen in training, the embeddings should correspond to some measure of similarity of the word to the training words, measured via the posterior probabilities of the previously seen words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-87",
"text": "In the experiments below, we examine this assumption by analyzing performance on words that appear in the training data compared to those that do not."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-88",
"text": "The second training approach, based on earlier work of Kamper et al. [14] , is to train \"Siamese\" networks [31] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-89",
"text": "In this approach, full supervision is not needed; rather, we use weak supervision in the form of pairs of segments labeled as same or different."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-90",
"text": "The base model remains the same as before-an RNN followed by a set of fully connected layers-but the final layer is no longer a softmax but rather a linear activation layer of arbitrary size."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-91",
"text": "In order to learn the parameters, we simultaneously feed three word segments through three copies of our model (i.e. three networks with shared weights)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-92",
"text": "One input segment is an \"anchor\", x a , the second is another segment with the same word label, x s , and the third is a segment corresponding to a different word label, x d ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-93",
"text": "Then, the network is trained using a \"cos-hinge\" loss: l_{cos hinge} = max{0, m + d_cos(x_a, x_s) - d_cos(x_a, x_d)}, where m is a margin and d_cos(x_1, x_2) is the cosine distance between x_1 and x_2."
},
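A minimal sketch of this triplet loss on already-computed embeddings (our own code; we use the common convention d_cos = (1 - cosine similarity)/2, and a margin m = 0.4 as reported later in the paper):

```python
import numpy as np

def cos_distance(x1, x2):
    """Cosine distance in [0, 1]: half of (1 - cosine similarity)."""
    cos = np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    return (1.0 - cos) / 2.0

def cos_hinge_loss(e_a, e_s, e_d, m=0.4):
    """Triplet 'cos-hinge' loss on embeddings of the anchor, same-word,
    and different-word segments; zero once the same-word pair is closer
    than the different-word pair by at least the margin m."""
    return max(0.0, m + cos_distance(e_a, e_s) - cos_distance(e_a, e_d))

# Well-separated toy triplet: same-word embeddings nearly aligned,
# different-word embedding pointing the opposite way.
e_a = np.array([1.0, 0.0])
e_s = np.array([0.9, 0.1])
e_d = np.array([-1.0, 0.0])
print(cos_hinge_loss(e_a, e_s, e_d))  # margin satisfied -> 0.0
```

Swapping the roles of e_s and e_d makes the margin violated and the loss positive, which is what drives the gradient during training.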
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-94",
"text": "Unlike cross entropy training, here we directly aim to optimize relative (cosine) distance between same and different word pairs."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-95",
"text": "For tasks such as query-by-example search, this training loss better respects our end objective, and can use more data since neither fully labeled data nor any minimum number of examples of each word should be needed."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-97",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-98",
"text": "Our end goal is to improve performance on downstream tasks requiring accurate word discrimination."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-99",
"text": "In this paper we use an intermediate task that more directly tests whether same- and different-word pairs have the expected relationship, and that allows us to compare to a variety of prior work."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-100",
"text": "Specifically, we use the word discrimination task of Carlin et al. [21] , which is similar to a query-by-example task where the word segmentations are known."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-101",
"text": "The evaluation consists of determining, for each pair of evaluation segments, whether they are examples of the same or different words, and measuring performance via the average precision (AP)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-102",
"text": "We do this by measuring the cosine distance between their acoustic word embeddings and declaring them to be the same word if the distance is below a threshold."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-103",
"text": "By sweeping the threshold, we obtain a precision-recall curve from which we compute the AP."
},
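The evaluation loop above can be sketched end to end (our own toy code: four hypothetical segments of two word types stand in for the real embeddings, and AP is computed directly from the ranked pair list):

```python
import numpy as np
from itertools import combinations

def average_precision(labels, scores):
    """AP over ranked pairs: mean of precision@k taken at each
    position holding a same-word (label 1) pair."""
    order = np.argsort(-np.asarray(scores))
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)
    precisions = hits / np.arange(1, len(ranked) + 1)
    return float(np.sum(precisions * ranked) / np.sum(ranked))

def cosine(x1, x2):
    return float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))

# Toy "embeddings": two examples each of two word types.
embs = {"yes_1": [1.0, 0.1], "yes_2": [0.9, 0.2],
        "no_1": [0.1, 1.0], "no_2": [0.0, 0.9]}
labels, scores = [], []
for (ka, va), (kb, vb) in combinations(embs.items(), 2):
    labels.append(int(ka.split("_")[0] == kb.split("_")[0]))  # same word type?
    scores.append(cosine(np.array(va), np.array(vb)))         # similarity score
print(average_precision(labels, scores))  # perfect separation -> 1.0
```

Sweeping a threshold over the similarity scores traces out the precision-recall curve that this AP summarizes.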
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-104",
"text": "The data used for this task is drawn from the Switchboard conversational English corpus [32] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-105",
"text": "The word segments range from 50 to 200 frames in length."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-106",
"text": "The acoustic features in each frame (the input x_t to the word embedding models) are 39-dimensional MFCCs+\u2206+\u2206\u2206. We use the same train, development, and test partitions as in prior work [14, 12] , and the same acoustic features as in [14] , for as direct a comparison as possible."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-107",
"text": "The train set contains approximately 10k example segments, while dev and test each contain approximately 11k segments (corresponding to about 60M pairs for computing the dev/test AP)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-108",
"text": "As in [14] , when training the classification-based embeddings, we use a subset of the training set containing all word types with a minimum of 3 occurrences, reducing the training set size to approximately 9k segments."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-109",
"text": "When training the Siamese networks, the training data consists of all of the same-word pairs in the full training set (approximately 100k pairs)."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-110",
"text": "For each such training pair, we randomly sample a third example belonging to a different word type, as required for the cos-hinge loss."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-112",
"text": "**CLASSIFICATION NETWORK DETAILS**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-113",
"text": "Our classifier-based embeddings use LSTM or GRU networks with 2-4 stacked layers and 1-3 fully connected layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-114",
"text": "The final embedding dimensionality is equal to the number of unique word labels in the training set, which is 1061."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-115",
"text": "The recurrent hidden state dimensionality is fixed at 512 and dropout [33] between stacked recurrent layers is used with probability p = 0.3."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-116",
"text": "The fully connected hidden layer dimensionality is fixed at 1024."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-117",
"text": "Rectified linear unit (ReLU) non-linearities and dropout with p = 0.5 are used between fully-connected layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-118",
"text": "However, between the final recurrent hidden state output and the first fully-connected layer no non-linearity or dropout is applied."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-119",
"text": "These settings were determined through experiments on the development set."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-120",
"text": "The classifier network is trained with a cross entropy loss and optimized using stochastic gradient descent (SGD) with Nesterov momentum [34] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-121",
"text": "The learning rate is initialized at 0.1 and is reduced by a factor of 10 according to the following heuristic: If 99% of the current epoch's average batch loss is greater than the running average of batch losses over the last 3 epochs, this is considered a plateau; if there are 3 consecutive plateau epochs, then the learning rate is reduced."
},
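Read literally, the plateau heuristic above can be sketched as follows (our own interpretation of the stated rule; `lr_schedule` and the toy loss trajectory are assumptions for illustration):

```python
def lr_schedule(epoch_losses, lr0=0.1, patience=3, window=3, factor=0.1):
    """Reduce lr by `factor` after `patience` consecutive plateau epochs.
    An epoch is a plateau if 99% of its average batch loss exceeds the
    running average of the previous `window` epochs' losses."""
    lr, streak = lr0, 0
    history = []
    for loss in epoch_losses:
        if len(history) >= window:
            running = sum(history[-window:]) / window
            streak = streak + 1 if 0.99 * loss > running else 0
            if streak == patience:       # 3 consecutive plateau epochs
                lr *= factor
                streak = 0
        history.append(loss)
    return lr

# Losses stop improving after epoch 2 -> one reduction: 0.1 -> 0.01.
result = lr_schedule([1.0, 0.8, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67])
print(result)
```

In the paper, this schedule continues until reducing the learning rate no longer improves dev set AP.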
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-122",
"text": "Training stops when reducing the learning rate no longer improves dev set AP."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-123",
"text": "Then, the model from the epoch corresponding to the best dev set AP is chosen."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-124",
"text": "Several other optimizers (Adagrad [35] , Adadelta [36] , and Adam [37] ) were explored in initial experiments on the dev set, but all reported results were obtained using SGD with Nesterov momentum."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-126",
"text": "**SIAMESE NETWORK DETAILS**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-127",
"text": "For experiments with Siamese networks, we initialize (warmstart) the networks with the tuned classification network, removing the final log-softmax layer and replacing it with a linear layer of size equal to the desired embedding dimensionality."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-128",
"text": "We explored embeddings with dimensionalities between 8 and 2048."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-129",
"text": "We use a margin of 0.4 in the cos-hinge loss."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-130",
"text": "In training the Siamese networks, each training minibatch consists of 2B triplets."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-131",
"text": "B triplets are of the form (x a , x s , x d ) where x a and x s are examples of the same class (a pair from the 100k same-word pair set) and x d is a randomly sampled example from a different class."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-132",
"text": "Then, for each of these B triplets (x a , x s , x d ) , an additional triplet (x s , x a , x d ) is added to the mini-batch to allow all segments to serve as anchors."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-133",
"text": "This is a slight departure from earlier work [14] , which we found to improve stability in training and performance on the development set."
},
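The mini-batch construction described above might look like the following sketch (our own code with hypothetical segment IDs and labels; negatives are drawn by the uniform sampling method):

```python
import random

def make_minibatch(same_pairs, labels, B, rng=random.Random(0)):
    """Build 2B triplets: for each sampled same-word pair (a, s),
    draw one different-word negative d and emit both (a, s, d) and
    (s, a, d), so each segment of the pair serves as the anchor."""
    ids = list(labels)
    batch = []
    for a, s in rng.sample(same_pairs, B):
        d = rng.choice(ids)
        while labels[d] == labels[a]:    # resample until word type differs
            d = rng.choice(ids)
        batch.append((a, s, d))
        batch.append((s, a, d))          # swapped-anchor triplet
    return batch

# Hypothetical segments: utterance IDs mapped to word-type labels.
labels = {"u1": "yes", "u2": "yes", "u3": "no", "u4": "no", "u5": "okay"}
same_pairs = [("u1", "u2"), ("u3", "u4")]
batch = make_minibatch(same_pairs, labels, B=2)
print(len(batch))  # 2B = 4 triplets
```

Each triplet is then fed through the three weight-sharing copies of the network to compute the cos-hinge loss.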
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-134",
"text": "In preliminary experiments, we compared two methods for choosing the negative examples x d during training, a uniform sampling approach and a non-uniform one."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-135",
"text": "In the case of uniform sampling, we sample x d uniformly at random from the full set of training examples with labels different from x a ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-136",
"text": "This sampling method requires only word-pair supervision."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-137",
"text": "In the case of non-uniform sampling, x d is sampled in two steps."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-159",
"text": "We include a comparison with the best prior results on this task from [14] , as well as the result of using standard DTW on the input MFCCs (reproduced from [14] ) and the best prior result using DTW, obtained with frame features learned with correlated autoencoders [22] ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-160",
"text": "Both classifier and Siamese LSTM embedding models outperform all prior results on this task of which we are aware."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-161",
"text": "We next analyze the effects of model design choices, as well as the learned embeddings themselves."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-162",
"text": "Table 2 shows the effect on development set performance of the number of stacked layers S, the number of fully connected layers F , and LSTM vs. GRU cells, for classifier-based embeddings."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-163",
"text": "The best performance in this experiment is achieved by the LSTM network with S = F = 3."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-164",
"text": "However, performance still seems to be improving with additional layers, suggesting that we may be able to further improve performance by adding even more layers of either type."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-165",
"text": "We nevertheless fixed the model at S = F = 3 in order to allow for more experimentation and analysis within a reasonable time."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-166",
"text": "Table 2 reveals an interesting trend."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-167",
"text": "When only one fully connected layer is used, the GRU networks outperform the LSTMs given a sufficient number of stacked layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-168",
"text": "On the other hand, once we add more fully connected layers, the LSTMs outperform the GRUs."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-169",
"text": "In the first few lines of Table 2, we use 2, 3, and 4 layer stacks of LSTMs and GRUs while holding fixed the number of fully-connected layers at F = 1."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-170",
"text": "There is clear utility in stacking additional layers; however, even with 4 stacked layers the RNNs still underperform the CNN-based embeddings of [14] until we begin adding fully connected layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-172",
"text": "**EFFECT OF MODEL STRUCTURE**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-173",
"text": "After exploring a variety of stacked RNNs, we fixed the stack to 3 layers and varied the number of fully connected layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-174",
"text": "The value of each additional fully connected layer is clearly greater than that of adding stacked layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-175",
"text": "All networks trained with 2 or 3 fully connected layers obtain more than 0.4 AP on the development set, while stacked RNNs with 1 fully connected layer are at around 0.3 AP or less."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-176",
"text": "This may raise the question of whether some simple fully connected model may be all that is needed; however, previous work has shown that this approach is not competitive [14] , and convolutional or recurrent layers are needed to summarize arbitrary-length segments into a fixed-dimensional representation."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-178",
"text": "**EFFECT OF EMBEDDING DIMENSIONALITY**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-179",
"text": "For the Siamese networks, we varied the output embedding dimensionality, as shown in Fig. 2 ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-180",
"text": "This analysis shows that the embeddings learned by the Siamese RNN network are quite robust to reduced dimensionality, outperforming the classifier model for all dimensionalities 32 or higher and outperforming previously reported dev set performance with CNN-based embeddings [14] for all dimensionalities \u2265 16."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-182",
"text": "**EFFECT OF TRAINING VOCABULARY**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-183",
"text": "We might expect the learned embeddings to be more accurate for words that are seen in training than for ones that are not."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-184",
"text": "Fig. 2 measures this effect by showing performance as a function of the number of occurrences of the dev words in the training set."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-185",
"text": "Indeed, both model types are much more successful for in-vocabulary words, and their performance improves the higher the training frequency of the words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-186",
"text": "However, performance increases more quickly for the Siamese network than for the classifier as training frequency increases."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-187",
"text": "This may be due to the fact that, if a word type occurs at least k times in the classifier training set, then it appears in at least 2 \u00d7 (k choose 2) pairs in the Siamese paired training data."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-188",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-189",
"text": "**VISUALIZATION OF EMBEDDINGS**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-190",
"text": "In order to gain a better qualitative understanding of the differences between classifier and Siamese-based embeddings, and of the learned embedding space more generally, we plot a two-dimensional visualization of some of our learned embeddings via t-SNE [41] in Fig. 3 ."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-191",
"text": "For both classifier and Siamese embeddings, there is a marked difference in the quality of clusters formed by embeddings of words that were previously seen vs. previously unseen in training."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-192",
"text": "However, the Siamese network embeddings appear to have better relative distances between word clusters with similar and dissimilar pronunciations."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-193",
"text": "For example, the word programs appears equidistant from problems and problem in the classifier-based embedding space, but in the Siamese embedding space problems falls between problem and programs."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-194",
"text": "Similarly, the cluster for democracy shifts with respect to actually and especially to better respect differences in pronunciation."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-195",
"text": "More study of learned embeddings, using more data and word types, is needed to confirm such patterns in general."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-196",
"text": "Improvements in unseen word embeddings from the classifier embedding space to the Siamese embedding space (such as for democracy, morning, and basketball) are a likely result of optimizing the model for relative distances between words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-197",
"text": "----------------------------------"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-198",
"text": "**CONCLUSION**"
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-199",
"text": "Our main finding is that RNN-based acoustic word embeddings outperform prior approaches, as measured via a word discrimination task related to query-by-example search."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-200",
"text": "Our best results are obtained with deep LSTM RNNs with a combination of several stacked layers and several fully connected layers, optimized with a contrastive Siamese loss."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-201",
"text": "Siamese networks have the benefit that, for any given training data set, they are effectively trained on a much larger set, in the sense that they measure a loss and gradient for every possible pair of data points."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-202",
"text": "Our experiments suggest that the models could still be improved with additional layers."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-203",
"text": "In addition, we have found that, for the purposes of acoustic word embeddings, fully connected layers are very important and have a more significant effect per layer than stacked layers, particularly when trained with the cross entropy loss function."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-204",
"text": "These experiments represent an initial exploration of sequential neural models for acoustic word embeddings."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-205",
"text": "There are a number of directions for further work."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-206",
"text": "For example, while our analyses suggest that Siamese networks are better than classifier-based models at embedding previously unseen words, our best embeddings are still much poorer for unseen words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-207",
"text": "Improvements in this direction may come from larger training sets, or may require new models that better model the shared structure between words."
},
{
"sent_id": "845c66e6dfafc21ab90e5aa5cbf947-C001-208",
"text": "Other directions for future work include additional forms of supervision and training, as well as application to downstream tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-29"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-43"
]
],
"cite_sentences": [
"845c66e6dfafc21ab90e5aa5cbf947-C001-29",
"845c66e6dfafc21ab90e5aa5cbf947-C001-43"
]
},
"@SIM@": {
"gold_contexts": [
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-81"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-108"
]
],
"cite_sentences": [
"845c66e6dfafc21ab90e5aa5cbf947-C001-81",
"845c66e6dfafc21ab90e5aa5cbf947-C001-108"
]
},
"@USE@": {
"gold_contexts": [
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-81"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-88"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-106"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-108"
]
],
"cite_sentences": [
"845c66e6dfafc21ab90e5aa5cbf947-C001-81",
"845c66e6dfafc21ab90e5aa5cbf947-C001-88",
"845c66e6dfafc21ab90e5aa5cbf947-C001-106",
"845c66e6dfafc21ab90e5aa5cbf947-C001-108"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-83"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-159"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-176"
]
],
"cite_sentences": [
"845c66e6dfafc21ab90e5aa5cbf947-C001-83",
"845c66e6dfafc21ab90e5aa5cbf947-C001-159",
"845c66e6dfafc21ab90e5aa5cbf947-C001-176"
]
},
"@DIF@": {
"gold_contexts": [
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-133"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-170"
],
[
"845c66e6dfafc21ab90e5aa5cbf947-C001-180"
]
],
"cite_sentences": [
"845c66e6dfafc21ab90e5aa5cbf947-C001-133",
"845c66e6dfafc21ab90e5aa5cbf947-C001-170",
"845c66e6dfafc21ab90e5aa5cbf947-C001-180"
]
}
}
},
"ABC_cce566b9111abdc7ab7576662922dd_4": {
"x": [
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-2",
"text": "This paper presents a word support model (WSM)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-3",
"text": "The WSM can effectively perform homophone selection and syllable-word segmentation to improve Chinese input systems."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-4",
"text": "The experimental results show that: (1) the WSM is able to achieve tonal (syllables input with four tones) and toneless (syllables input without four tones) syllable-to-word (STW) accuracies of 99% and 92%, respectively, among the converted words; and (2) while applying the WSM as an adaptation processing, together with the Microsoft Input Method Editor 2003 (MSIME) and an optimized bigram model, the average tonal and toneless STW improvements are 37% and 35%, respectively."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-7",
"text": "According to (Becker, 1985; Huang, 1985; Gu et al., 1991; Chung, 1993; Kuo, 1995; Fu et al., 1996; Lee et al., 1997; Hsu et al., 1999; Chen et al., 2000; Tsai and Hsu, 2002; Gao et al., 2002; Lee, 2003; Tsai, 2005) , the approaches of Chinese input methods (i.e. Chinese input systems) can be classified into two types: (1) keyboard based approach: including phonetic and pinyin based (Chang et al., 1991; Hsu et al., 1993; Hsu, 1994; Hsu et al., 1999; Kuo, 1995; Lua and Gan, 1992) , arbitrary codes based (Fan et al., 1988) and structure scheme based (Huang, 1985) ; and (2) non-keyboard based approach: including optical character recognition (OCR) (Chung, 1993) , online handwriting and speech recognition (Fu et al., 1996; Chen et al., 2000) ."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-8",
"text": "Currently, the most popular Chinese input system is the phonetic- and pinyin-based approach, because Chinese people are taught to write the phonetic and pinyin syllables of each Chinese character in primary school."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-9",
"text": "In Chinese, each Chinese word can be a mono-syllabic word, such as \"\u9f20(mouse)\", a bisyllabic word, such as \"\u888b\u9f20(kangaroo)\", or a multi-syllabic word, such as \"\u7c73\u8001\u9f20(Mickey mouse).\" The corresponding phonetic and pinyin syllables of each Chinese word are called syllable-words; for example, \"dai4 shu3\" is the pinyin syllable-word of \"\u888b\u9f20(kangaroo).\" According to our computation, the {minimum, maximum, average} words per each distinct mono-syllable-word and poly-syllable-word (including bisyllable-word and multi-syllable-word) in the CKIP dictionary (Chinese Knowledge Information Processing Group, 1995) are {1, 28, 2.8} and {1, 7, 1.1}, respectively."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-10",
"text": "The CKIP dictionary is one of most commonly-used Chinese dictionaries in the research field of Chinese natural language processing (NLP)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-11",
"text": "Since the size of the problem space for syllable-to-word (STW) conversion is much smaller than that of syllable-to-character (STC) conversion, most pinyin-based Chinese input systems (Hsu, 1994; Hsu et al., 1999; Tsai and Hsu, 2002; Gao et al., 2002; Microsoft Research Center in Beijing; Tsai, 2005) focus on STW conversion."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-12",
"text": "On the other hand, STW conversion is the main task of Chinese Language Processing in typical Chinese speech recognition systems (Fu et al., 1996; Lee et al., 1993; Chien et al., 1993; Su et al., 1992) ."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-13",
"text": "As per (Chung, 1993; Fong and Chung, 1994; Tsai and Hsu, 2002; Gao et al., 2002; Lee, 2003; Tsai, 2005), homophone selection and syllable-word segmentation are two critical problems in developing a Chinese input system."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-14",
"text": "Incorrect homophone selection and syllable-word segmentation will directly influence the STW conversion accuracy."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-36",
"text": "**AUTO-GENERATION OF WP DATABASE**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-15",
"text": "Conventionally, there are two approaches to resolve the two critical problems: (1) linguistic approach: based on syntax parsing, semantic template matching and contextual information (Hsu, 1994; Fu et al., 1996; Hsu et al., 1999; Kuo, 1995; Tsai and Hsu, 2002) ; and (2) statistical approach: based on the n-gram models where n is usually 2, i.e. bigram model (Lin and Tsai, 1987; Gu et al., 1991; Fu et al., 1996; Ho et al., 1997; Sproat, 1990; Gao et al., 2002; Lee 2003) ."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-16",
"text": "From the studies (Hsu, 1994; Tsai and Hsu, 2002; Gao et al., 2002; Lee, 2003; Tsai, 2005), the linguistic approach requires considerable effort in designing effective syntax rules, semantic templates or contextual information, but it is more user-friendly than the statistical approach in understanding why such a system makes a mistake."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-17",
"text": "The statistical language model (SLM) used in the statistical approach requires less effort and has been widely adopted in commercial Chinese input systems."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-18",
"text": "In our previous work (Tsai, 2005), a word-pair (WP) identifier was proposed and shown to be a simple and effective way to improve Chinese input systems, providing tonal and toneless STW accuracies of 98.5% and 90.7%, respectively, on the identified poly-syllabic words."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-19",
"text": "In (Tsai, 2005) , we have shown that the WP identifier can be used to reduce the over weighting and corpus sparseness problems of bigram models and achieve better STW accuracy to improve Chinese input systems."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-20",
"text": "As per our computation, poly-syllabic words cover about 70% characters of Chinese sentences."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-21",
"text": "Since the identified character ratio of the WP identifier (Tsai, 2005) is about 55%, there is still about 15% room for improvement."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-22",
"text": "The objective of this study is to illustrate a word support model (WSM) that is able to improve our WP-identifier by achieving better identified character ratio and STW accuracy on the identified poly-syllabic words with the same word-pair database."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-23",
"text": "We conduct STW experiments to show that the tonal and toneless STW accuracies of a commercial input product (Microsoft Input Method Editor 2003, MSIME) and an optimized bigram model, BiGram (Tsai, 2005), can both be improved by our WSM, achieving better STW improvements than those of these systems with the WP identifier."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-24",
"text": "The remainder of this paper is arranged as follows."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-25",
"text": "In Section 2, we present an auto word-pair (AUTO-WP) generation method used to generate the WP database."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-26",
"text": "Then, we develop a word support model with the WP database to perform STW conversion on identifying words from the Chinese syllables."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-27",
"text": "In Section 3, we report and analyze our STW experimental results."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-28",
"text": "Finally, in Section 4, we give our conclusions and suggest some future research directions."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-30",
"text": "**DEVELOPMENT OF WORD SUPPORT MODEL**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-31",
"text": "The system dictionary of our WSM is comprised of 82,531 Chinese words taken from the CKIP dictionary and 15,946 unknown words auto-found in the UDN2001 corpus by a Chinese Word Auto-Confirmation (CWAC) system (Tsai et al., 2003)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-32",
"text": "The UDN2001 corpus is a collection of 4,539,624 Chinese sentences extracted from the whole 2001 UDN (United Daily News, 2001) Website in Taiwan (Tsai and Hsu, 2002)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-33",
"text": "The system dictionary provides the knowledge of words and their corresponding pinyin syllable-words."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-34",
"text": "The pinyin syllable-words were translated by phoneme-to-pinyin mappings, such as \"\u3129\u02ca\"-to-\"ju2.\""
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-37",
"text": "Following (Tsai, 2005), the three steps of auto-generating word-pairs (AUTO-WP) for a given Chinese sentence are as below (the details of AUTO-WP can be found in (Tsai, 2005)): Step 1. Get forward and backward word segmentations: Generate two types of word segmentations for a given Chinese sentence by the forward maximum matching (FMM) and backward maximum matching (BMM) techniques (Chen et al., 1986; Tsai et al., 2004) with the system dictionary."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-38",
"text": "Step 2. Get initial WP set: Extract all the combinations of word-pairs from the FMM and the BMM segmentations of Step 1 to be the initial WP set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-39",
"text": "Step 3. Get final WP set: Select the word-pairs comprised of two poly-syllabic words from the initial WP set into the final WP set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-40",
"text": "For the final WP set, if the word-pair is not found in the WP data-base, insert it into the WP database and set its frequency to 1; otherwise, increase its frequency by 1."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-42",
"text": "**WORD SUPPORT MODEL**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-43",
"text": "The four steps of our WSM applied to identify words for given Chinese syllables are as follows:"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-44",
"text": "Step 1."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-45",
"text": "Input tonal or toneless syllables."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-46",
"text": "Step 2. Generate all possible word-pairs comprised of two poly-syllabic words for the input syllables to be the WP set of Step 3."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-47",
"text": "Step 3."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-48",
"text": "Select out the word-pairs that match a word-pair in the WP database to be the WP set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-49",
"text": "Then, compute the word support degree (WS degree) for each distinct word of the WP set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-50",
"text": "The WS degree of a word is defined to be the total number of times the word is found in the WP set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-51",
"text": "Finally, arrange the words and their corresponding WS degrees into the WSM set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-52",
"text": "If the number of words with the same syllable-word and WS degree is greater than one, one of them is randomly selected into the WSM set."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-53",
"text": "Step 4. Replace the input syllables with the words of the WSM set, in descending order of WS degree, to produce a WSM-sentence."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-54",
"text": "If no words can be identified in the input syllables, a NULL WSM-sentence is produced."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-55",
"text": "Table 1 is a step-by-step example showing the four steps of applying our WSM to the Chinese syllables \"sui1 ran2 fu3 shi2 jin4 shi4 sui4 yue4 xi1 xu1(\u96d6\u7136\u4fef\u62fe\u76e1\u662f\u6b72\u6708\u550f\u5653).\" For these input syllables, we have the WSM-sentence \"\u96d6\u7136\u4fef\u62fe\u76e1\u662f\u6b72\u6708\u550f\u5653."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-56",
"text": "\" For the same syllables, the outputs of the MSIME, the BiGram and the WP identifier are \"\u96d6\u7136\u8150\u8755\u9032\u58eb\u6b72\u6708\u550f\u5653,\" \"\u96d6\u7136\u4fef\u62fe\u76e1\u662f\u6b72\u6708\u550f\u5653\" and \"\u96d6\u7136 fu3 shi2 \u8fd1\u8996 sui4 yue4 xi1 xu1.\""
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-58",
"text": "**STW EXPERIMENTS**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-59",
"text": "To evaluate the STW performance of our WSM, we define the STW accuracy, identified character ratio (ICR) and STW improvement, by the following equations:"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-60",
"text": "STW accuracy = # of correct characters / # of total characters."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-61",
"text": "Identified character ratio (ICR) = # of characters of identified WP / # of total characters in testing sentences."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-62",
"text": "STW improvement (I) (i.e. STW error reduction rate) = (accuracy of STW system with WP - accuracy of STW system) / (1 - accuracy of STW system)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-63",
"text": "Step # Results"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-64",
"text": "Step.1 sui1 ran2 fu3 shi2 jin4 shi4 sui4 yue4 xi1 xu1"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-65",
"text": "Step.2 WP set (word-pair / word-pair frequency) = {\u96d6\u7136-\u8fd1\u8996/6 (key WP for WP identifier), \u4fef\u62fe-\u76e1\u662f/4, \u96d6\u7136-\u6b72\u6708/4, \u96d6\u7136-\u76e1\u662f/3, \u4fef\u62fe-\u550f\u5653/2, \u96d6\u7136-\u4fef\u62fe/2, \u4fef\u62fe-\u6b72\u6708/2, \u76e1\u662f-\u550f\u5653/2, \u76e1\u662f-\u6b72\u6708/2, \u96d6\u7136-\u550f\u5653/2, \u6b72\u6708-\u550f\u5653/2}"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-66",
"text": "Step.3 WSM set (word / WS degree) = {\u96d6\u7136/5, \u4fef\u62fe/4, \u76e1\u662f/4, \u6b72\u6708/4, \u550f\u5653/4, \u8fd1\u8996/1}"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-67",
"text": "Replaced word set = \u96d6\u7136(sui1 ran2), \u4fef\u62fe(fu3 shi2), \u76e1\u662f(jin4 shi4), \u6b72\u6708(sui4 yue4), \u550f\u5653(xi1 xu1)"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-68",
"text": "Step.4 WSM-sentence: Table 1 ."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-69",
"text": "An illustration of a WSM-sentence for the Chinese syllables \"sui1 ran2 fu3 shi2 jin4 shi4 sui4 yue4 xi1 xu1(\u96d6\u7136\u4fef\u62fe\u76e1\u662f\u6b72\u6708\u550f\u5653).\""
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-71",
"text": "**\u96d6\u7136\u4fef\u62fe\u76e1\u662f\u6b72\u6708\u550f\u5653**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-73",
"text": "**BACKGROUND**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-74",
"text": "To conduct the STW experiments, we first use the inverse phoneme-to-character (PTC) translator provided in the GOING system to convert testing sentences into their corresponding syllables."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-75",
"text": "All erroneous PTC translations of GOING were corrected by human post-editing."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-76",
"text": "We conducted the STW experiment in a progressive manner."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-77",
"text": "The results and analysis of the experiments are described in Subsections 3.2 and 3.3."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-79",
"text": "**STW EXPERIMENT RESULTS OF THE WSM**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-80",
"text": "The purpose of this experiment is to demonstrate the tonal and toneless STW accuracies among the identified words by using the WSM with the system WP database."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-81",
"text": "The comparative system is the WP identifier (Tsai, 2005) ."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-82",
"text": "Table 2 shows the experimental results."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-83",
"text": "The WP database and system dictionary of the WP identifier are the same as those of the WSM."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-84",
"text": "From Table 2"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-86",
"text": "**STW EXPERIMENT RESULTS OF CHINESE INPUT SYSTEMS WITH THE WSM**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-87",
"text": "We selected Microsoft Input Method Editor 2003 for Traditional Chinese (MSIME) as our experimental commercial Chinese input system."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-88",
"text": "In addition, following (Tsai, 2005) , an optimized bigram model called BiGram was developed."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-89",
"text": "The BiGram STW system is a bigram-based model developed with SRILM (Stolcke, 2002) using Good-Turing back-off smoothing (Manning and Schuetze, 1999), together with forward and backward longest syllable-word first strategies (Chen et al., 1986; Tsai et al., 2004)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-90",
"text": "The system dictionary of the BiGram is the same as that of the WP identifier and the WSM."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-91",
"text": "Table 3a compares the results of the MSIME, the MSIME with the WP identifier and the MSIME with the WSM on the closed and open test sentences."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-92",
"text": "(a) STW accuracies and improvements of the words identified by the MSIME (Ms) with the WP identifier; (b) STW accuracies and improvements of the words identified by the MSIME (Ms) with the WSM. Table 3a."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-93",
"text": "The results of tonal and toneless STW experiments for the MSIME, the MSIME with the WP identifier and with the WSM."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-94",
"text": "From Table 3a , the tonal and toneless STW improvements of the MSIME by using the WP identifier and the WSM are (18.9%, 10.1%) and (25.6%, 16.6%), respectively."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-95",
"text": "From Table 3b , the tonal and toneless STW improvements of the BiGram by using the WP identifier and the WSM are (8.6%, 11.9%) and (17.1%, 22.0%), respectively."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-96",
"text": "(Note that, as per (Tsai, 2005) , the differences between the tonal and toneless STW accuracies of the BiGram and the TriGram are less than 0.3%)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-97",
"text": "Table 3c shows the results of the MSIME and the BiGram using the WSM as an adaptation processing with both the system and user WP databases."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-98",
"text": "From Table 3c, the average tonal and toneless STW improvements of the MSIME and the BiGram using the WSM as an adaptation processing are 37.2% and 34.6%, respectively."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-99",
"text": "Table 3c ."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-100",
"text": "The results of tonal and toneless STW experiments for the MSIME and the BiGram using the WSM as an adaptation processing."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-102",
"text": "**MS+WSM (ICR**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-103",
"text": "To sum up the above experimental results, we conclude that the WSM can achieve a better STW accuracy than the MSIME, the BiGram and the WP identifier on the identified-words portion."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-104",
"text": "(Appendix A presents two cases of STW results that were obtained from this study)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-106",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-107",
"text": "Examining Table 4, we have three observations:"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-108",
"text": "(1) The coverage of unknown word problem for tonal and toneless STW conversions is similar."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-109",
"text": "In most Chinese input systems, unknown word extraction is not specifically a STW problem; therefore, it is usually taken care of through online and offline manual editing processing (Hsu et al., 1999)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-110",
"text": "The results of Table 4 show that most STW errors are caused by the ISWS and HS problems, not the UW problem."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-111",
"text": "This observation is similar to that of our previous work (Tsai, 2005)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-112",
"text": "(2) The major problem of error conversions in tonal and toneless STW systems is different."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-113",
"text": "This observation is similar to that of (Tsai, 2005)."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-114",
"text": "From Table 4, the major improving targets of tonal STW performance are the HS errors, because more than 50% of tonal STW errors are caused by the HS problem."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-115",
"text": "On the other hand, since the ISWS errors cover more than 50% toneless STW errors, the major targets of improving toneless STW performance are the ISWS errors."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-116",
"text": "This observation should answer the question \"Why is the STW performance of Chinese input systems (MSIME and BiGram) with the WSM better than that of these systems with the WP identifier?\""
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-117",
"text": "To sum up the above three observations and all the STW experimental results, we conclude that the WSM achieves better STW improvements than the WP identifier because: (1) the identified character ratio of the WSM is 15% greater than that of the WP identifier with the same WP database and dictionary; and (2) the WSM not only maintains the ratio of the three STW error types but also reduces the total number of error characters of converted words compared with the WP identifier."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-119",
"text": "**CONCLUSIONS AND FUTURE DIRECTIONS**"
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-120",
"text": "In this paper, we present a word support model (WSM) to improve the WP identifier (Tsai, 2005) and support the Chinese Language Processing on the STW conversion problem."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-121",
"text": "All of the WP data can be generated fully automatically by applying the AUTO-WP on the given corpus."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-122",
"text": "We are encouraged by the fact that the WSM with WP knowledge is able to achieve state-of-the-art tonal and toneless STW accuracies of 99% and 92%, respectively, for the identified poly-syllabic words."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-123",
"text": "The WSM can be easily integrated into existing Chinese input systems by identifying words as a post-processing step."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-124",
"text": "Our experimental results show that, by applying the WSM as an adaptation processing together with the MSIME (a trigram-like model) and the BiGram (an optimized bigram model), the average tonal and toneless STW improvements of the two Chinese input systems are 37% and 35%, respectively."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-125",
"text": "Currently, our WSM with the mixed WP database comprised of UDN2001 and AS WP database is able to achieve more than 98% identified character ratios of poly-syllabic words in tonal and toneless STW conversions among the UDN2001 and the AS corpus."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-126",
"text": "Although there is room for improvement, we believe it would not produce a noticeable effect as far as the STW accuracy of poly-syllabic words is concerned."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-127",
"text": "We will continue to improve our WSM to cover more characters of the UDN2001 and AS corpora by using word-pairs that contain at least one mono-syllabic word, such as \"\u6211\u5011 (we)-\u662f(are)\"."
},
{
"sent_id": "cce566b9111abdc7ab7576662922dd-C001-128",
"text": "In other directions, we will extend the WSM to other Chinese NLP research topics, especially word segmentation, main verb identification, and Subject-Verb-Object (SVO) auto-construction."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"cce566b9111abdc7ab7576662922dd-C001-7"
],
[
"cce566b9111abdc7ab7576662922dd-C001-11"
],
[
"cce566b9111abdc7ab7576662922dd-C001-13"
],
[
"cce566b9111abdc7ab7576662922dd-C001-16"
],
[
"cce566b9111abdc7ab7576662922dd-C001-18"
],
[
"cce566b9111abdc7ab7576662922dd-C001-19"
],
[
"cce566b9111abdc7ab7576662922dd-C001-96"
]
],
"cite_sentences": [
"cce566b9111abdc7ab7576662922dd-C001-7",
"cce566b9111abdc7ab7576662922dd-C001-11",
"cce566b9111abdc7ab7576662922dd-C001-13",
"cce566b9111abdc7ab7576662922dd-C001-16",
"cce566b9111abdc7ab7576662922dd-C001-18",
"cce566b9111abdc7ab7576662922dd-C001-19",
"cce566b9111abdc7ab7576662922dd-C001-96"
]
},
"@MOT@": {
"gold_contexts": [
[
"cce566b9111abdc7ab7576662922dd-C001-21",
"cce566b9111abdc7ab7576662922dd-C001-22"
]
],
"cite_sentences": [
"cce566b9111abdc7ab7576662922dd-C001-21"
]
},
"@USE@": {
"gold_contexts": [
[
"cce566b9111abdc7ab7576662922dd-C001-23"
],
[
"cce566b9111abdc7ab7576662922dd-C001-37"
],
[
"cce566b9111abdc7ab7576662922dd-C001-80",
"cce566b9111abdc7ab7576662922dd-C001-81"
],
[
"cce566b9111abdc7ab7576662922dd-C001-88"
]
],
"cite_sentences": [
"cce566b9111abdc7ab7576662922dd-C001-23",
"cce566b9111abdc7ab7576662922dd-C001-37",
"cce566b9111abdc7ab7576662922dd-C001-81",
"cce566b9111abdc7ab7576662922dd-C001-88"
]
},
"@SIM@": {
"gold_contexts": [
[
"cce566b9111abdc7ab7576662922dd-C001-111"
],
[
"cce566b9111abdc7ab7576662922dd-C001-113"
]
],
"cite_sentences": [
"cce566b9111abdc7ab7576662922dd-C001-111",
"cce566b9111abdc7ab7576662922dd-C001-113"
]
},
"@EXT@": {
"gold_contexts": [
[
"cce566b9111abdc7ab7576662922dd-C001-120"
]
],
"cite_sentences": [
"cce566b9111abdc7ab7576662922dd-C001-120"
]
}
}
},
"ABC_9795a839cb79ed971de4c325e01e74_4": {
"x": [
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-104",
"text": "Players."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-177",
"text": "**RESULTS**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-179",
"text": "**ALICE SL VS. ALICE RL**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-2",
"text": "As AI continues to advance, human-AI teams are inevitable."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-3",
"text": "However, progress in AI is routinely measured in isolation, without a human in the loop."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-4",
"text": "It is crucial to benchmark progress in AI, not just in isolation, but also in terms of how it translates to helping humans perform certain tasks, i.e., the performance of human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-5",
"text": "In this work, we design a cooperative game -GuessWhich -to measure human-AI team performance in the specific context of the AI being a visual conversational agent."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-6",
"text": "GuessWhich involves live interaction between the human and the AI."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-7",
"text": "The AI, which we call ALICE, is provided an image which is unseen by the human."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-8",
"text": "Following a brief description of the image, the human questions ALICE about this secret image to identify it from a fixed pool of images."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-9",
"text": "We measure performance of the human-ALICE team by the number of guesses it takes the human to correctly identify the secret image after a fixed number of dialog rounds with ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-10",
"text": "We compare performance of the human-ALICE teams for two versions of ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-11",
"text": "Our human studies suggest a counterintuitive trend -that while AI literature shows that one version outperforms the other when paired with an AI questioner bot, we find that this improvement in AI-AI performance does not translate to improved human-AI performance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-12",
"text": "This suggests a mismatch between benchmarking of AI in isolation and in the context of human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-13",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-14",
"text": "**INTRODUCTION**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-15",
"text": "As Artificial Intelligence (AI) systems become increasingly accurate and interactive (e.g. Alexa, Siri, Cortana, Google Assistant), human-AI teams are inevitably going to become more commonplace."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-16",
"text": "To be an effective teammate, an AI must overcome the challenges involved with adapting to humans; however, progress in AI is routinely measured in isolation, without a human in the loop."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-17",
"text": "In this work, we focus specifically on the evaluation of visual conversational agents and develop a human computation game to benchmark their performance as members of human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-18",
"text": "Figure 1: A human and an AI (a visual conversational agent called ALICE) play the proposed GuessWhich game."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-19",
"text": "At the start of the game (top), ALICE is provided an image (shown above ALICE) which is unknown to the human."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-20",
"text": "Both ALICE and the human are then provided a brief description of the image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-21",
"text": "The human then attempts to identify the secret image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-22",
"text": "In each subsequent round of dialog, the human asks a question about the unknown image, receives an answer from ALICE, and makes a best guess of the secret image from a fixed pool of images."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-23",
"text": "After 9 rounds of dialog, the human makes consecutive guesses until the secret image is identified."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-24",
"text": "The fewer guesses the human needs to identify the secret image, the better the human-AI team performance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-25",
"text": "Visual conversational agents (Das et al. 2017a; Das et al. 2017b) are AI agents trained to understand and communicate about the contents of a scene in natural language."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-26",
"text": "For example, in Fig. 1 , the visual conversational agent (shown on the right) answers questions about a scene while inferring context from the dialog history -Human: \"What is he doing?\" Agent: \"Playing frisbee\"."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-27",
"text": "These agents are typically trained to mimic large corpora of human-human dialogs and are evaluated automatically on how well they retrieve actual human responses (ground truth) in novel dialogs."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-28",
"text": "Recent work has evaluated these models more pragmatically by evaluating how well pairs of visual conversational agents perform on goal-based conversational tasks rather than response retrieval from fixed dialogs."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-29",
"text": "Specifically, (Das et al. 2017b ) train two visual conversational agents -a questioning bot QBOT, and an answering bot ABOT -for an image-guessing task."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-30",
"text": "Starting from a description of the scene, QBOT and ABOT converse over multiple rounds of questions (QBOT) and answers (ABOT) in order to improve QBOT's understanding of a secret image known only to ABOT."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-31",
"text": "After a fixed number of rounds, QBOT must guess the secret image from a large pool and both QBOT and ABOT are evaluated based on this guess."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-32",
"text": "(Das et al. 2017b ) compare supervised baseline models with QBOT-ABOT teams trained through reinforcement learning based self-talk on this image-guessing task."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-33",
"text": "They find that the AI-AI teams improve significantly at guessing the correct image after self-talk updates compared to the supervised pretraining."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-34",
"text": "While these results indicate that the self-talk fine-tuned agents are better visual conversational agents, crucially, it remains unclear if these agents are indeed better at this task when interacting with humans."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-35",
"text": "GuessWhich."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-36",
"text": "In this work, we propose to evaluate if and how this progress in AI-AI evaluation translates to the performance of human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-37",
"text": "Inspired by the popular GuessWhat or 20-Questions game, we design a human computation game -GuessWhich -which requires collaboration between human and visual conversational AI agents."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-38",
"text": "Mirroring the setting of (Das et al. 2017b) , GuessWhich is an image-guessing game that consists of 2 participants -questioner and answerer."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-39",
"text": "At the start of the game, the answerer is provided an image that is unknown to the questioner and both questioner and answerer are given a brief description of the image content."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-40",
"text": "The questioner interacts with the answerer for a fixed number of rounds of question-answer (dialog) to identify the secret image from a fixed pool of images (see Fig. 1 )."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-41",
"text": "We evaluate human-AI team performance in GuessWhich, for the setting where the questioner is a human and the answerer is an AI (that we denote ALICE)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-42",
"text": "Specifically, we evaluate two versions of ALICE for GuessWhich:"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-43",
"text": "1. ALICE SL, which is trained in a supervised manner on the Visual Dialog dataset (Das et al. 2017a) to mimic the answers given by humans when engaged in a conversation with other humans about an image, and"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-44",
"text": "2. ALICE RL, which is pre-trained with supervised learning and fine-tuned via reinforcement learning for an image-guessing task as in (Das et al. 2017b)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-45",
"text": "It is important to appreciate the difficulty and sensitivity of the GuessWhich game as an evaluation tool -agents have to understand human questions and respond with accurate, consistent, fluent and informative answers for the human-AI team to do well."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-46",
"text": "Furthermore, they have to be robust to their own mistakes, i.e., if an agent makes an error at a particular round, that error is now part of its conversation history, and it must be able to correct itself rather than be consistently inaccurate."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-47",
"text": "Similarly, human players must also learn to adapt to ALICE's sometimes noisy and inaccurate responses."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-48",
"text": "At its core, GuessWhich is a game-with-a-purpose (GWAP) that leverages human computation to evaluate visual conversational agents."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-49",
"text": "Traditionally, GWAPs (Von Ahn and Dabbish 2008) have focused on human-human collaboration, i.e. collecting data by making humans play games to label images (Von Ahn and Dabbish 2004), music (Law et al. 2007), and movies (Michelucci 2013)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-50",
"text": "We extend this to human-AI teams and to the best of our knowledge, our work is the first to evaluate visual conversational agents in an interactive setting where humans are continuously engaging with agents to succeed at a cooperative game."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-51",
"text": "Contributions."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-52",
"text": "More concretely, we make the following contributions in this work:"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-53",
"text": "\u2022 We design an interactive image-guessing game (GuessWhich) for evaluating human-AI team performance in the specific context of the AIs being visual conversational agents."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-54",
"text": "GuessWhich pairs humans with ALICE, an AI capable of answering a sequence of questions about images."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-55",
"text": "ALICE is assigned a secret image and answers questions asked about that image from a human for 9 rounds to help them identify the secret image (Sec. 4)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-56",
"text": "\u2022 We evaluate human-AI team performance on this game for both supervised learning (SL) and reinforcement learning (RL) versions of ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-57",
"text": "Our main experimental finding is that despite significant differences between SL and RL agents reported in previous work (Das et al. 2017b) , we find no significant difference in performance between ALICE SL or ALICE RL when paired with human partners (Sec. 6.1)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-58",
"text": "This suggests that while self-talk and RL are interesting directions to pursue for building better visual conversational agents, there appears to be a disconnect between AI-AI and human-AI evaluations -progress on the former does not seem predictive of progress on the latter."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-59",
"text": "This is an important finding to guide future research."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-61",
"text": "**RELATED WORK**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-62",
"text": "Given that our goal is to evaluate visual conversational agents through a human computation game, we draw connections to relevant work on visual conversational agents, human computation games, and dialog evaluation below."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-63",
"text": "Visual Conversational Agents."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-64",
"text": "Our AI agents are visual conversational models, which have recently emerged as a popular research area in visually-grounded language modeling (Das et al. 2017a; Das et al. 2017b). (Das et al. 2017a) introduced the task of Visual Dialog and collected the VisDial dataset by pairing subjects on Amazon Mechanical Turk (AMT) to chat about an image (with assigned roles of questioner and answerer)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-65",
"text": "(Das et al. 2017b ) pre-trained questioner and answerer agents on this VisDial dataset via supervised learning and fine-tuned them via self-talk (reinforcement learning), observing that RL-fine-tuned QBOT-ABOT are better at image-guessing after interacting with each other."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-66",
"text": "However, (Aras et al. 2010; Chamberlain, Poesio, and Kruschwitz 2008) , movies (Michelucci 2013) etc."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-67",
"text": "While such games have traditionally focused on human-human collaboration, we extend these ideas to human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-68",
"text": "Rather than collecting labeled data, our game is designed to measure the effectiveness of the AI in the context of human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-69",
"text": "Evaluating Conversational Agents."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-70",
"text": "Goal-driven (nonvisual) conversational models have typically been evaluated on task-completion rate or time-to-task-completion (Paek 2001) , so shorter conversations are better."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-71",
"text": "At the other end of the spectrum, free-form conversation models are often evaluated by metrics that rely on n-gram overlaps, such as BLEU, METEOR, ROUGE, but these have been shown to correlate poorly with human judgment (Liu et al. 2016) ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-72",
"text": "Human evaluation of conversations is typically in the format where humans rate the quality of machine utterances given context, without actually taking part in the conversation, as in (Das et al. 2017b ) and (Li et al. 2016) ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-73",
"text": "To the best of our knowledge, we are the first to evaluate conversational models via team performance where humans are continuously interacting with agents to succeed at a downstream task."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-74",
"text": "Turing Test."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-75",
"text": "Finally, our GuessWhich game is in line with ideas in (Grosz 2012), re-imagining the traditional Turing Test for state-of-the-art AI systems: it takes the pragmatic view that an effective AI teammate need not appear human-like or be mistaken for one, provided its behavior does not feel jarring or baffle teammates, leaving them wondering not about what it is thinking but whether it is."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-76",
"text": "Next, we formally define the AI agent ALICE (Sec. 3), describe the GuessWhich game setup (Sec. 4 and 5), and present results and analysis from human studies (Sec. 6)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-78",
"text": "**THE AI: ALICE**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-79",
"text": "Recall from Section 1 that our goal is to evaluate how progress in AI measured through automatic evaluation translates to performance of human-AI teams in the context of visual conversational agents."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-80",
"text": "Specifically, we are considering the question-answering agent ABOT from (Das et al. 2017b) as ABOT is the agent more likely to be deployed with a human partner in real applications (e.g. to answer questions about visual content to aid a visually impaired user)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-81",
"text": "For completeness, we will review this work in this section."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-82",
"text": "(Das et al. 2017b) formulate a self-supervised image-guessing task between a questioner bot (QBOT) and an answerer bot (ABOT) which plays out over multiple rounds of dialog."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-83",
"text": "At the start of the task, QBOT and ABOT are shown a one sentence description (i.e. a caption) of an image (unknown to QBOT)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-84",
"text": "The pair can then engage in question and answer based dialog for a fixed number of iterations after which QBOT must try to select the secret image from a pool."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-85",
"text": "The goal of the QBOT-ABOT team is two-fold, QBOT should: 1) build a mental model of the unseen image purely from the dialog and 2) be able to retrieve that image from a line-up of images."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-86",
"text": "Both QBOT and ABOT are modeled as Hierarchical Recurrent Encoder-Decoder neural networks (Das et al. 2017a) which encode each round of dialog independently via a recurrent neural network (RNN) before accumulating this information through time with an additional RNN (resulting in a hierarchical encoding)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-87",
"text": "This representation (and a convolutional neural network based image encoding in ABOT's case) are used as input to a decoder RNN which produces an agent's utterance (question for QBOT and answer for ABOT) based on the dialog (and image for ABOT)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-88",
"text": "In addition, QBOT includes an image feature regression network that predicts a representation of the secret image based on dialog history."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-89",
"text": "We refer to (Das et al. 2017b) for complete model details."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-90",
"text": "These agents are pre-trained with supervised dialog data from the VisDial dataset (Das et al. 2017a ) with a Maximum Likelihood Estimation objective."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-91",
"text": "This pre-training ensures that agents can generally recognize objects/scenes and utter English."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-92",
"text": "Following this, the models are fine-tuned by 'smoothly' transitioning to a deep reinforcement learning framework to directly improve image-guessing performance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-93",
"text": "This annealed transition avoids abrupt divergence of the dialog in face of an incorrect question-answer pair in the QBOT-ABOT exchange."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-94",
"text": "During RL based self-talk, the agents' parameters are updated by gradients corresponding to rewards depending on individual good or bad exchanges."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-95",
"text": "We refer to the baseline supervised learning based ABOT as ALICE SL and the RL fine-tuned bot as ALICE RL ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-96",
"text": "(Das et al. 2017b) found that the AI-AI pair succeeds in retrieving the correct image more often after being fine-tuned with RL."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-97",
"text": "In the following section, we outline our GuessWhich game designed to evaluate whether this improvement between ALICE SL and ALICE RL in automatic metrics translates to human-AI collaborations."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-99",
"text": "**OUR GUESSWHICH GAME**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-100",
"text": "We begin by describing our game setting, outlining the players and gameplay mechanics."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-101",
"text": "A video of an example game being played can be found at https://vimeo.com/229488160."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-105",
"text": "We replace QBOT in the AI-AI dialog with humans to perform a collaborative task of identifying a secret image from a pool."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-106",
"text": "In the following, we will refer to ABOT as ALICE and the human player as H. We evaluate two versions of ALICE -ALICE SL and ALICE RL, where SL and RL correspond to agents trained in a supervised setting and fine-tuned with reinforcement learning, respectively."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-107",
"text": "Gameplay."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-108",
"text": "In our game setting, ALICE is assigned a secret image I c (unknown to H) from a pool of images I = {I 1 , I 2 , ..., I n } taken from the COCO dataset (Lin et al. 2014) ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-109",
"text": "Figure 2: GuessWhich interface. A user asks a question to ALICE in each round and ALICE responds with an answer."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-110",
"text": "The user then selects an appropriate image which they think is the secret image after each round of conversation."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-111",
"text": "At the end of the dialog, the user successively clicks on their best guesses until they correctly identify the secret image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-112",
"text": "Prior to beginning the dialog, both ALICE and H are provided a brief description (i.e. a caption) of I c generated by Neuraltalk2 (Karpathy 2016), an open-source implementation of (Vinyals et al. 2015)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-113",
"text": "H then makes a guess about the secret image by selecting one from the pool I based only on the caption, i.e. before the dialog begins."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-114",
"text": "In each of the following rounds, H asks ALICE a question q t about the secret image I c in order to better identify it from the pool and ALICE responds with an answer a t ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-115",
"text": "After each round, H must select an image I t that they feel is most likely the secret image I c from pool I based on the dialog so far."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-116",
"text": "At the end of k = 9 rounds of dialog, H is asked to successively click on their best guess."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-117",
"text": "At each click, the interface gives H feedback on whether their guess is correct or not and this continues until H guesses the true secret image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-118",
"text": "In this way, H induces a partial ranking of the pool up to the secret image based on their mental model of I c from the dialog."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-120",
"text": "**POOL SELECTION**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-121",
"text": "When creating a pool of images, our aim is to ensure that the game is challenging and engaging, and not too easy or too hard."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-122",
"text": "Thus, we construct each pool of images I in two steps -first, we choose the secret image I c , and then sample similar images as distractors for I c ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-123",
"text": "Fig. 2 shows a screenshot of our game interface including a sample image pool and chat."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-124",
"text": "Secret Image Selection."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-125",
"text": "VisDial v0.5 is constructed on 68k COCO images which contain complex everyday scenes with 80 object categories."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-126",
"text": "ABOT is trained and validated on VisDial v0.5 train and val splits respectively."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-127",
"text": "As the images for both these splits come from COCO-train, we sample secret images and pools from COCO-validation to avoid overlap."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-128",
"text": "To select representative secret images and diverse image pools, we do the following."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-129",
"text": "For each image in the COCO validation set, we extract the penultimate layer ('fc7') activations of a standard deep convolutional neural network (VGG-19 from (Simonyan and Zisserman 2015))."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-130",
"text": "For each of the 80 categories, we average the embedding vector of all images containing that category."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-131",
"text": "We then pick those images closest to the mean embeddings, yielding 80 candidates."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-132",
"text": "Generating Distractor Images."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-133",
"text": "The distractor images are designed to be semantically similar to the secret image I c ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-134",
"text": "For each candidate secret image, we created 3 concentric hyper-spheres as euclidean balls (of radii increasing in arithmetic progression) centered on the candidate secret image in fc7 embedding space, and sampled images from each sphere in a fixed proportion to generate a pool corresponding to the secret image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-135",
"text": "The radius of the largest sphere was varied and manually validated to ensure pool difficulty."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-136",
"text": "The sampling proportion can be varied to generate pools of varying difficulty."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-137",
"text": "Of the 80 candidate pools, we picked 10 that were of medium difficulty based on manual inspection."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-139",
"text": "**DATA COLLECTION AND PLAYER REWARD STRUCTURE**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-140",
"text": "We use AMT to solicit human players for our game."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-141",
"text": "Each Human Intelligence Task (HIT) consists of 10 games (each game corresponds to one pool) and we find that overall 76.7% of users who started a HIT completed it, i.e. played all 10 games."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-142",
"text": "We note that incomplete game data was discarded and does not contribute to the analysis presented in subsequent sections."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-143",
"text": "We published HITs until 28 games with both ALICE SL and ALICE RL were completed."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-144",
"text": "This results in a total of 560 games split between the agents, with each game consisting of 9 rounds of dialog and 10 rounds of guessing."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-145",
"text": "Workers are paid a base pay of $5 per HIT (\u223c$10/hour)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-146",
"text": "To incentivize workers to try their best at guessing the secret image, workers are paid a two-part bonus -(1) based on the number of times their best guess matched the true secret image after each round (up to $1 per HIT), and (2) based on the rank of the true secret image in their final sorting at the end of dialog (up to $2 per HIT)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-147",
"text": "This final ranking explicitly captures the workers' mental model of the secret image (unlike the per-round, best-guess estimates), and is closer to the overall purpose of the game (identifying the secret image at the end of the dialog)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-148",
"text": "As such, this final sorting is given a higher potential bonus."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-150",
"text": "**EVALUATION**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-151",
"text": "Since the game is structured as a retrieval task, we evaluate the human-AI collaborative performance using standard retrieval metrics."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-152",
"text": "Note that the successive selection of images by H at the end of the dialog tells us the rank of the true secret image in a sorting of the image pool based on H's mental model."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-153",
"text": "For example, if H makes 4 guesses before correctly selecting the secret image, then H's mental model ranked the secret image 5th within the pool."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-154",
"text": "To evaluate human-AI collaboration, we use the following metrics: (1) Mean Rank (MR), which is the mean rank of the secret image (i.e. number of guesses it takes to identify the secret image)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-155",
"text": "Lower values indicate better performance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-156",
"text": "(2) Mean Reciprocal Rank (MRR), which is the mean of the reciprocal of the rank of the secret image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-157",
"text": "MRR penalizes differences in lower ranks (e.g., between 1 and 2) greater than those in higher ranks (e.g., between 19 and 20)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-158",
"text": "Higher values indicate better performance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-159",
"text": "At the end of each round, H makes their best guess of the secret image."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-160",
"text": "To get a coarse estimate of the rank of the secret image in each round, we sort the image pool based on distance in fc7 embedding space from H's best guess."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-161",
"text": "This can be used to assess accuracy of H's mental model of the secret image after each round of dialog (e.g., Fig. 4b )."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-163",
"text": "**INFRASTRUCTURE**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-164",
"text": "We briefly outline the backend architecture of GuessWhich in this section."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-165",
"text": "Unlike most human-labeling tasks that are one-way and static in nature (i.e., only involving a human labeling static data), evaluating AI agents via our game requires live interaction between the AI agent and the human."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-166",
"text": "We develop a robust workflow that can maintain a queue of workers and pair them up in real-time with an AI agent."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-167",
"text": "We deploy ALICE SL and ALICE RL on an AWS EC2 (AWS 2017) GPU instance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-168",
"text": "We use Django (a Model-ViewController web framework written in Python) which helps in monitoring HITs in real-time."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-169",
"text": "We use (RabbitMQ 2017) , an open source message broker, to queue inference jobs that generate dialog responses from the model."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-170",
"text": "Our backend is Figure 3 : We outline the backend architecture of our implementation of GuessWhich."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-171",
"text": "Since GuessWhich requires a live interaction between the human and the AI, we design a workflow that can handle multiple queues and can quickly pair a human with an AI agent."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-172",
"text": "asynchronously connected to the client browser via websockets such that whenever an inference job is completed, a websocket polls the AI response and delivers it to the human in real-time."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-173",
"text": "We store and fetch data efficiently to and from a PostgreSQL database."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-174",
"text": "Fig. 3 shows a schematic diagram of the backend architecture."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-175",
"text": "Our complete backend infrastructure and code will be made publicly available for others to easily make use of our human-AI game interface."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-202",
"text": "Interestingly, we observe that AI-ALICE teams outperform human-ALICE teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-203",
"text": "On average, a QBOT (SL)-ALICE SL team takes about 5.6 guesses to arrive at the correct secret image (as opposed to 6.86 guesses for a human-ALICE SL Table 2 : Performance of Human-ALICE and QBOT-ALICE teams measured by MR (lower is better)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-204",
"text": "We observe that AI-AI teams outperform human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-205",
"text": "team)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-206",
"text": "Similarly, a QBOT (RL)-ALICE RL team takes 4.7 guesses as opposed to a human-ALICE RL team which takes 7.19 guesses."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-207",
"text": "When we compare AI-AI teams (see Row 2 and 3) under different settings, we observe that teams having QBOT (RL) as the questioner outperform those with QBOT (SL)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-208",
"text": "Qualitatively, we found that QBOT (SL) tends to ask repeating questions in a dialog and that questions from QBOT (RL) tend to be more visually grounded compared to QBOT (SL)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-209",
"text": "Also, note that among the four teams ALICE does not seem to affect performance across SL and RL."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-210",
"text": "Since we observe that QBOT (RL) tends to be a better questioner on average compared to QBOT (SL), as future work, it will be interesting to explore a setting where we evaluate QBOT via a similar game with the human playing the role of answerer in a QBOT-human team."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-211",
"text": "MR with varying rounds of dialog."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-212",
"text": "Fig. 4b shows a coarse estimate of the mean rank of the secret image across rounds of a dialog, averaged across games and workers."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-213",
"text": "As explained in Sec. 4.3, image ranks are computed via distance in embedding space from the guessed image (and hence, are only an estimate)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-214",
"text": "We see that the human-ALICE team performs about the same for both ALICE SL and ALICE RL across rounds of dialog in a game."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-215",
"text": "When compared with a baseline agent that makes random guesses after every round of dialog, the human-ALICE team clearly performs better."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-216",
"text": "Worker ratings for ALICE SL and ALICE RL on 6 metrics."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-217",
"text": "Higher is better."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-218",
"text": "Error bars are 95% confidence intervals from 1000 bootstrap samples."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-219",
"text": "Humans perceive no significant differences between ALICE SL and ALICE RL across the 6 feedback metrics."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-220",
"text": "Statistical tests."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-221",
"text": "Observe that on both the metrics (MR and MRR), the differences between performances of ALICE SL and ALICE RL are within error margins."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-222",
"text": "Since both standard error and bootstrap based 95% confidence intervals overlap significantly, we ran further statistical tests."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-223",
"text": "We find no significant difference between the mean ranks of ALICE SL and ALICE RL under a Mann-Whitney U test (p = 0.44)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-224",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-225",
"text": "**HUMAN PERCEPTION OF AI TEAMMATE**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-226",
"text": "At the end of each HIT, we asked workers for feedback on ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-227",
"text": "Specifically, we asked workers to rate ALICE on a 5-point scale (where 1=Strongly disagree, 5=Strongly agree), along 6 dimensions."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-228",
"text": "As shown in Fig. 5 , ALICE was rated on -how accurate they thought it was (accuracy), how consistent its answers were with its previous answers (consistency), how well it understood the secret image (image understanding), how detailed its answers were (detail), how well it seemed to understand their questions (question understanding) and how fluent its answers were (fluency)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-229",
"text": "We see in Fig. 5 that humans perceive both ALICE SL and ALICE RL as comparable in terms of all metrics."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-230",
"text": "The small differences in perception are not statistically significant."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-231",
"text": "Fig. 6 shows the distribution of questions that human subjects ask ALICE in GuessWhich."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-232",
"text": "Akin to the format of the human-human GuessWhat game, we observe that binary (yes/no) questions are overwhelmingly the most common question type, for instance, \"Is there/the/he ...?\" (region shaded yellow in the figure), \"Are there ...?\" (region shaded red), etc."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-233",
"text": "The next most frequent question is \"What color ...?\"."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-234",
"text": "These questions may be those that help the human discriminate the secret image the best."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-235",
"text": "It could also be that humans are attempting to play to the perceived strengths of ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-236",
"text": "As people play multiple games with ALICE, it is possible that they discover ALICE's strengths and learn to ask questions that play to its strengths."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-237",
"text": "Another common question type is counting questions, such as \"How many ...?\"."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-238",
"text": "Interestingly, some workers adopt the strategy of querying ALICE with a single word (e.g., nouns such as \"people\", \"pictures\", etc.) or a phrase (e.g., \"no people\", Figure 6 : Distribution of first n-grams for questions asked to ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-239",
"text": "Word ordering starts from the center and radiates outwards."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-240",
"text": "Arc length is proportional to the number of questions containing the word."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-241",
"text": "The most common question-types are binary -followed by 'What color..' questions."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-242",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-243",
"text": "**QUESTIONING STRATEGIES**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-244",
"text": "\"any cars\", etc.)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-245",
"text": "This strategy, while minimizing human effort, does not appear to change ALICE's performance."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-246",
"text": "Fig. 7 shows a game played by two different subjects."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-247",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-248",
"text": "**CHALLENGES**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-249",
"text": "There exist several challenges that are unique to human computation in the context of evaluating human-AI teams, for instance, making our games engaging while still ensuring fair and accurate evaluation."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-250",
"text": "In this section, we briefly discuss some of the challenges we faced and our solutions to them."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-251",
"text": "Knowledge Leak."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-252",
"text": "It has been shown that work division in crowdsourcing tasks follows a Pareto principle (Little 2009), as a small fraction of workers usually complete a majority of the work."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-253",
"text": "In the context of evaluating an AI based on performance of a human-AI team, this poses a challenge."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-254",
"text": "Recently, (Chandrasekaran et al. 2017) showed that human subjects can predict the responses of an AI more accurately with higher familiarity with the AI."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-255",
"text": "That is, a human's knowledge gained from familiarity with their AI teammate, can bias the performance of the human-AI team -knowledge from previous tasks might leak to later tasks."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-256",
"text": "To prevent a biased evaluation of team performance due to human subjects who have differing familiarity with ALICE, every person only plays a fixed number of games (10) with ALICE."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-257",
"text": "Thus, a human subject can only accept one task on AMT, which involves playing 10 games."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-258",
"text": "The downside to this is that our ability to conduct a fair evaluation of an AI in an interactive, game-like setting is constrained by the number of unique workers who accept our tasks."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-259",
"text": "Engagement vs. Fairness."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-260",
"text": "In order to improve userengagement while playing our games, we offer subjects performance-based incentives that are tied to the success of Figure 7: We contrast two games played by different workers with ALICE SL and ALICE RL on the same pool (secret image outlined in green)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-261",
"text": "In both cases, the workers are able to find the secret image within three guesses."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-262",
"text": "It is also interesting to note how the answers provided by ALICE are different in the two cases."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-263",
"text": "the human-AI team."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-264",
"text": "There is one potential issue with this however."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-265",
"text": "Owing to the inherent complexity of the visual dialog task, ALICE tends to be inaccurate at times."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-266",
"text": "This increases both the difficulty and unpredictability of the game, as it tends to be more accurate for certain types of questions compared to others."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-267",
"text": "We observe that this often leads to unsuccessful game-plays, sometimes due to errors accumulating from successive incorrect responses from ALICE to questions from the human."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-268",
"text": "In a few other cases, the human is misled by ALICE by a single wrong answer or by the seed caption that tends to be inaccurate at times."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-269",
"text": "While we would like to keep subjects engaged in the game to the best extent possible by providing performance-based incentives, issuing a performance bonus that depends on both the human and ALICE (who is imperfect), can be dissatisfying."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-270",
"text": "To be fair to the subjects performing the task while still rewarding good performance, we split our overall budget for each HIT into a suitable fraction between the base pay (majority), and the performance bonus."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-271",
"text": "----------------------------------"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-272",
"text": "**CONCLUSION**"
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-273",
"text": "In contrast to the common practice of measuring AI progress in isolation, our work proposes benchmarking AI agents via interactive downstream tasks (cooperative games) performed by human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-274",
"text": "In particular, we evaluate visual conversational agents in the context of human-AI teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-275",
"text": "We design a cooperative game -GuessWhich -that involves a human engaging in a dialog with an answerer-bot (ALICE) to identify a secret image known to ALICE but unknown to the human from a pool of images."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-276",
"text": "At the end of the dialog, the human is asked to pick out the secret image from the image pool by making successive guesses."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-277",
"text": "We find that AL-ICE RL (fine-tuned with reinforcement learning) that has been found to be more accurate in AI literature than it's supervised learning counterpart when evaluated via a questioner bot (QBOT)-ALICE team, is not more accurate when evaluated via a human-ALICE team."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-278",
"text": "This suggests that there is a disconnect between between benchmarking of AI in isolation versus in the context of human-AI interaction."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-279",
"text": "An interesting direction of future work could be to evaluate QBOT via QBOT-human teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-280",
"text": "We describe the game structure and the backend architecture and discuss the unique computation and infrastructure challenges that arise when designing such live interactive settings on AMT relative to static human-labeling tasks."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-281",
"text": "Our code and infrastructure will be made publicly available."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-180",
"text": "We compare the performance of the two agents ALICE SL and ALICE RL in the GuessWhich game."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-181",
"text": "These bots are state-ofthe-art visual dialog agents with respect to emulating human responses and generating visually discriminative responses in AI-AI dialog."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-182",
"text": "(Das et al. 2017b ) evaluate these agents against strong baselines and report AI-AI team results that are significantly better than chance on a pool of \u223c10k images (rank \u223c1000 for SL, rank \u223c500 for RL)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-183",
"text": "In addition to evaluating them in the context of human-AI teams we also report QBOT-ALICE team performances for reference."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-184",
"text": "In Table 1 , we compare the performances of human-ALICE SL and human-ALICE RL teams according to Mean Rank (MR) and Mean Reciprocal Rank (MRR) of the secret image based on the guesses H makes at the end of dialog."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-185",
"text": "We observe that at the end of each game (9 rounds of dialog), human subjects correctly guessed the secret image on their 6.86th attempt (Mean Rank) when ALICE SL was their teammate."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-186",
"text": "With ALICE RL as their teammate, the average number of guesses required was 7.19."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-187",
"text": "We also observe that ALICE RL outperforms ALICE SL on the MRR metric."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-188",
"text": "On both metrics, however, the differences are within the standard error margins (reported in the table) and not statisti- Table 1 : Performance of Human-ALICE teams with AL-ICE SL and ALICE RL measured by MR (lower is better) and MRR (higher is better)."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-189",
"text": "Error bars are 95% CIs from 1000 bootstrap samples."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-190",
"text": "Unlike (Das et al., 2017b) , we find no significant difference between ALICE SL and ALICE RL ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-191",
"text": "cally significant."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-192",
"text": "As we collected additional data, the error margins became smaller but the means also became closer."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-193",
"text": "This interesting finding stands in stark contrast to the results reported by (Das et al. 2017b) , where ALICE RL was found to be significantly more accurate than ALICE SL when evaluated in an AI-AI team."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-194",
"text": "Our results suggest that the improvements of RL over SL (in AI-AI teams) do not seem to translate to when the agents are paired with a human in a similar setting."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-195",
"text": "MR with varying number of games."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-196",
"text": "In Fig. 4a , we plot the mean rank (MR) of the secret image across different games."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-197",
"text": "We see that the human-ALICE team performs about the same for both ALICE SL and ALICE RL except Game 5, where ALICE SL seems to marginally outperform ALICE RL ."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-198",
"text": "We compare the performance of these teams against a baseline model that makes a string of random guesses at the end of the game."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-199",
"text": "The human-ALICE teams outperforms this random baseline with a relative improvement of about 25%."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-200",
"text": "AI-ALICE teams versus human-ALICE teams."
},
{
"sent_id": "9795a839cb79ed971de4c325e01e74-C001-201",
"text": "In Table 2 , we compare team performances by pairing three kinds of questioners -human, QBOT (SL) and QBOT (RL) with AL-ICE SL and ALICE RL (6 teams in total) to gain insights about how the questioner and ALICE influence team performances."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"9795a839cb79ed971de4c325e01e74-C001-18"
],
[
"9795a839cb79ed971de4c325e01e74-C001-29"
],
[
"9795a839cb79ed971de4c325e01e74-C001-32"
],
[
"9795a839cb79ed971de4c325e01e74-C001-64"
],
[
"9795a839cb79ed971de4c325e01e74-C001-65"
],
[
"9795a839cb79ed971de4c325e01e74-C001-72"
],
[
"9795a839cb79ed971de4c325e01e74-C001-82"
],
[
"9795a839cb79ed971de4c325e01e74-C001-182"
]
],
"cite_sentences": [
"9795a839cb79ed971de4c325e01e74-C001-18",
"9795a839cb79ed971de4c325e01e74-C001-29",
"9795a839cb79ed971de4c325e01e74-C001-32",
"9795a839cb79ed971de4c325e01e74-C001-64",
"9795a839cb79ed971de4c325e01e74-C001-65",
"9795a839cb79ed971de4c325e01e74-C001-72",
"9795a839cb79ed971de4c325e01e74-C001-82",
"9795a839cb79ed971de4c325e01e74-C001-182"
]
},
"@SIM@": {
"gold_contexts": [
[
"9795a839cb79ed971de4c325e01e74-C001-37",
"9795a839cb79ed971de4c325e01e74-C001-38"
],
[
"9795a839cb79ed971de4c325e01e74-C001-41",
"9795a839cb79ed971de4c325e01e74-C001-42",
"9795a839cb79ed971de4c325e01e74-C001-43",
"9795a839cb79ed971de4c325e01e74-C001-44"
]
],
"cite_sentences": [
"9795a839cb79ed971de4c325e01e74-C001-38",
"9795a839cb79ed971de4c325e01e74-C001-44"
]
},
"@USE@": {
"gold_contexts": [
[
"9795a839cb79ed971de4c325e01e74-C001-37",
"9795a839cb79ed971de4c325e01e74-C001-38"
],
[
"9795a839cb79ed971de4c325e01e74-C001-41",
"9795a839cb79ed971de4c325e01e74-C001-42",
"9795a839cb79ed971de4c325e01e74-C001-43",
"9795a839cb79ed971de4c325e01e74-C001-44"
]
],
"cite_sentences": [
"9795a839cb79ed971de4c325e01e74-C001-38",
"9795a839cb79ed971de4c325e01e74-C001-44"
]
},
"@DIF@": {
"gold_contexts": [
[
"9795a839cb79ed971de4c325e01e74-C001-57"
],
[
"9795a839cb79ed971de4c325e01e74-C001-190"
],
[
"9795a839cb79ed971de4c325e01e74-C001-192",
"9795a839cb79ed971de4c325e01e74-C001-193"
]
],
"cite_sentences": [
"9795a839cb79ed971de4c325e01e74-C001-57",
"9795a839cb79ed971de4c325e01e74-C001-190",
"9795a839cb79ed971de4c325e01e74-C001-193"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"9795a839cb79ed971de4c325e01e74-C001-80"
],
[
"9795a839cb79ed971de4c325e01e74-C001-89"
]
],
"cite_sentences": [
"9795a839cb79ed971de4c325e01e74-C001-80",
"9795a839cb79ed971de4c325e01e74-C001-89"
]
}
}
},
"ABC_d1decbc03929cbf67a412d0a3a2a66_4": {
"x": [
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-2",
"text": "We argue that groups of unannotated texts with overlapping and non-contradictory semantics represent a valuable source of information for learning semantic representations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-3",
"text": "A simple and efficient inference method recursively induces joint semantic representations for each group and discovers correspondence between lexical entries and latent semantic concepts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-4",
"text": "We consider the generative semantics-text correspondence model (Liang et al., 2009) and demonstrate that exploiting the noncontradiction relation between texts leads to substantial improvements over natural baselines on a problem of analyzing human-written weather forecasts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-7",
"text": "In recent years, there has been increasing interest in statistical approaches to semantic parsing."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-8",
"text": "However, most of this research has focused on supervised methods requiring large amounts of labeled data."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-9",
"text": "The supervision was either given in the form of meaning representations aligned with sentences (Zettlemoyer and Collins, 2005; Ge and Mooney, 2005; Mooney, 2007) or in a somewhat more relaxed form, such as lists of candidate meanings for each sentence (Kate and Mooney, 2007; Chen and Mooney, 2008) or formal representations of the described world state for each text (Liang et al., 2009) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-10",
"text": "Such annotated resources are scarce and expensive to create, motivating the need for unsupervised or semi-supervised techniques (Poon and Domingos, 2009 )."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-11",
"text": "However, unsupervised methods have their own challenges: they are not always able to discover semantic equivalences of lexical entries or logical forms or, on the contrary, cluster semantically different or even opposite expressions (Poon and Domingos, 2009 )."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-12",
"text": "Unsupervised approaches can only rely on distributional similarity of contexts (Harris, 1968) to decide on semantic relatedness of terms, but this information may be sparse and not reliable (Weeds and Weir, 2005) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-13",
"text": "For example, when analyzing weather forecasts it is very hard to discover in an unsupervised way which of the expressions among \"south wind\", \"wind from west\" and \"southerly\" denote the same wind direction and which are not, as they all have a very similar distribution of their contexts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-14",
"text": "The same challenges affect the problem of identification of argument roles and predicates."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-15",
"text": "In this paper, we show that groups of unannotated texts with overlapping and non-contradictory semantics provide a valuable source of information."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-16",
"text": "This form of weak supervision helps to discover implicit clustering of lexical entries and predicates, which presents a challenge for purely unsupervised techniques."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-17",
"text": "We assume that each text in a group is independently generated from a full latent semantic state corresponding to the group."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-18",
"text": "Importantly, the texts in each group do not have to be paraphrases of each other, as they can verbalize only specific parts (aspects) of the full semantic state, yet statements about the same aspects must not contradict each other."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-19",
"text": "Simultaneous inference of the semantic state for the noncontradictory and semantically overlapping documents would restrict the space of compatible hypotheses, and, intuitively, 'easier' texts in a group will help to analyze the 'harder' ones."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-20",
"text": "1 As an illustration of why this weak supervision may be valuable, consider a group of two non-contradictory texts, where one text mentions \"2.2 bn GBP decrease in profit\", whereas another one includes a passage \"profit fell by 2.2 billion pounds\"."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-21",
"text": "Even if the model has not observed gust min=0, max=29, min=20, max=32, mean=26) thunderChance(time=6-21,mode=chance) freezingRainChance (time=17-30,mode=--) sleetChance (time='6-21',mode=--) skycover (time=6-21,bucket=75-100) windSpeed (time=6-21; min=14,max=22,mean=19, bucket=10-20) rainChance(time=6-21,mode=chance)"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-22",
"text": "windChill (time=6-21,min=0,max=0,mean=0) ...... Figure 1 : An example of three non-contradictory weather forecasts and their alignment to the semantic representation."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-23",
"text": "Note that the semantic representation (the block in the middle) is not observable in training."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-24",
"text": "the word \"fell\" before, it is likely to align these phrases to the same semantic form because of similarity of their arguments."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-25",
"text": "And this alignment would suggest that \"fell\" and \"decrease\" refer to the same process, and should be clustered together."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-26",
"text": "This would not happen for the pair \"fell\" and \"increase\" as similarity of their arguments would normally entail contradiction."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-27",
"text": "Similarly, in the example mentioned earlier, when describing a forecast for a day with expected south winds, texts in the group can use either \"south wind\" or \"southerly\" to indicate this fact but no texts would verbalize it as \"wind from west\", and therefore these expressions will be assigned to different semantic clusters."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-28",
"text": "However, it is important to note that the phrase \"wind from west\" may still appear in the texts, but in reference to other time periods, underlying the need for modeling alignment between grouped texts and their latent meaning representation."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-29",
"text": "As much of the human knowledge is redescribed multiple times, we believe that noncontradictory and semantically overlapping texts are often easy to obtain."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-30",
"text": "For example, consider semantic analysis of news articles or biographies."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-31",
"text": "In both cases we can find groups of documents referring to the same events or persons, and though they will probably focus on different aspects and have different subjective passages, they are likely to agree on the core information (Shinyama and Sekine, 2003) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-32",
"text": "Alternatively, if such groupings are not available, it may still be easier to give each semantic representation (or a state) to multiple annotators and ask each of them to provide a textual description, instead of annotating texts with semantic expressions."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-33",
"text": "The state can be communicated to them in a visual or audio form (e.g., as a picture or a short video clip) ensuring that their interpretations are consistent."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-34",
"text": "Unsupervised learning with shared latent semantic representations presents its own challenges, as exact inference requires marginalization over possible assignments of the latent semantic state, consequently, introducing non-local statistical dependencies between the decisions about the semantic structure of each text."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-35",
"text": "We propose a simple and fairly general approximate inference algorithm for probabilistic models of semantics which is efficient for the considered model, and achieves favorable results in our experiments."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-36",
"text": "In this paper, we do not consider models which aim to produce complete formal meaning of text (Zettlemoyer and Collins, 2005; Mooney, 2007; Poon and Domingos, 2009) , instead focusing on a simpler problem studied in (Liang et al., 2009) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-37",
"text": "They investigate grounded language acquisition set-up and assume that semantics (world state) can be represented as a set of records each consisting of a set of fields."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-38",
"text": "Their model segments text into utterances and identifies records, fields and field values discussed in each utterance."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-39",
"text": "Therefore, one can think of this problem as an extension of the semantic role labeling problem (Carreras and Marquez, 2005) , where predicates (i.e. records in our notation) and their arguments should be identified in text, but here arguments are not only assigned to a specific role (field) but also mapped to an underlying equivalence class (field value)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-40",
"text": "For example, in the weather forecast domain field sky cover should get the same value given expressions \"overcast\" and \"very cloudy\" but a different one if the expres-sions are \"clear\" or \"sunny\"."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-41",
"text": "This model is hard to evaluate directly as text does not provide information about all the fields and does not necessarily provide it at the sufficient granularity level."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-42",
"text": "Therefore, it is natural to evaluate their model on the database-text alignment problem (Snyder and Barzilay, 2007) , i.e. measuring how well the model predicts the alignment between the text and the observable records describing the entire world state."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-43",
"text": "We follow their set-up, but assume that instead of having access to the full semantic state for every training example, we have a very small amount of data annotated with semantic states and a larger number of unannotated texts with noncontradictory semantics."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-44",
"text": "We study our set-up on the weather forecast data (Liang et al., 2009) where the original textual weather forecasts were complemented by additional forecasts describing the same weather states (see figure 1 for an example)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-45",
"text": "The average overlap between the verbalized fields in each group of noncontradictory forecasts was below 35%, and more than 60% of fields are mentioned only in a single forecast from a group."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-46",
"text": "Our model, learned from 100 labeled forecasts and 259 groups of unannotated non-contradictory forecasts (750 texts in total), achieved 73.9% F 1 ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-47",
"text": "This compares favorably with 69.1% shown by a semi-supervised learning approach, though, as expected, does not reach the score of the model which, in training, observed semantics states for all the 750 documents (77.7% F 1 )."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-48",
"text": "The rest of the paper is structured as follows."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-49",
"text": "In section 2 we describe our inference algorithm for groups of non-contradictory documents."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-50",
"text": "Section 3 redescribes the semantics-text correspondence model (Liang et al., 2009) in the context of our learning scenario."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-51",
"text": "In section 4 we provide an empirical evaluation of the proposed method."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-52",
"text": "We conclude in section 5 with an examination of additional related work."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-54",
"text": "**INFERENCE WITH NON-CONTRADICTORY DOCUMENTS**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-55",
"text": "In this section we will describe our inference method on a higher conceptual level, not specifying the underlying meaning representation and the probabilistic model."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-56",
"text": "An instantiation of the algorithm for the semantics-text correspondence model is given in section 3.2."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-57",
"text": "Statistical models of parsing can often be regarded as defining the probability distribution of meaning m and its alignment a with the given text w, P (m, a, w) = P (a, w|m)P (m)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-58",
"text": "The semantics m can be represented either as a logical formula (see, e.g., (Poon and Domingos, 2009 )) or as a set of field values if database records are used as a meaning representation (Liang et al., 2009 )."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-104",
"text": "Liang et al. (2009) As explained in the introduction, the world states s are represented by sets of records (see the block in the middle of figure 1 for an example of a world state)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-59",
"text": "The alignment a defines how semantics is verbalized in the text w, and it can be represented by a meaning derivation tree in case of full semantic parsing (Poon and Domingos, 2009) or, e.g., by a hierarchical segmentation into utterances along with an utterance-field alignment in a more shallow variation of the problem."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-60",
"text": "In semantic parsing, we aim to find the most likely underlying semantics and alignment given the text:"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-61",
"text": "In the supervised case, where a and m are observable, estimation of the generative model parameters is generally straightforward."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-62",
"text": "However, in a semi-supervised or unsupervised case variational techniques, such as the EM algorithm (Dempster et al., 1977) , are often used to estimate the model."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-63",
"text": "As common for complex generative models, the most challenging part is the computation of the posterior distributions P (a, m|w) on the E-step which, depending on the underlying model P (m, a, w), may require approximate inference."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-64",
"text": "As discussed in the introduction, our goal is to integrate groups of non-contradictory documents into the learning procedure."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-65",
"text": "Let us denote by w 1 ,..., w K a group of non-contradictory documents."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-66",
"text": "As before, the estimation of the posterior probabilities P (m i , a i |w 1 . . ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-67",
"text": "w K ) presents the main challenge."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-68",
"text": "Note that the decision about m i is now conditioned on all the texts w j rather than only on w i ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-69",
"text": "This conditioning is exactly what drives learning, as the information about likely semantics m j of text j affects the decision about choice of m i :"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-70",
"text": "where"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-71",
"text": "is the probability of the semantics m i given all the meanings m \u2212i ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-72",
"text": "This probability assigns zero weight to inconsistent meanings, i.e. such mean-"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-73",
"text": "is not satisfiable, 2 and models dependencies between components in the composite meaning representation (e.g., arguments values of predicates)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-74",
"text": "As an illustration, in the forecast domain it may express that clouds, and not sunshine, are likely when it is raining."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-75",
"text": "Note, that this probability is different from the probability that m i is actually verbalized in the text."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-76",
"text": "Unfortunately, these dependencies between m i and w j are non-local."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-77",
"text": "Even though the dependencies are only conveyed via {m j : j = i} the space of possible meanings m is very large even for relatively simple semantic representations, and, therefore, we need to resort to efficient approximations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-78",
"text": "One natural approach would be to use a form of belief propagation (Pearl, 1982; Murphy et al., 1999) , where messages pass information about likely semantics between the texts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-79",
"text": "However, this approach is still expensive even for simple models, both because of the need to represent distributions over m and also because of the large number of iterations of message exchange needed to reach convergence (if it converges)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-80",
"text": "An even simpler technique would be to parse texts in a random order conditioning each meaning m k for k \u2208 {1,..., K} on all the previous semantics m k, and these decisions cannot be revised later."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-83",
"text": "We propose a simple algorithm which aims to find an appropriate order of the greedy inference by estimating how well each candidate semantic\u015d m k would explain other texts and at each step selecting k (andm k ) which explains them best."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-84",
"text": "The algorithm, presented in figure 2 3 , constructs an ordering of texts n = (n 1 ,..., n K )"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-87",
"text": "Figure 2: The approximate inference algorithm."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-88",
"text": "and corresponding meaning representations m = (m 1 ,..., m K ), where m k is the predicted meaning representation of text w n k ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-89",
"text": "It starts with an empty ordering n = () and an empty list of meanings m = () (line 1)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-90",
"text": "Then it iteratively predicts meaning representationsm j conditioned on the list of semantics m = (m 1 ,..., m i\u22121 ) fixed on the previous stages and does it for all the remaining texts w j (lines 3-5)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-91",
"text": "The algorithm selects a single meaningm j which maximizes the probability of all the remaining texts and excludes the text j from future consideration (lines 6-7)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-92",
"text": "Though the semantics m k (k / \u2208 n\u222a{j}) used in the estimates (line 6) can be inconsistent with each other, the final list of meanings m is guaranteed to be consistent."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-93",
"text": "It holds because on each iteration we add a single meaningm n i to m (line 7), and m n i is guaranteed to be consistent with m , as the semanticsm n i was conditioned on the meaning m during inference (line 4)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-94",
"text": "An important aspect of this algorithm is that unlike usual greedy inference, the remaining ('future') texts do affect the choice of meaning representations made on the earlier stages."
},
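As a concrete illustration of this ordering strategy, the loop can be sketched in a few lines (our simplified sketch, not the paper's implementation; `parse` and `explains` are hypothetical stand-ins for the model-specific computations of lines 4 and 6 of figure 2):

```python
def greedy_order(texts, parse, explains):
    """Simplified sketch of the inference algorithm of figure 2.

    parse(text, fixed)  -> candidate meaning consistent with the
                           already-fixed meanings (stand-in for line 4)
    explains(m, others) -> score of how well meaning m accounts for the
                           remaining texts (stand-in for line 6)
    """
    order, fixed = [], []
    remaining = list(range(len(texts)))
    while remaining:
        candidates = {j: parse(texts[j], fixed) for j in remaining}
        # Pick the text whose candidate meaning best explains the others:
        # 'future' texts influence which meaning gets fixed first.
        best = max(remaining,
                   key=lambda j: explains(candidates[j],
                                          [texts[k] for k in remaining
                                           if k != j]))
        order.append(best)
        fixed.append(candidates[best])
        remaining.remove(best)
    return order, fixed

# Toy instantiation: a "meaning" is just the set of words of a text.
parse = lambda text, fixed: set(text.split())
explains = lambda m, others: sum(len(m & set(o.split())) for o in others)
order, meanings = greedy_order(["a b", "a", "b"], parse, explains)
print(order)  # the most 'explanatory' text is parsed first
```

In the toy run, the text covering both words is fixed first, mirroring the intuition that 'easier', more informative texts should be analyzed before the harder ones.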
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-95",
"text": "As soon as semantics m k are inferred for every k, we find ourselves in the set-up of learning with unaligned semantic states considered in (Liang et al., 2009) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-96",
"text": "The induced alignments a 1 ,..., a K of semantics m to texts w 1 ,..., w K at the same time induce alignments between the texts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-97",
"text": "The problem of producing multiple sequence alignment, especially in the context of sentence alignments, has been extensively studied in NLP (Barzilay and Lee, 2003) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-98",
"text": "In this paper, we use semantic structures as a pivot for finding the best alignment in the hope that presence of meaningful text alignments will improve the quality of the resulting semantic structures by enforcing a form of agreement between them."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-100",
"text": "**A MODEL OF SEMANTICS**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-101",
"text": "In this section we redescribe the semantics-text correspondence model (Liang et al., 2009 ) with an extension needed to model examples with latent states, and also explain how the inference algorithm defined in section 2 can be applied to this model."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-103",
"text": "**MODEL DEFINITION**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-106",
"text": "There are n (t) records of type t and this number may change from document to document."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-107",
"text": "For example, there may be more than a single record of type wind speed, as they may refer to different time periods but all these records have the same set of fields, such as minimal, maximal and average wind speeds."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-108",
"text": "Each field has an associated type: in our experiments we consider only categorical and integer fields."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-109",
"text": "We write s (t) n,f = v to denote that n-th record of type t has field f set to value v."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-110",
"text": "Each document k verbalizes a subset of the entire world state, and therefore semantics m k of the document is an assignment to |m k | verbalized fields:"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-111",
"text": "nq,fq = v q ), where t q , n q , f q are the verbalized record types, records and fields, respectively, and v q is the assigned field value."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-112",
"text": "The probability of meaning m k then equals the probability of this assignment with other state variables left non-observable (and therefore marginalized out)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-113",
"text": "In this formalism checking for contradiction is trivial: two meaning representations contradict each other if they assign different values to the same field of the same record."
},
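Concretely, if a meaning representation is stored as a partial assignment keyed by (record type, record index, field), the contradiction test is a one-line dictionary comparison. A toy sketch of ours (the field names are invented for illustration):

```python
def contradicts(m1, m2):
    """Meanings contradict iff they assign different values to the
    same field of the same record."""
    return any(m1[k] != m2[k] for k in m1.keys() & m2.keys())

# Keys are (record type, record index, field); values are field values.
m_a = {("windDir", 0, "mode"): "S"}
m_b = {("windDir", 0, "mode"): "S", ("skycover", 0, "bucket"): "75-100"}
m_c = {("windDir", 0, "mode"): "W"}

print(contradicts(m_a, m_b))  # False: the shared field agrees
print(contradicts(m_a, m_c))  # True: same field, different values
```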
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-114",
"text": "The semantics-text correspondence model defines a hierarchical segmentation of text: first, it segments the text into fragments discussing different records, then the utterances corresponding to each record are further segmented into fragments verbalizing specific fields of that record."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-115",
"text": "An example of a segmented fragment is presented in figure 4 ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-116",
"text": "The model has a designated null-record which is aligned to words not assigned to any record."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-117",
"text": "Additionally there is a null-field in each record to handle words not specific to any field."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-118",
"text": "In figure 3 the corresponding graphical model is presented."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-119",
"text": "The formal definition of the model for documents w 1 ,..., w K sharing a semantic state is as follows:"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-120",
"text": "\u2022 Generation of world state s:"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-121",
"text": "-For each type \u03c4 \u2208 {1,..., T } choose a number of records of that type n (\u03c4 ) \u223c Unif(1,..., nmax)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-122",
"text": "nf for all fields f \u2208 F (\u03c4 ) from the type-specific distribution."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-123",
"text": "\u2022 Generation of the verbalizations, for each document Note that, when generating fields, the Markov chain is defined over fields and the transition parameters are independent of the field values r if ij ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-124",
"text": "On the contrary, when drawing a word, the distribution of words is conditioned on the value of the corresponding field."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-125",
"text": "The form of word generation distributions P (w|f ij , r if ij ) depends on the type of the field f i,j ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-126",
"text": "For categorical fields, the distribution of words is modeled as a distinct multinomial for each field value."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-127",
"text": "Verbalizations of numerical fields are generated via a perturbation on the field value r if ij : the value r if ij can be perturbed by either rounding it (up or down) or distorting (up or down, modeled by a geometric distribution)."
},
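One possible reading of this emission process, sketched with invented parameters (a rounding base of 5 and a uniform choice among perturbation modes; the paper learns these parameters instead):

```python
import random

def verbalize_number(value, rng, p_stop=0.5):
    """Toy perturbation model for an integer field value: either round
    it down/up to a multiple of 5, or distort it by a geometrically
    distributed offset in a random direction."""
    mode = rng.choice(["round_down", "round_up", "distort"])
    if mode == "round_down":
        return value - value % 5
    if mode == "round_up":
        return value + (-value) % 5
    offset = 0
    while rng.random() > p_stop:  # geometric number of unit steps
        offset += 1
    return value + rng.choice([-1, 1]) * offset

rng = random.Random(0)
print([verbalize_number(23, rng) for _ in range(5)])
```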
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-128",
"text": "The parameters corresponding to each form of generation are estimated during learning."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-129",
"text": "For details on these emission models, as well as for details on modeling record and field transitions, we refer the reader to the original publication (Liang et al., 2009 )."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-130",
"text": "In our experiments, when choosing a world state s, we generate the field values independently."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-131",
"text": "This is clearly a suboptimal regime as often there are very strong dependencies between field values: e.g., in the weather domain many record types contain groups of related fields defining minimal, maximal and average values of some parameter."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-132",
"text": "Extending the method to model, e.g., pairwise dependencies between field values is relatively straightforward."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-133",
"text": "As explained above, semantics of a text m is defined by the assignment of state variables s. Analogously, an alignment a between semantics m and a text w is represented by all the remaining latent variables: by the sequence of record types t = (t 1 ,..., t |t| ), choice of records r i for each t i , the field sequence f i and the segment length c ij for every field f ij ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-135",
"text": "**LEARNING AND INFERENCE**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-136",
"text": "We select the model parameters \u03b8 by maximizing the marginal likelihood of the data, where the data D is given in the form of groups w = {w 1 ,..., w K } sharing the same latent state: 5 max \u03b8 w\u2208D s"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-137",
"text": "To estimate the parameters, we use the Expectation-Maximization algorithm (Dempster et al., 1977) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-138",
"text": "When the world state is observable, learning does not require any approximations, as dynamic programming (a form of the forward-backward algorithm) can be used to infer the posterior distribution on the E-step (Liang et al., 2009) ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-139",
"text": "However, when the state is latent, dependencies are not local anymore, and approximate inference is required."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-140",
"text": "We use the algorithm described in section 2 (figure 2) to infer the state."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-141",
"text": "In the context of the semantics-text correspondence model, as we discussed above, semantics m defines the subset of admissible world states."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-142",
"text": "In order to use the algorithm, we need to understand how the conditional probabilities of the form P (m |m) are computed, as they play the key role in the inference procedure (see equation (2))."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-143",
"text": "If there is a contradiction (m \u22a5m) then P (m |m) = 0, conversely, if m is subsumed by m (m \u2192 m ) then this probability is 1. Otherwise, P (m |m) equals the probability of new assignments \u2227 |m \\m| q=1 (s"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-144",
"text": "(defined by m \\m) conditioned on the previously fixed values of s (given by m)."
},
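Under the independence assumption used in our experiments (field values generated independently), this conditional can be sketched as follows; the uniform `prior` is our invention for illustration:

```python
def cond_prob(m_new, m_fixed, prior):
    """P(m_new | m_fixed): zero on contradiction, one if subsumed,
    otherwise the probability of only the newly assigned fields."""
    p = 1.0
    for key, val in m_new.items():
        if key in m_fixed:
            if m_fixed[key] != val:
                return 0.0          # contradiction: zero weight
        else:
            p *= prior(key, val)    # pay only for new assignments
    return p

uniform = lambda key, val: 0.5      # toy prior over binary field values
fixed = {("rainChance", 0, "mode"): "chance"}
print(cond_prob({("rainChance", 0, "mode"): "chance"}, fixed, uniform))   # 1.0
print(cond_prob({("rainChance", 0, "mode"): "--"}, fixed, uniform))       # 0.0
print(cond_prob({("thunderChance", 0, "mode"): "--"}, fixed, uniform))    # 0.5
```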
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-145",
"text": "Summarizing, when predicting the most likely semanticsm j (line 4), for each span the decoder weighs alternatives of either (1) aligning this span to the previously induced meaning m , or (2) aligning it to a new field and paying the cost of generation of its value."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-146",
"text": "The exact computation of the most probable semantics (line 4 of the algorithm) is intractable, and we have to resort to an approximation."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-147",
"text": "Instead of predicting the most probable semanticsm j we search for the most probable pair (\u00e2 j ,m j ), thus assuming that the probability mass is mostly concentrated on a single alignment."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-148",
"text": "The alignment a j is then discarded and not used in any other computations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-149",
"text": "Though the most likely alignment\u00e2 j for a fixed semantic representationm j can be found efficiently using a Viterbi algorithm, computing the most probable pair (\u00e2 j ,m j ) is still intractable."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-150",
"text": "We use a modification of the beam search algorithm, where we keep a set of candidate meanings (partial semantic representations) and compute an alignment for each of them using a form of the Viterbi algorithm."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-151",
"text": "As soon as the meaning representations m are inferred, we find ourselves in the set-up studied in (Liang et al., 2009 ): the state s is no longer latent and we can run efficient inference on the E-step."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-152",
"text": "Though some fields of the state s may still not be specified by m , we prohibit utterances from aligning to these non-specified fields."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-153",
"text": "On the M-step of EM the parameters are estimated as proportional to the expected marginal counts computed on the E-step."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-154",
"text": "We smooth the distributions of values for numerical fields with convolution smoothing equivalent to the assumption that the fields are affected by distortion in the form of a two-sided geometric distribution with the success rate parameter equal to 0.67."
},
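The smoothing step can be sketched as an explicit convolution with a truncated two-sided geometric kernel (the truncation radius is our simplification for illustration):

```python
def geometric_kernel(p, radius):
    """Normalized two-sided geometric kernel: weight proportional to
    (1 - p) ** |d| for offsets d in [-radius, radius]."""
    w = [(1 - p) ** abs(d) for d in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]

def smooth_counts(counts, p=0.67, radius=3):
    """Convolve a histogram of integer field values with the kernel,
    spreading mass onto neighboring values (truncated at the edges)."""
    kernel = geometric_kernel(p, radius)
    out = [0.0] * len(counts)
    for i, c in enumerate(counts):
        for d, k in zip(range(-radius, radius + 1), kernel):
            if 0 <= i + d < len(counts):
                out[i + d] += c * k
    return out

print([round(x, 3) for x in smooth_counts([0, 0, 10, 0, 0])])
```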
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-155",
"text": "We use add-0.1 smoothing for all the remaining multinomial distributions."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-157",
"text": "**EMPIRICAL EVALUATION**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-158",
"text": "In this section, we consider the semi-supervised set-up, and present evaluation of our approach on on the problem of aligning weather forecast reports to the formal representation of weather."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-160",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-161",
"text": "To perform the experiments we used a subset of the weather dataset introduced in (Liang et al., 2009 )."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-162",
"text": "The original dataset contains 22,146 texts of 28.7 words on average, there are 12 types of records (predicates) and 36.0 records per forecast on average."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-163",
"text": "We randomly chose 100 texts along with their world states to be used as the labeled data."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-164",
"text": "6 To produce groups of noncontradictory texts we have randomly selected a subset of weather states, represented them in a visual form (icons accompanied by numerical and symbolic parameters) and then manually annotated these illustrations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-165",
"text": "These newly-produced forecasts, when combined with the original texts, resulted in 259 groups of non-contradictory texts (650 texts, 2.5 texts per group)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-166",
"text": "An example of such a group is given in figure 1 ."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-167",
"text": "The dataset is relatively noisy: there are inconsistencies due to annotation mistakes (e.g., number distortions), or due to different perception of the weather by the annotators (e.g., expressions such as 'warm' or 'cold' are subjective)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-168",
"text": "The overlap between the verbalized fields in each group was estimated to be below 35%."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-169",
"text": "Around 60% of fields are mentioned only in a single forecast from a group, consequently, the texts cannot be regarded as paraphrases of each other."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-170",
"text": "The test set consists of 150 texts, each corresponding to a different weather state."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-171",
"text": "Note that during testing we no longer assume that documents share the state, we treat each document in isolation."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-193",
"text": "The direct evaluation of the meaning recognition (i.e. semantic parsing) accuracy is not possible on this dataset, as the data does not contain information which fields are discussed."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-172",
"text": "We aimed to preserve approximately the same proportion of new and original examples as we had in the training set, therefore, we combined 50 texts originally present in the weather dataset with additional 100 newly-produced texts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-173",
"text": "We annotated these 100 texts by aligning each line to one or more records, 7 whereas for the original texts the alignments were already present."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-174",
"text": "Following Liang et al. (2009) we evaluate the models on how well they predict these alignments."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-175",
"text": "When estimating the model parameters, we followed the training regime prescribed by Liang et al. (2009)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-176",
"text": "Namely, 5 iterations of EM with a basic model (with no segmentation or coherence modeling), followed by 5 iterations of EM with the model which generates fields independently, and finally 5 iterations with the full model."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-177",
"text": "Only then, in the semi-supervised learning scenarios, did we add unlabeled data and run 5 additional iterations of EM."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-178",
"text": "Instead of prohibiting records from crossing punctuation, as suggested by Liang et al. (2009), in our implementation we disregard the words not attached to specific fields (attached to the null field, see section 3.1) when computing spans of records."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-179",
"text": "To speed up training, only a single record of each type is allowed to be generated when running inference for unlabeled examples on the E-step of the EM algorithm, as this significantly reduces the search space."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-180",
"text": "Table 1 caption: Results (precision, recall and F1) on the weather forecast dataset."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-181",
"text": "Similarly, though we preserved all records which refer to the first time period, for other time periods we removed all the records which declare that the corresponding event (e.g., rain or snowfall) is not expected to happen."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-182",
"text": "This preprocessing results in the oracle recall of 93%."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-183",
"text": "We compare our approach (Semi-superv, non-contr) with two baselines: basic supervised training on 100 labeled forecasts (Supervised BL) and semi-supervised training which disregards the non-contradiction relations (Semi-superv BL)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-184",
"text": "The learning regime, the inference procedure and the texts for the semi-supervised baseline were identical to the ones used for our approach; the only difference is that all the documents were modeled as independent."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-185",
"text": "Additionally, we report the results of the model trained with all 750 texts labeled (Supervised UB); its scores can be regarded as an upper bound on the results of the semi-supervised models."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-186",
"text": "The results are reported in table 1."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-187",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-188",
"text": "**DISCUSSION**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-189",
"text": "Our training strategy results in a substantially more accurate model, outperforming both the supervised and semi-supervised baselines."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-190",
"text": "Surprisingly, its precision is higher than that of the model trained on 750 labeled examples, though admittedly it is achieved at a very different recall level."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-191",
"text": "The estimation of the model with our approach takes around one hour on a standard desktop PC, which is comparable to 40 minutes required to train the semi-supervised baseline."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-192",
"text": "In these experiments, we consider the problem of predicting the alignment between a text and the corresponding observable world state."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-194",
"text": "Even if it provided this information, the documents do not verbalize the state at the necessary granularity level to predict the field values."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-195",
"text": "For example, it is not possible to decide which bucket of the field sky cover the expression 'cloudy' refers to, as it has a relatively uniform distribution across 3 (out of 4) buckets."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-196",
"text": "The problem of predicting text-meaning alignments is interesting in itself, as the extracted alignments can be used in training a statistical generation system or information extractors; we also believe that evaluation on this problem is an appropriate test for the relative comparison of semantic analyzers' performance."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-197",
"text": "Additionally, note that the success of our weakly-supervised scenario indirectly suggests that the model is sufficiently accurate in predicting the semantics of an unlabeled text, as otherwise no useful information would be passed between semantically overlapping documents during learning and, consequently, there would be no improvement from sharing the state."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-198",
"text": "To confirm that the model trained by our approach indeed assigns new words to correct fields and records, we visualize top words for the field characterizing sky cover (table 2)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-199",
"text": "Note that the words \"sun\", \"cloudiness\" or \"gaps\" were not appearing in the labeled part of the data, but seem to be assigned to correct categories."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-200",
"text": "However, the correlation between rain and overcast weather, as also noted by Liang et al. (2009), results in the wrong assignment of rain-related words to the field value corresponding to very cloudy weather."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-202",
"text": "**RELATED WORK**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-203",
"text": "Probably the most relevant prior work is an approach to bootstrapping lexical choice of a generation system using a corpus of alternative passages (Barzilay and Lee, 2002); however, in their work all the passages were annotated with unaligned semantic expressions."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-204",
"text": "Also, they assumed that the passages are paraphrases of each other, which is stronger than our non-contradiction assumption."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-205",
"text": "Sentence and text alignment has also been considered in the related context of paraphrase extraction (see, e.g., Dolan et al., 2004; Barzilay and Lee, 2003), but this prior work did not focus on inducing or learning semantic representations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-206",
"text": "Similarly, in information extraction, there have been approaches for pattern discovery using comparable monolingual corpora (Shinyama and Sekine, 2003) but they generally focused only on discovery of a single pattern from a pair of sentences or texts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-207",
"text": "Radev (2000) considered types of potential relations between documents, including contradiction, and studied how this information can be exploited in NLP."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-208",
"text": "However, this work considered primarily multi-document summarization and question answering problems."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-209",
"text": "Another related line of research in machine learning is clustering or classification with constraints (Basu et al., 2004) , where supervision is given in the form of constraints."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-210",
"text": "Constraints declare which pairs of instances are required to be assigned to the same class (or required to be assigned to different classes)."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-211",
"text": "However, we are not aware of any previous work that generalized these methods to structured prediction problems, as trivial equality/inequality constraints are probably too restrictive, and a notion of consistency is required instead."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-213",
"text": "**SUMMARY AND FUTURE WORK**"
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-214",
"text": "In this work we studied the use of weak supervision in the form of non-contradictory relations between documents in learning semantic representations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-215",
"text": "We argued that this type of supervision encodes information which is hard to discover in an unsupervised way."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-216",
"text": "However, exact inference for groups of documents with overlapping semantic representation is generally prohibitively expensive, as the shared latent semantics introduces non-local dependencies between the semantic representations of individual documents."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-217",
"text": "To address this, we proposed a simple iterative inference algorithm."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-218",
"text": "We showed how it can be instantiated for the semantics-text correspondence model (Liang et al., 2009) and evaluated it on a dataset of weather forecasts."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-219",
"text": "Our approach resulted in an improvement over both the supervised baseline and traditional semi-supervised learning."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-220",
"text": "There are many directions we plan on investigating in the future for the problem of learning semantics with non-contradictory relations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-221",
"text": "A promising and challenging possibility is to consider models which induce full semantic representations."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-222",
"text": "Another direction would be to investigate a purely unsupervised set-up, though this would make evaluation of the resulting method much more complex."
},
{
"sent_id": "d1decbc03929cbf67a412d0a3a2a66-C001-223",
"text": "One potential alternative would be to replace the initial supervision with a set of posterior constraints (Graca et al., 2008) or generalized expectation criteria (McCallum et al., 2007) ."
}
],
"y": {
"@UNSURE@": {
"gold_contexts": [
[
"d1decbc03929cbf67a412d0a3a2a66-C001-4"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-36"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-129"
]
],
"cite_sentences": [
"d1decbc03929cbf67a412d0a3a2a66-C001-4",
"d1decbc03929cbf67a412d0a3a2a66-C001-36",
"d1decbc03929cbf67a412d0a3a2a66-C001-129"
]
},
"@BACK@": {
"gold_contexts": [
[
"d1decbc03929cbf67a412d0a3a2a66-C001-9"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-50"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-58"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-138"
]
],
"cite_sentences": [
"d1decbc03929cbf67a412d0a3a2a66-C001-9",
"d1decbc03929cbf67a412d0a3a2a66-C001-50",
"d1decbc03929cbf67a412d0a3a2a66-C001-58",
"d1decbc03929cbf67a412d0a3a2a66-C001-138"
]
},
"@USE@": {
"gold_contexts": [
[
"d1decbc03929cbf67a412d0a3a2a66-C001-44"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-95"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-151"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-161"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-174"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-175"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-218"
]
],
"cite_sentences": [
"d1decbc03929cbf67a412d0a3a2a66-C001-44",
"d1decbc03929cbf67a412d0a3a2a66-C001-95",
"d1decbc03929cbf67a412d0a3a2a66-C001-151",
"d1decbc03929cbf67a412d0a3a2a66-C001-161",
"d1decbc03929cbf67a412d0a3a2a66-C001-174",
"d1decbc03929cbf67a412d0a3a2a66-C001-175",
"d1decbc03929cbf67a412d0a3a2a66-C001-218"
]
},
"@EXT@": {
"gold_contexts": [
[
"d1decbc03929cbf67a412d0a3a2a66-C001-101"
]
],
"cite_sentences": [
"d1decbc03929cbf67a412d0a3a2a66-C001-101"
]
},
"@SIM@": {
"gold_contexts": [
[
"d1decbc03929cbf67a412d0a3a2a66-C001-151"
],
[
"d1decbc03929cbf67a412d0a3a2a66-C001-200"
]
],
"cite_sentences": [
"d1decbc03929cbf67a412d0a3a2a66-C001-151",
"d1decbc03929cbf67a412d0a3a2a66-C001-200"
]
},
"@DIF@": {
"gold_contexts": [
[
"d1decbc03929cbf67a412d0a3a2a66-C001-178"
]
],
"cite_sentences": [
"d1decbc03929cbf67a412d0a3a2a66-C001-178"
]
}
}
},
"ABC_ca1391f1f908fc081589b1a7dd8229_4": {
"x": [
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-142",
"text": "Then it processes the same sentences with the fine grammar."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-2",
"text": "Due to their origin in computer graphics, graphics processing units (GPUs) are highly optimized for dense problems, where the exact same operation is applied repeatedly to all data points."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-3",
"text": "Natural language processing algorithms, on the other hand, are traditionally constructed in ways that exploit structural sparsity."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-4",
"text": "Recently, Canny et al. (2013) presented an approach to GPU parsing that sacrifices traditional sparsity in exchange for raw computational power, obtaining a system that can compute Viterbi parses for a high-quality grammar at about 164 sentences per second on a mid-range GPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-5",
"text": "In this work, we reintroduce sparsity to GPU parsing by adapting a coarse-to-fine pruning approach to the constraints of a GPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-6",
"text": "The resulting system is capable of computing over 404 Viterbi parses per second, more than a 2x speedup, on the same hardware."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-7",
"text": "Moreover, our approach allows us to efficiently implement less GPU-friendly minimum-Bayes-risk inference, improving throughput for this more accurate algorithm from only 32 sentences per second unpruned to over 190 sentences per second with pruning, nearly a 6x speedup."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-10",
"text": "Because NLP models typically treat sentences independently, NLP problems have long been seen as \"embarrassingly parallel\": large corpora can be processed arbitrarily fast by simply sending different sentences to different machines."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-11",
"text": "However, recent trends in computer architecture, particularly the development of powerful \"general purpose\" GPUs, have changed the landscape even for problems that parallelize at the sentence level."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-12",
"text": "First, classic single-core processors and main memory architectures are no longer getting substantially faster over time, so speed gains must now come from parallelism within a single machine."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-13",
"text": "Second, compared to CPUs, GPUs devote a much larger fraction of their computational power to actual arithmetic."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-14",
"text": "Since tasks like parsing boil down to repeated read-multiply-write loops, GPUs should be many times more efficient in time, power, or cost."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-15",
"text": "The challenge is that GPUs are not a good fit for the kinds of sparse computations that most current CPU-based NLP algorithms rely on."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-16",
"text": "Recently, Canny et al. (2013) proposed a GPU implementation of a constituency parser that sacrifices all sparsity in exchange for the sheer horsepower that GPUs can provide."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-17",
"text": "Their system uses a grammar based on the Berkeley parser (Petrov and Klein, 2007) (which is particularly amenable to GPU processing), \"compiling\" the grammar into a sequence of GPU kernels that are applied densely to every item in the parse chart."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-18",
"text": "Together these kernels implement the Viterbi inside algorithm."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-19",
"text": "On a mid-range GPU, their system can compute Viterbi derivations at 164 sentences per second on sentences of length 40 or less (see timing details below)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-20",
"text": "In this paper, we develop algorithms that can exploit sparsity on a GPU by adapting coarse-to-fine pruning to a GPU setting."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-21",
"text": "On a CPU, pruning methods can give speedups of up to 100x."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-22",
"text": "Such extreme speedups over a dense GPU baseline currently seem unlikely because fine-grained sparsity appears to be directly at odds with dense parallelism."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-23",
"text": "However, in this paper, we present a system that finds a middle ground, where some level of sparsity can be maintained without losing the parallelism of the GPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-24",
"text": "We use a coarse-to-fine approach as in Petrov and Klein (2007) , but with only one coarse pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-25",
"text": "Figure 1 shows an overview of the approach: we first parse densely with a coarse grammar and then parse sparsely with the fine grammar, skipping symbols that the coarse pass deemed sufficiently unlikely."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-26",
"text": "Using this approach, we see a gain of more than 2x over the dense GPU implementation, resulting in overall speeds of up to 404 sentences per second."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-27",
"text": "For comparison, the publicly available CPU implementation of Petrov and Klein (2007) parses approximately 7 sentences per second per core on a modern CPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-28",
"text": "A further drawback of the dense approach in Canny et al. (2013) is that it only computes Viterbi parses."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-29",
"text": "As with other grammars with a parse/derivation distinction, the grammars of Petrov and Klein (2007) only achieve their full accuracy using minimum-Bayes-risk parsing, with improvements of over 1.5 F1 over best-derivation Viterbi parsing on the Penn Treebank (Marcus et al., 1993) ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-30",
"text": "To that end, we extend our coarse-to-fine GPU approach to computing marginals, along the way proposing a new way to exploit the coarse pass to avoid expensive log-domain computations in the fine pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-31",
"text": "We then implement minimum-Bayes-risk parsing via the max recall algorithm of Goodman (1996)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-32",
"text": "Without the coarse pass, the dense marginal computation is not efficient on a GPU, processing only 32 sentences per second."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-33",
"text": "However, our approach allows us to process over 190 sentences per second, almost a 6x speedup."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-35",
"text": "**A NOTE ON EXPERIMENTS**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-36",
"text": "We build up our approach incrementally, with experiments interspersed throughout the paper, and summarized in Tables 1 and 2 ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-37",
"text": "In this paper, we focus our attention on current-generation NVIDIA GPUs."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-38",
"text": "Many of the ideas described here apply to other GPUs (such as those from AMD), but some specifics will differ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-39",
"text": "All experiments are run with an NVIDIA GeForce GTX 680, a mid-range GPU that costs around $500 at time of writing."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-40",
"text": "Unless otherwise noted, all experiments are conducted on sentences of length \u2264 40 words, and we estimate times based on batches of 20K sentences."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-41",
"text": "We should note that our experimental condition differs from that of Canny et al. (2013): they evaluate on sentences of length \u2264 30."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-42",
"text": "Footnote 1: The implementation of Canny et al. (2013) cannot handle batches so large, and so we tested it on batches of 1200 sentences."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-43",
"text": "Our reimplementation is approximately the same speed for the same batch sizes."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-44",
"text": "For batches of 20K sentences, we used sentences from the training set."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-45",
"text": "We verified that there was no significant difference in speed for sentences from the training set and from the test set."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-46",
"text": "Furthermore, they use two NVIDIA GeForce GTX 690s, each of which is essentially a repackaging of two 680s, meaning that our system and experiments would run approximately four times faster on their hardware."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-47",
"text": "(This expected 4x factor is empirically consistent with the result of running their system on our hardware.)"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-49",
"text": "**SPARSITY AND CPUS**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-50",
"text": "One successful approach for speeding up constituency parsers has been to use coarse-to-fine inference (Charniak et al., 2006) ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-51",
"text": "In coarse-to-fine inference, we have a sequence of increasingly complex grammars G_l."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-52",
"text": "Typically, each successive grammar G_l is a refinement of the preceding grammar G_(l-1)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-53",
"text": "That is, for each symbol A_x in the fine grammar, there is some symbol A in the coarse grammar."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-54",
"text": "For instance, in a latent variable parser, the coarse grammar would have symbols like NP, VP, etc., and the fine pass would have refined symbols NP_0, NP_1, VP_4, and so on."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-55",
"text": "In coarse-to-fine inference, one applies the grammars in sequence, computing inside and outside scores."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-56",
"text": "Next, one computes (max) marginals for every labeled span (A, i, j) in a sentence."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-57",
"text": "These max marginals are used to compute a pruning mask for every span (i, j)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-58",
"text": "This mask is the set of symbols allowed for that span."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-59",
"text": "Then, in the next pass, one only processes rules that are licensed by the pruning mask computed at the previous level."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-60",
"text": "This approach works because a low quality coarse grammar can still reliably be used to prune many symbols from the fine chart without loss of accuracy."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-61",
"text": "Petrov and Klein (2007) found that over 98% of symbols can be pruned from typical charts using a simple X-bar grammar without any loss of accuracy."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-116",
"text": "**GRAMMAR CLUSTERS**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-62",
"text": "Thus, the vast majority of rules can be skipped, and therefore most computation can be avoided."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-63",
"text": "It is worth pointing out that although 98% of labeled spans can be skipped due to X-bar pruning, we found that only about 79% of binary rule applications can be skipped, because the unpruned symbols tend to be the ones with a larger grammar footprint."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-65",
"text": "**GPU ARCHITECTURES**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-66",
"text": "Unfortunately, the standard coarse-to-fine approach does not na\u00efvely translate to GPU architectures."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-67",
"text": "GPUs work by executing thousands of threads at once, but impose the constraint that large blocks of threads must be executing the same instructions in lockstep, differing only in their input data."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-68",
"text": "Figure 2 caption: ... Canny et al. (2013)'s system. The GPU and CPU communicate via a work queue, which ferries parse items from the CPU to the GPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-69",
"text": "Our system uses a coarse-to-fine approach, where the coarse pass computes a pruning mask that is used by the CPU when deciding which items to queue during the fine pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-70",
"text": "The original system of Canny et al. (2013) only used the fine pass, with no pruning."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-72",
"text": "Thus sparsely skipping rules and symbols will not save any work."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-73",
"text": "Indeed, it may actually slow the system down."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-74",
"text": "In this section, we provide an overview of GPU architectures, focusing on the details that are relevant to building an efficient parser."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-75",
"text": "The large number of threads that a GPU executes are packaged into blocks of 32 threads called warps."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-76",
"text": "All threads in a warp must execute the same instruction at every clock cycle: if one thread takes a branch and the others do not, then all threads in the warp must follow both code paths."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-77",
"text": "This situation is called warp divergence."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-78",
"text": "Because all threads execute all code paths that any thread takes, time can only be saved if an entire warp agrees to skip any particular branch."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-79",
"text": "NVIDIA GPUs have 8-15 processors called streaming multi-processors or SMs."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-80",
"text": "Each SM can process up to 48 different warps at a time: it interleaves the execution of each warp, so that when one warp is stalled another warp can execute."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-81",
"text": "Unlike threads within a single warp, the 48 warps do not have to execute the same instructions."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-82",
"text": "However, the memory architecture is such that they will be faster if they access related memory locations."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-83",
"text": "A further consideration is that the number of registers available to a thread in a warp is rather limited compared to a CPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-84",
"text": "On the 600 series, maximum occupancy can only be achieved if each thread uses at most 63 registers (Nvidia, 2008)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-85",
"text": "Registers are many times faster than variables located in thread-local memory, which is actually the same speed as global memory."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-87",
"text": "**ANATOMY OF A DENSE GPU PARSER**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-88",
"text": "This architecture environment puts very different constraints on parsing algorithms from a CPU environment."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-89",
"text": "Canny et al. (2013) proposed an implementation of a PCFG parser that sacrifices standard sparse methods like coarse-to-fine pruning, focusing instead on maximizing the instruction and memory throughput of the parser."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-90",
"text": "They assume that they are parsing many sentences at once, with throughput being more important than latency."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-91",
"text": "In this section, we describe their dense algorithm, which we take as the baseline for our work; we present it in a way that sets up the changes to follow."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-92",
"text": "At the top level, the CPU and GPU communicate via a work queue of parse items of the form (s, i, k, j), where s is an identifier of a sentence, i is the start of a span, k is the split point, and j is the end point."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-93",
"text": "Table 1 caption: Performance numbers for computing Viterbi inside charts on 20,000 sentences of length \u226440 from the Penn Treebank. All times are measured on an NVIDIA GeForce GTX 680."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-94",
"text": "'Reimpl' is our reimplementation of their approach."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-95",
"text": "Speedups are measured in reference to this reimplementation."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-96",
"text": "See Section 7 for discussion of the clustering algorithms and Section 6 for a description of the pruning methods."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-97",
"text": "The Canny et al. (2013) system is benchmarked on a batch size of 1200 sentences, the others on 20,000."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-99",
"text": "The GPU takes large numbers of parse items and applies the entire grammar to them in parallel."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-100",
"text": "These parse items are enqueued in order of increasing span size, blocking until all items of a given length are complete."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-101",
"text": "This approach is diagrammed in Figure 2 ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-102",
"text": "Because all rules are applied to all parse items, all threads are executing the same sequence of instructions."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-103",
"text": "Thus, there is no concern of warp divergence."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-105",
"text": "**GRAMMAR COMPILATION**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-106",
"text": "One important feature of Canny et al. (2013) 's system is grammar compilation."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-107",
"text": "Because registers are so much faster than thread-local memory, it is critical to keep as many variables in registers as possible."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-108",
"text": "One way to accomplish this is to unroll loops at compilation time."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-109",
"text": "Therefore, they inlined the iteration over the grammar directly into the GPU kernels (i.e. the code itself), which allows the compiler to more effectively use all of its registers."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-110",
"text": "However, register space is limited on GPUs."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-111",
"text": "Because the Berkeley grammar is so large, the compiler is not able to efficiently schedule all of the operations in the grammar, resulting in register spills."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-112",
"text": "Canny et al. (2013) found they had to partition the grammar into multiple different kernels."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-113",
"text": "We discuss this partitioning in more detail in Section 7."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-114",
"text": "However, in short, the entire grammar G is broken into multiple clusters G i where each rule belongs to exactly one cluster."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-117",
"text": "Figure 3: Schematic representation of the work queue and grammar clusters used in the fine pass of our work."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-118",
"text": "Here, the rules of the grammar are clustered by their coarse parent symbol."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-119",
"text": "We then have multiple work queues, with parse items only being enqueued if the span (i, j) allows that symbol in its pruning mask."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-120",
"text": "All in all, Canny et al. (2013) 's system is able to compute Viterbi charts at 164 sentences per second, for sentences up to length 40."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-121",
"text": "On larger batch sizes, our reimplementation of their approach is able to achieve 193 sentences per second on the same hardware."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-122",
"text": "(See Table 1 .) 6 Pruning on a GPU Now we turn to the algorithmic and architectural changes in our approach."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-123",
"text": "First, consider trying to directly apply the coarse-to-fine method sketched in Section 3 to the dense baseline described above."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-124",
"text": "The natural implementation would be for each thread to check if each rule is licensed before applying it."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-125",
"text": "However, we would only avoid the work of applying the rule if all threads in the warp agreed to skip it."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-126",
"text": "Since each thread in the warp is processing a different span (perhaps even from a different sentence), consensus from all 32 threads on any skip would be unlikely."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-127",
"text": "Another approach would be to skip enqueuing any parse item (s, i, k, j) where the pruning mask for any of (i, j), (i, k), or (k, j) is entirely empty (i.e. all symbols are pruned in this cell by the coarse grammar)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-128",
"text": "However, our experiments showed that only 40% of parse items are pruned in this manner."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-129",
"text": "Because of the overhead associated with creating pruning masks and the further overhead of GPU communication, we found that this method did not actually produce any time savings at all."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-130",
"text": "The result is a parsing speed of 185.5 sentences per second, as shown in Table 1 on the row labeled 'Reimpl' with 'Empty, Coarse' pruning."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-131",
"text": "Instead, we take advantage of the partitioned structure of the grammar and organize our computation around the coarse symbol set."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-132",
"text": "Recall that the baseline already partitions the grammar G into rule clusters G i to improve register sharing."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-133",
"text": "(See Section 7 for more on the baseline clustering.) We create a separate work queue for each partition."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-134",
"text": "We call each such queue a labeled work queue, and each one only queues items to which some rule in the corresponding partition applies."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-135",
"text": "We call the set of coarse symbols for a partition (and therefore the corresponding labeled work queue) a signature."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-136",
"text": "During parsing, we only enqueue items (s, i, k, j) to a labeled queue if two conditions are met."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-137",
"text": "First, the span (i, j)'s pruning mask must have a non-empty intersection with the signature of the queue."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-138",
"text": "Second, the pruning mask for the children (i, k) and (k, j) must be non-empty."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-139",
"text": "Once on the GPU, parse items are processed using the same style of compiled kernel as in Canny et al. (2013) ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-140",
"text": "Because the entire partition (though not necessarily the entire grammar) is applied to each item in the queue, we still do not need to worry about warp divergence."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-141",
"text": "At the top level, our system first computes pruning masks with a coarse grammar."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-143",
"text": "However, to the extent that the signatures are small, items can be selectively queued only to certain queues."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-144",
"text": "This approach is diagrammed in Figure 3 ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-145",
"text": "We tested our new pruning approach using an X-bar grammar as the coarse pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-146",
"text": "The resulting speed is 187.5 sentences per second, labeled in Table 1 as row labeled 'Reimpl' with 'Labeled, Coarse' pruning."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-147",
"text": "Unfortunately, this approach again does not produce a speedup relative to our reimplemented baseline."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-148",
"text": "To improve upon this result, we need to consider how the grammar clustering interacts with the coarse pruning phase."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-150",
"text": "**GRAMMAR CLUSTERING**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-151",
"text": "Recall that the rules in the grammar are partitioned into a set of clusters, and that these clusters are further divided into subclusters."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-152",
"text": "How can we best cluster and subcluster the grammar so as to maximize performance?"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-153",
"text": "A good clustering will group rules together that use the same symbols, since this means fewer memory accesses to read and write scores for symbols."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-154",
"text": "Moreover, we would like the time spent processing each of the subclusters within a cluster to be about the same."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-155",
"text": "We cannot move on to the next cluster until all threads from a cluster are finished, which means that the time a cluster takes is the amount of time taken by the longest-running subcluster."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-156",
"text": "Finally, when pruning, it is best if symbols that have the same coarse projection are clustered together."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-157",
"text": "That way, we are more likely to be able to skip a subcluster, since fewer distinct symbols need to be \"off\" for a parse item to be skipped in a given subcluster."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-158",
"text": "Canny et al. (2013) clustered symbols of the grammar using a sophisticated spectral clustering algorithm to obtain a permutation of the symbols."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-159",
"text": "Then the rules of the grammar were laid out in a (sparse) three-dimensional tensor, with one dimension representing the parent of the rule, one representing the left child, and one representing the right child."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-160",
"text": "They then split the cube into 6x2x2 contiguous \"major cubes,\" giving a partition of the rules into 24 clusters."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-161",
"text": "They then further subdivided these cubes into 2x2x2 minor cubes, giving 8 subclusters that executed in parallel."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-162",
"text": "Note that the clusters induced by these major and minor cubes need not be of similar sizes; indeed, they often are not."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-163",
"text": "Clustering using this method is labeled 'Reimplementation' in Table 1 ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-164",
"text": "The addition of pruning introduces further considerations."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-165",
"text": "First, we have a coarse grammar, with many fewer rules and symbols."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-166",
"text": "Second, we are able to skip a parse item for an entire cluster if that item's pruning mask does not intersect the cluster's signature."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-167",
"text": "Spreading symbols across clusters may be inefficient: if a parse item licenses a given symbol, we will have to enqueue that item to any queue that has the symbol in its signature, no matter how many other symbols are in that cluster."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-168",
"text": "Thus, it makes sense to choose a clustering algorithm that exploits the structure introduced by the pruning masks."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-169",
"text": "We use a very simple method: we cluster the rules in the grammar by coarse parent symbol."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-170",
"text": "When coarse symbols are extremely unlikely (and therefore have few corresponding rules), we merge their clusters to avoid the overhead of beginning work on clusters where little work has to be done."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-171",
"text": "4 In order to subcluster, we divide up rules among subclusters so that each subcluster has the same number of active parent symbols."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-172",
"text": "We found this approach to subclustering worked well in practice."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-173",
"text": "Clustering using this method is labeled 'Parent' in Table 1 . Now, when we use a coarse pruning pass, we are able to parse nearly 280 sentences per second, a 70% increase in parsing performance relative to Canny et al. (2013) 's system, and nearly 50% over our reimplemented baseline."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-174",
"text": "It turns out that this simple clustering algorithm produces relatively efficient kernels even in the unpruned case."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-175",
"text": "The unpruned Viterbi computations in a fine grammar using the clustering method of Canny et al. (2013) yields a speed of 193 sentences per second, whereas the same computation using coarse parent clustering has a speed of 159 sentences per second."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-176",
"text": "(See Table 1 .) This is not as efficient as Canny et al. (2013) 's highly tuned method, but it is still fairly fast, and much simpler to implement."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-178",
"text": "**PRUNING WITH FINER GRAMMARS**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-179",
"text": "The coarse to fine pruning approach of Petrov and Klein (2007) employs an X-bar grammar as its first pruning phase, but there is no reason why we cannot begin with a more complex grammar for our initial pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-180",
"text": "As Petrov and Klein (2007) have shown, intermediate-sized Berkeley grammars prune many more symbols than the X-bar system."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-181",
"text": "However, they are slower to parse with in a CPU context, and so they begin with an X-bar grammar."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-182",
"text": "Because of the overhead associated with transferring work items to GPU, using a very small grammar may not be an efficient use of the GPU's computational resources."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-183",
"text": "To that end, we tried computing pruning masks with one-split and twosplit Berkeley grammars."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-184",
"text": "The X-bar grammar can compute pruning masks at just over 1000 sentences per second, the 1-split grammar parses 858 sentences per second, and the 2-split grammar parses 526 sentences per second."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-185",
"text": "Because parsing with these grammars is still quite fast, we tried using them as the coarse pass instead."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-186",
"text": "As shown in Table 1 , using a 1-split grammar as a coarse pass allows us to produce over 400 sentences per second, a full 2x improvement over our original system."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-187",
"text": "Conducting a coarse pass with a 2-split grammar is somewhat slower, at a \"mere\" 343 sentences per second."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-188",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-189",
"text": "**MINIMUM BAYES RISK PARSING**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-190",
"text": "The Viterbi algorithm is a reasonably effective method for parsing."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-191",
"text": "However, many authors have noted that parsers benefit substantially from minimum Bayes risk decoding (Goodman, 1996; Simaan, 2003; Matsuzaki et al., 2005; Titov and Henderson, 2006; Petrov and Klein, 2007) ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-192",
"text": "MBR algorithms for parsing do not compute the best derivation, as in Viterbi parsing, but instead the parse tree that maximizes the expected count of some figure of merit."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-193",
"text": "For instance, one might want to maximize the expected number of correct constituents (Goodman, 1996) , or the expected rule counts (Simaan, 2003; Petrov and Klein, 2007) ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-194",
"text": "MBR parsing has proven especially useful in latent variable grammars."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-195",
"text": "Petrov and Klein (2007) showed that MBR trees substantially improved performance over Viterbi parses for latent variable grammars, earning up to 1.5F1."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-196",
"text": "Here, we implement the Max Recall algorithm of Goodman (1996) ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-197",
"text": "This algorithm maximizes the expected number of correct coarse symbols (A, i, j) with respect to the posterior distribution over parses for a sentence."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-198",
"text": "This particular MBR algorithm has the advantage that it is relatively straightforward to implement."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-199",
"text": "In essence, we must compute the marginal probability of each fine-labeled span \u00b5(A x , i, j), and then marginalize to obtain \u00b5(A, i, j)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-200",
"text": "Then, for each span (i, j), we find the best possible split point k that maximizes C(i, j) = \u00b5(A, i, j) + max k (C(i, k) + C(k, j))."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-201",
"text": "Parse extraction is then just a matter of following back pointers from the root, as in the Viterbi algorithm."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-202",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-203",
"text": "**COMPUTING MARGINAL PROBABILITIES**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-204",
"text": "The easiest way to compute marginal probabilities is to use the log space semiring rather than the Viterbi semiring, and then to run the inside and outside algorithms as before."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-205",
"text": "We should expect this algorithm to be at least a factor of two slower: the outside pass performs at least as much work as the inside pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-206",
"text": "Moreover, it typically has worse memory access patterns, leading to slower performance."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-207",
"text": "Without pruning, our approach does not handle these log domain computations well at all: we are only able to compute marginals for 32.1 sentences/second, more than a factor of 5 slower than our coarse pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-208",
"text": "To begin, log space addition requires significantly more operations than max, which is a primitive operation on GPUs."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-209",
"text": "Beyond the obvious consequence that executing more operations means more time taken, the sheer number of operations becomes too much for the compiler to handle."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-210",
"text": "Because the grammars are compiled into code, the additional operations are all inlined into the kernels, producing much larger kernels."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-211",
"text": "Indeed, in practice the compiler will often hang if we use the same size grammar clusters as we did for Viterbi."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-212",
"text": "In practice, we found there is an effective maximum of 2000 rules per kernel using log sums, while we can use more than 10,000 rules rules in a single kernel with Viterbi."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-213",
"text": "With coarse pruning, however, we can avoid much of the increased cost associated with log domain computations."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-214",
"text": "Because so many labeled spans are pruned, we are able to skip many of the grammar clusters and thus avoid many of the expensive operations."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-215",
"text": "Using coarse pruning and log domain calculations, our system produces MBR trees at a rate of 130.4 sentences per second, a four-fold increase."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-217",
"text": "**SCALING WITH THE COARSE PASS**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-218",
"text": "One way to avoid the expense of log domain computations is to use scaled probabilities rather than log probabilities."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-219",
"text": "Scaling is one of the folk techniques that are commonly used in the NLP community, but not generally written about."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-220",
"text": "Recall that floating point numbers are composed of a mantissa m and an exponent e, giving a number Table 2 : Performance numbers for computing max constituent (Goodman, 1996) trees on 20,000 sentences of length 40 or less from the Penn Treebank."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-221",
"text": "For convenience, we have copied our pruned Viterbi system's result."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-222",
"text": "f = m \u00b7 2 e ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-223",
"text": "When a float underflows, the exponent becomes too low to represent the available number of bits."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-224",
"text": "In scaling, floating point numbers are paired with an additional number that extends the exponent."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-225",
"text": "That is, the number is represented as f = f \u00b7 exp(s)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-226",
"text": "Whenever f becomes either too big or too small, the number is rescaled back to a less \"dangerous\" range by shifting mass from the exponent e to the scaling factor s."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-227",
"text": "In practice, one scale s is used for an entire span (i, j), and all scores for that span are rescaled in concert."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-228",
"text": "In our GPU system, multiple scores in any given span are being updated at the same time, which makes this dynamic rescaling tricky and expensive, especially since inter-warp communication is fairly limited."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-229",
"text": "We propose a much simpler static solution that exploits the coarse pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-230",
"text": "In the coarse pass, we compute Viterbi inside and outside scores for every span."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-231",
"text": "Because the grammar used in the coarse pass is a projection of the grammar used in the fine pass, these coarse scores correlate reasonably closely with the probabilities computed in the fine pass: If a span has a very high or very low score in the coarse pass, it typically has a similar score in the fine pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-232",
"text": "Thus, we can use the coarse pass's inside and outside scores as the scaling values for the fine pass's scores."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-233",
"text": "That is, in addition to computing a pruning mask, in the coarse pass we store the maximum inside and outside score in each span, giving two arrays of scores s I i,j and s O i,j ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-234",
"text": "Then, when applying rules in the fine pass, each fine inside score over a split span (i, k, j) is scaled to the appropriate s I i,j by multiplying the score by exp s I i,k + s I k,j \u2212 s I i,j , where s I i,k , s I k,j , s I i,j are the scaling factors for the left child, right child, and parent, respectively."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-235",
"text": "The outside scores are scaled analogously."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-236",
"text": "By itself, this approach works on nearly every sentence."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-237",
"text": "However, scores for approximately 0.5% of sentences overflow (sic)."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-238",
"text": "Because we are summing instead of maxing scores in the fine pass, the scaling factors computed using max scores are not quite large enough, and so the rescaled inside probabilities grow too large when multiplied together."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-239",
"text": "Most of this difference arises at the leaves, where the lexicon typically has more uncertainty than higher up in the tree."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-240",
"text": "Therefore, in the fine pass, we normalize the inside scores at the leaves to sum to 1.0."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-241",
"text": "5 Using this slight modification, no sentences from the Treebank under-or overflow."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-242",
"text": "We know of no reason why this same trick cannot be employed in more traditional parsers, but it is especially useful here: with this static scaling, we can avoid the costly log sums without introducing any additional inter-thread communication, making the kernels much smaller and much faster."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-243",
"text": "Using scaling, we are able to push our parser to 190.6 sentences/second for MBR extraction, just under half the speed of the Viterbi system."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-244",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-245",
"text": "**PARSING ACCURACIES**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-246",
"text": "It is of course important verify the correctness of our system; one easy way to do so is to examine parsing accuracy, as compared to the original Berkeley parser."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-247",
"text": "We measured parsing accuracy on sentences of length \u2264 40 from section 22 of the Penn Treebank."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-248",
"text": "Our Viterbi parser achieves 89.7 F1, while our MBR parser scores 91.0."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-249",
"text": "These results are nearly identical to the Berkeley parsers most comparable numbers: 89.8 for Viterbi, and 90.9 for their \"Max-Rule-Sum\" MBR algorithm."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-250",
"text": "These slight differences arise from the usual minor variation in implementation details."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-251",
"text": "In particular, we use one coarse pass instead of several, and a different MBR algorithm."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-252",
"text": "In addition, there are some differences in unary processing."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-253",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-254",
"text": "**ANALYZING SYSTEM PERFORMANCE**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-255",
"text": "In this section we attempt to break down how exactly our system is spending its time."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-256",
"text": "We do this in an effort to give a sense of how time is spent during computation on GPUs."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-257",
"text": "These timing numbers are computed using the built-in profiling capabilities of the programming environment."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-258",
"text": "As usual, profiles exhibit an observer effect, where the act of measuring the system changes the execution."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-259",
"text": "Nev- Table 3 : Time spent in the passes of our different systems, in seconds per 1000 sentences."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-260",
"text": "Pruning refers to using a 1-split grammar for the coarse pass."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-261",
"text": "ertheless, the general trends should more or less be preserved as compared to the unprofiled code."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-262",
"text": "To begin, we can compute the number of seconds needed to parse 1000 sentences."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-263",
"text": "(We use seconds per sentence rather than sentences per second because the former measure is additive.) The results are in Table 3 ."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-264",
"text": "In the case of pruned Viterbi, pruning reduces the amount of time spent in the fine pass by more than 4x, though half of those gains are lost to computing the pruning masks."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-265",
"text": "In Table 4 , we break down the time taken by our system into individual components."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-266",
"text": "As expected, binary rules account for the vast majority of the time in the unpruned Viterbi case, but much less time in the pruned case, with the total time taken for binary rules in the coarse and fine passes taking about 1/5 of the time taken by binaries in the unpruned version."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-267",
"text": "Queueing, which involves copying memory around within the GPU to process the individual parse items, takes a fairly consistent amount of time in all systems."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-268",
"text": "Overhead, which includes transport time between the CPU and GPU and other processing on the CPU, is relatively small for most system configurations."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-269",
"text": "There is greater overhead in the scaling system, because scaling factors are copied to the CPU between the coarse and fine passes."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-270",
"text": "A final question is: how many sentences per second do we need to process to saturate the GPU's processing power?"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-271",
"text": "We computed Viterbi parses of successive powers of 10, from 1 to 100,000 sentences."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-272",
"text": "6 In Figure 4 , we then plotted the throughput, in terms of number of sentences per second."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-273",
"text": "Throughput increases through parsing 10,000 sentences, and then levels off by the time it reaches 100,000 sentences."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-274",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-275",
"text": "**RELATED WORK**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-276",
"text": "Apart from the model of Canny et al. (2013) , there have been a few attempts at using GPUs in NLP contexts before."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-277",
"text": "Johnson (2011) and Yi et al. (2011) both had early attempts at porting parsing algorithms to the GPU."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-278",
"text": "However, they did not demonstrate significantly increased speed over a CPU implementation."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-279",
"text": "In machine translation, He et al. (2013) adapted algorithms designed for GPUs in the computational biology literature to speed up on-demand phrase table extraction."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-280",
"text": "----------------------------------"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-281",
"text": "**CONCLUSION**"
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-282",
"text": "GPUs represent a challenging opportunity for natural language processing."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-283",
"text": "By carefully designing within the constraints imposed by the architecture, we have created a parser that can exploit the same kinds of sparsity that have been developed for more traditional architectures."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-284",
"text": "One of the key remaining challenges going forward is confronting the kind of lexicalized sparsity common in other NLP models."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-285",
"text": "The Berkeley parser's grammars, by virtue of being unlexicalized, can be applied uniformly to all parse items."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-286",
"text": "The bilexical features needed by dependency models and lexicalized constituency models are not directly amenable to acceleration using the techniques we described here. Determining how to efficiently implement these kinds of models is a promising area for new research."
},
{
"sent_id": "ca1391f1f908fc081589b1a7dd8229-C001-287",
"text": "Our system is available as open-source at https://www.github.com/dlwh/puck."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"ca1391f1f908fc081589b1a7dd8229-C001-4"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-16"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-28"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-89"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-97"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-106"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-112"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-158"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-276"
]
],
"cite_sentences": [
"ca1391f1f908fc081589b1a7dd8229-C001-4",
"ca1391f1f908fc081589b1a7dd8229-C001-16",
"ca1391f1f908fc081589b1a7dd8229-C001-28",
"ca1391f1f908fc081589b1a7dd8229-C001-89",
"ca1391f1f908fc081589b1a7dd8229-C001-97",
"ca1391f1f908fc081589b1a7dd8229-C001-106",
"ca1391f1f908fc081589b1a7dd8229-C001-112",
"ca1391f1f908fc081589b1a7dd8229-C001-158",
"ca1391f1f908fc081589b1a7dd8229-C001-276"
]
},
"@DIF@": {
"gold_contexts": [
[
"ca1391f1f908fc081589b1a7dd8229-C001-41"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-42"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-69",
"ca1391f1f908fc081589b1a7dd8229-C001-70"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-120",
"ca1391f1f908fc081589b1a7dd8229-C001-121"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-173"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-175"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-176"
]
],
"cite_sentences": [
"ca1391f1f908fc081589b1a7dd8229-C001-41",
"ca1391f1f908fc081589b1a7dd8229-C001-42",
"ca1391f1f908fc081589b1a7dd8229-C001-70",
"ca1391f1f908fc081589b1a7dd8229-C001-120",
"ca1391f1f908fc081589b1a7dd8229-C001-173",
"ca1391f1f908fc081589b1a7dd8229-C001-175",
"ca1391f1f908fc081589b1a7dd8229-C001-176"
]
},
"@USE@": {
"gold_contexts": [
[
"ca1391f1f908fc081589b1a7dd8229-C001-42"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-139"
],
[
"ca1391f1f908fc081589b1a7dd8229-C001-175"
]
],
"cite_sentences": [
"ca1391f1f908fc081589b1a7dd8229-C001-42",
"ca1391f1f908fc081589b1a7dd8229-C001-139",
"ca1391f1f908fc081589b1a7dd8229-C001-175"
]
},
"@SIM@": {
"gold_contexts": [
[
"ca1391f1f908fc081589b1a7dd8229-C001-139"
]
],
"cite_sentences": [
"ca1391f1f908fc081589b1a7dd8229-C001-139"
]
}
}
},
"ABC_5203c1037fe57bd1b813c0bf1ff5c4_4": {
"x": [
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-202",
"text": "In this study, we address a task of describing a given phrase with its context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-2",
"text": "When reading a text, it is common to become stuck on unfamiliar words and phrases, such as polysemous words with novel senses, rarely used idioms, internet slang, or emerging entities."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-3",
"text": "If we humans cannot figure out the meaning of those expressions from the immediate local context, we consult dictionaries for definitions or search documents or the web to find other global context to help in interpretation."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-4",
"text": "Can machines help us do this work? Which type of context is more important for machines to solve the problem? To answer these questions, we undertake a task of describing a given phrase in natural language based on its local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-5",
"text": "To solve this task, we propose a neural description model that consists of two context encoders and a description decoder."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-6",
"text": "In contrast to the existing methods for non-standard English explanation (Ni and Wang, 2017) and definition generation (Noraset et al., 2017; Gadetsky et al., 2018) , our model appropriately takes important clues from both local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-7",
"text": "Experimental results on three existing datasets (including WordNet, Oxford and Urban Dictionaries) and a dataset newly created from Wikipedia demonstrate the effectiveness of our method over previous work."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-9",
"text": "****"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-10",
"text": "When reading a text, it is common to become stuck on unfamiliar words and phrases, such as polysemous words with novel senses, rarely used idioms, internet slang, or emerging entities."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-11",
"text": "If we humans cannot figure out the meaning of those expressions from the immediate local context, we consult dictionaries for definitions or search documents or the web to find other global context to help in interpretation."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-12",
"text": "Can machines help us do this work? Which type of context is more important for machines to solve the problem? To answer these questions, we undertake a task of describing a given phrase in natural language based on its local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-13",
"text": "To solve this task, we propose a neural description model that consists of two context encoders and a description decoder."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-14",
"text": "In contrast to the existing methods for non-standard English explanation (Ni and Wang, 2017) and definition generation (Noraset et al., 2017; Gadetsky et al., 2018) , our model appropriately takes important clues from both local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-15",
"text": "Experimental results on three existing datasets (including WordNet, Oxford and Urban Dictionaries) and a dataset newly created from Wikipedia demonstrate the effectiveness of our method over previous work."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-17",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-18",
"text": "When we read news text with emerging entities, text in unfamiliar domains, or text in foreign languages, we often encounter expressions (words or phrases) whose senses we do not understand."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-19",
"text": "In such cases, we may first try to figure out the meanings of those expressions by reading the surrounding words (local context) carefully."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-20",
"text": "Failing to do so, we may consult dictionaries, and in the case of polysemous words, choose an appropriate meaning based on the context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-21",
"text": "Learning novel word senses via dictionary definitions is known to be more effective than contextual guessing (Fraser, 1998; Chen, 2012) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-22",
"text": "However, very often, handcrafted dictionaries do not contain definitions of expressions that are rarely used or newly created."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-23",
"text": "Ultimately, we may need to read through the entire document or even search the web to find other occurrences of the expression (global context) so that we can guess its meaning."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-24",
"text": "Can machines help us do this work?"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-25",
"text": "Ni and Wang (2017) have proposed a task of generating a definition for a phrase given its local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-26",
"text": "However, they follow the strict assumption that the target phrase is newly emerged and there is only a single local context available for the phrase, which makes the task of generating an accurate and coherent definition difficult (perhaps as difficult as a human comprehending the phrase itself)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-27",
"text": "On the other hand, Noraset et al. (2017) attempted to generate a definition of a word from an embedding induced from massive text (which can be seen as global context)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-28",
"text": "This was followed by Gadetsky et al. (2018), who refer to a local context to disambiguate polysemous words by choosing relevant dimensions of their word embeddings."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-29",
"text": "Although these research efforts revealed that both local and global contexts are useful in generating definitions, none of these studies exploited both contexts directly to describe unknown phrases."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-30",
"text": "In this study, we tackle the task of describing (defining) a phrase when given its local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-31",
"text": "We present LOG-CaD, a neural description generator (Figure 1 ) to directly solve this task."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-32",
"text": "Given an unknown phrase without sense definitions, our model obtains a phrase embedding as its global context by composing word embeddings while also encoding the local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-33",
"text": "The model therefore combines both pieces of information to generate a natural language description."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-34",
"text": "Considering various applications where we need definitions of expressions, we evaluated our method with four datasets including WordNet (Noraset et al., 2017) for general words, the Oxford dictionary (Gadetsky et al., 2018) for polysemous words, Urban Dictionary (Ni and Wang, 2017) for rare idioms or slang, and a newlycreated Wikipedia dataset for entities."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-35",
"text": "Our contributions are as follows:"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-36",
"text": "\u2022 We propose a general task of defining unknown phrases given their contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-37",
"text": "This task is a generalization of three related tasks (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) and involves various situations where we need definitions of unknown phrases ( \u00a7 2)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-38",
"text": "\u2022 We propose a method for generating natural language descriptions for unknown phrases with local and global contexts ( \u00a7 3)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-39",
"text": "\u2022 As a benchmark to evaluate the ability of the models to describe entities, we build a largescale dataset from Wikipedia and Wikidata for the proposed task."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-40",
"text": "We release our dataset and the code 1 to promote the reproducibility of the experiments ( \u00a7 4)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-41",
"text": "\u2022 The proposed method achieves the state-ofthe-art performance on our new dataset and the three existing datasets used in the related studies (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018) ( \u00a7 5) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-42",
"text": "In this section, we define our task of describing a phrase in a specific context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-43",
"text": "Given an undefined phrase X trg = {x j , \u00b7 \u00b7 \u00b7 , x k } with its context X = {x 1 , \u00b7 \u00b7 \u00b7 , x I }, the task is to generate a natural language description Y."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-44",
"text": "Here, X trg can be a word or a short phrase and is included in X. Y is a concrete and concise definition-like sentence that describes X trg ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-45",
"text": "For example, given a phrase \"sonic boom\" with its context \"the shock wave may be caused by sonic boom or by explosion,\" the task is to generate a description such as \"sound created by an object moving fast.\" If the given context has been changed to \"this is the first official tour to support the band's latest studio effort, 2009's Sonic Boom,\" then the appropriate output would be \"album by Kiss.\""
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-46",
"text": "The process of description generation can be modeled with a conditional language model as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-47",
"text": "**CONTEXT-AWARE DESCRIPTION GENERATOR**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-48",
"text": "In this section, we describe our idea of utilizing local and global contexts in the description generation task, and present the details of our model."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-50",
"text": "**LOCAL & GLOBAL CONTEXTS**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-51",
"text": "When we find an unfamiliar phrase in text and it is not defined in dictionaries, how can we humans come up with its meaning? As discussed in Section 1, we may first try to figure out the meaning of the phrase from the immediate context, and then read through the entire document or search the web to understand implicit information behind the text."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-52",
"text": "In this paper, we refer to the explicit contextual information included in a given sentence with the target phrase (i.e., the X in Eq. (1)) as \"local context,\" and the implicit contextual information in massive text as \"global context.\" While both local and global contexts are crucial for humans to understand unfamiliar phrases, are they also useful for machines to generate descriptions?"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-53",
"text": "To verify this idea, we propose to incorporate both local and global contexts to describe an unknown phrase."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-54",
"text": "Figure 1 shows an illustration of our LOG-CaD model."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-55",
"text": "Similarly to the standard encoder-decoder model with attention (Bahdanau et al., 2015; Luong and Manning, 2016) , it has a context encoder and a description decoder."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-56",
"text": "The challenge here is that the decoder needs to be conditioned not only on the local context, but also on its global context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-57",
"text": "To incorporate the different types of contexts, we propose to use a gate function similar to Noraset et al. (2017) to dynamically control how the global and local contexts influence the description."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-59",
"text": "**PROPOSED MODEL**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-60",
"text": "Local & global context encoders We first describe how to model local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-61",
"text": "Given a sentence X and a phrase X trg , a bidirectional LSTM (Gers et al., 1999) encoder generates a sequence of continuous vectors H = {h 1 , \u00b7 \u00b7 \u00b7 , h I } as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-62",
"text": "where x i is the word embedding of word x i ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-63",
"text": "In addition to the local context, we also utilize the global context obtained from massive text."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-64",
"text": "This can be achieved by feeding a phrase embedding x trg to initialize the decoder (Noraset et al., 2017 ) as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-65",
"text": "Here, the phrase embedding x trg is calculated by simply summing up all the embeddings of the words that constitute the phrase X trg ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-66",
"text": "Note that we use a randomly-initialized vector if no pre-trained embedding is available for the words in X trg ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-67",
"text": "Description decoder Using the local and global contexts, a description decoder computes the conditional probability of a description Y with Eq."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-68",
"text": "(1), which can be approximated with another LSTM as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-69",
"text": "where s t is a hidden state of the decoder LSTM (s 0 = 0), and y t\u22121 is a jointly-trained word embedding of the previous output word y t\u22121 ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-70",
"text": "In what follows, we explain each equation in detail."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-71",
"text": "Attention on local context Considering the fact that the local context can be relatively long (e.g., around 20 words on average in our Wikipedia dataset introduced in Section 4), it is hard for the decoder to focus on important words in local contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-72",
"text": "In order to deal with this problem, the ATTENTION(\u00b7) function in Eq. (5) decides which words in the local context X to focus on at each time step."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-73",
"text": "d t is computed with an attention mechanism (Luong and Manning, 2016) as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-74",
"text": "where U h and U s are matrices that map the encoder and decoder hidden states into a common space, respectively."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-76",
"text": "**USE OF CHARACTER INFORMATION**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-77",
"text": "In order to capture the surface information of X trg , we construct character-level CNNs (Eq. (6)) following (Noraset et al., 2017) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-78",
"text": "Note that the input to the CNNs is a sequence of words in X trg , which are concatenated with the special character \"_\", such as \"sonic_boom.\" Following Noraset et al. (2017), we set the CNN kernels to lengths 2-6 and sizes 10, 30, 40, 40, 40, respectively, with a stride of 1 to obtain a 160-dimensional vector c trg ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-80",
"text": "**GATE FUNCTION TO CONTROL LOCAL & GLOBAL CONTEXTS**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-81",
"text": "In order to capture the interaction between the local and global contexts, we adopt a GATE(\u00b7) function (Eq. (7)) which is similar to Noraset et al. (2017) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-82",
"text": "The GATE(\u00b7) function updates the LSTM output s t to s t depending on the global context x trg , local context d t , and character-level information c trg as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-83",
"text": "where \u03c3(\u00b7), \u2299, and ; denote the sigmoid function, element-wise multiplication, and vector concatenation, respectively."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-84",
"text": "W * and b * are weight matrices and bias terms, respectively."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-85",
"text": "Here, the update gate z t controls how much the original hidden state s t is to be changed, and the reset gate r t controls how much the information from f t contributes to word generation at each time step."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-87",
"text": "**WIKIPEDIA DATASET**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-88",
"text": "Our goal is to let machines describe unfamiliar words and phrases, such as polysemous words, rarely used idioms, or emerging entities."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-89",
"text": "Among the three existing datasets, WordNet and Oxford dictionary mainly target the words but not phrases, thus are not perfect test beds for this goal."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-90",
"text": "On the other hand, although the Urban Dictionary dataset contains descriptions of rarely-used phrases, the domain of its targeted words and phrases is limited to Internet slang."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-91",
"text": "In order to confirm that our model can generate the description of entities as well as polysemous words and slang, we constructed a new dataset for context-aware phrase description generation from Wikipedia 2 and Wikidata 3 which contain a wide variety of entity descriptions with contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-92",
"text": "The overview of the data extraction process is shown in Figure 2 ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-93",
"text": "Each entry in the dataset consists of (1) a phrase, (2) its description, and (3) context (a sentence)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-94",
"text": "For preprocessing, we applied Stanford Tokenizer 4 to the descriptions of Wikidata items and the articles in Wikipedia."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-95",
"text": "Next, we removed phrases in parentheses from the Wikipedia articles, since they tend to be paraphrasing in other languages and work as noise."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-96",
"text": "To obtain the contexts of each item in Wikidata, we extracted the sentence which has a link referring to the item through all the first paragraphs of Wikipedia articles and replaced the phrase of the links with a special token [TRG] ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-97",
"text": "Wikidata items with no description or no contexts are ignored."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-98",
"text": "This utilization of links makes it possible to resolve the ambiguity of words and phrases in a sentence without human annotations, which is a major advantage of using Wikipedia."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-99",
"text": "Note that we used only links whose anchor texts are identical to the title of the Wikipedia articles, since the users of Wikipedia sometimes link mentions to related articles."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-101",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-102",
"text": "We evaluate our method by applying it to describe words in WordNet 5 (Miller, 1995) and Oxford Dictionary, 6 phrases in Urban Dictionary 7 and Wikipedia/Wikidata."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-103",
"text": "For all of these datasets, a given word or phrase has an inventory of senses with corresponding definitions and usage examples."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-104",
"text": "These definitions are regarded as groundtruth descriptions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-105",
"text": "Datasets To evaluate our model on the word description task on WordNet, we followed Noraset et al. (2017) and extracted data from WordNet using the dict-definition 9 toolkit."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-106",
"text": "Each entry in the data consists of three elements: (1) a word, (2) its definition, and (3) a usage example of the word."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-107",
"text": "Table 2 : Domains, expressions to be described, and the coverage of pre-trained embeddings of the expressions to be described."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-108",
"text": "We split this dataset to obtain Train, Validation, and Test sets."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-109",
"text": "If a word has multiple definitions/examples, we treat them as different entries."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-110",
"text": "Note that the words are mutually exclusive across the three sets."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-111",
"text": "The only difference between our dataset and theirs is that we extract the tuples only if the words have their usage examples in WordNet."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-112",
"text": "Since not all entries in WordNet have usage examples, our dataset is a small subset of Noraset et al. (2017) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-113",
"text": "In addition to WordNet, we use the Oxford Dictionary following Gadetsky et al. (2018) , the Urban Dictionary following Ni and Wang (2017) and our Wikipedia dataset described in the previous section."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-114",
"text": "Table 1 and Table 2 show the properties and statistics of the four datasets, respectively."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-115",
"text": "To simulate a situation in a real application where we might not have access to global context for the target phrases, we did not train domainspecific word embeddings on each dataset."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-116",
"text": "Instead, for all of the four datasets, we use the same pre-trained CBOW vectors trained on the Google news corpus as global context, following previous work (Noraset et al., 2017; Gadetsky et al., 2018). (Table 3 : Hyperparameters of the models.)"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-117",
"text": "If the expression to be described consists of multiple words, its phrase embedding is calculated by simply summing up all the CBOW vectors of words in the phrase, such as \"sonic\" and \"boom.\" (See Figure 1) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-118",
"text": "If pre-trained CBOW embeddings are unavailable, we instead use a special [UNK] vector (which is randomly initialized with a uniform distribution) as word embeddings."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-119",
"text": "Note that our pre-trained embeddings only cover 26.79% of the words in the expressions to be described in our Wikipedia dataset, while it covers all words in WordNet dataset (See Table 2 )."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-120",
"text": "Even if no reliable word embeddings are available, all models can capture the character information through character-level CNNs (See Figure 1) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-122",
"text": "**MODELS**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-123",
"text": "We implemented four methods: (1) Global (Noraset et al., 2017), (2) Local (Ni and Wang, 2017) with CNN, (3) I-Attention (Gadetsky et al., 2018), and (4) our proposed model, LOG-CaD. The Global model is our reimplementation of the best model (S + G + CH) in Noraset et al. (2017)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-124",
"text": "It can access the global context of a phrase to be described, but has no ability to read the local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-125",
"text": "The Local model is the reimplementation of the best model (dual encoder) in Ni and Wang (2017)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-126",
"text": "In order to make a fair comparison of the effectiveness of local and global contexts, we slightly modify the original implementation by Ni and Wang (2017); as the character-level encoder in the Local model, we adopt CNNs that are exactly the same as those in the other two models instead of the original LSTMs."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-127",
"text": "The I-Attention is our reimplementation of the best model (S + I-Attention) in Gadetsky et al. (2018)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-128",
"text": "Similar to our model, it uses both local and global contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-129",
"text": "Unlike our model, however, it does not use character information to predict descriptions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-130",
"text": "Also, it cannot directly use the local context to predict the words in descriptions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-131",
"text": "This is because the I-Attention model indirectly uses the local context only to disambiguate the phrase embedding x trg as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-132",
"text": "Here, the FFNN(\u00b7) function is a feed-forward neural network that maps the encoded local contexts h i to another space."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-203",
"text": "In what follows, we explain existing tasks that are related to our work."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-133",
"text": "The mapped local contexts are then averaged over the length of the sentence X to obtain a representation of the local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-134",
"text": "This is followed by a linear layer and a sigmoid function to obtain the soft binary mask m which can filter out the unrelated information included in global context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-135",
"text": "Finally, the disambiguated phrase embedding x trg is then used to update the decoder hidden state as"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-136",
"text": "All four models (Table 3) are implemented with the PyTorch framework (Ver. 1.0.0)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-137",
"text": "Table 4 shows the BLEU (Papineni et al., 2002) scores of the output descriptions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-138",
"text": "We can see that the LOG-CaD model consistently outperforms the three baselines in all four datasets."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-139",
"text": "This result indicates that using both local and global contexts helps describe the unknown words/phrases correctly."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-140",
"text": "While the I-Attention model also uses local and global contexts, its performance was always lower than that of the LOG-CaD model."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-141",
"text": "This result shows that using the local context to predict descriptions is more effective than using it to disambiguate the meanings in the global context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-142",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-143",
"text": "**AUTOMATIC EVALUATION**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-144",
"text": "In particular, the low BLEU scores of the Global and I-Attention models on the Wikipedia dataset suggest that it is necessary to learn to ignore the noisy information in the global context if the coverage of pre-trained word embeddings is extremely low (see the third and fourth rows in Table 2)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-145",
"text": "We suspect that the Urban Dictionary task is too difficult and the results are unreliable considering its extremely low BLEU scores and high ratio of unknown tokens in generated descriptions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-146",
"text": "q-lets and co. is a filipino and english informative children 's show on q in the philippines ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-147",
"text": "she was a founding producer of the cbc radio one show \" q \" ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-148",
"text": "the q awards are the uk 's annual music awards run by the music magazine \" q \" ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-149",
"text": "charles fraser-smith was an author and one-time missionary who is widely credited as being the inspiration for ian fleming 's james bond quartermaster q ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-150",
"text": "Reference: philippine tv network / canadian radio show / british music magazine / fictional character from james bond"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-151",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-152",
"text": "**MANUAL EVALUATION**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-153",
"text": "To compare the proposed model and the strongest baseline in Table 4 (i.e., the Local model), we performed a human evaluation on our dataset."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-154",
"text": "We randomly selected 100 samples from the test set of the Wikipedia dataset and asked three native English speakers to rate the output descriptions from 1 to 5 points as: 1) completely wrong or self-definition, 2) correct topic with wrong information, 3) correct but incomplete, 4) small details missing, 5) correct."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-155",
"text": "The averaged scores are reported in Table 5 ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-156",
"text": "Pair-wise bootstrap resampling test (Koehn, 2004) for the annotated scores has shown that the superiority of LOG-CaD over the Local model is statistically significant (p < 0.01)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-157",
"text": "Qualitative Analysis: Table 6 shows an example of a word in WordNet, while Table 7 and Table 8 show examples of entities in Wikipedia."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-158",
"text": "When comparing the two datasets, the quality of the generated descriptions for the Wikipedia dataset is significantly better than that for the WordNet dataset."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-159",
"text": "The main reason for this result is that the training data of the Wikipedia dataset is 64 times larger than that of the WordNet dataset (see Table 1)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-160",
"text": "For all examples in the three tables, the Global model can only generate a single description for each input word/phrase because it cannot access any local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-161",
"text": "In the WordNet dataset, only the I-Attention and LOG-CaD models can successfully generate the concept of \"remove\" given context #2."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-162",
"text": "This result suggests that considering both local and global contexts is essential to generating correct descriptions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-163",
"text": "In our Wikipedia dataset, both the Local and LOG-CaD models can describe the word/phrase considering its local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-164",
"text": "For example, both the Local and LOG-CaD models could generate \"american\" in the description for \"daniel o'neill\" given \"united states\" in context #1, while they could generate \"british\" given \"belfast\" in context #2."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-165",
"text": "A similar trend can also be observed in Table 8 , where LOG-CaD could generate the locational expressions such as \"philippines\" and \"british\" given the different contexts."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-166",
"text": "On the other hand, the I-Attention model could not describe the two phrases in a way that takes the local contexts into account."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-167",
"text": "We will present an analysis of this phenomenon in the next section."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-169",
"text": "**DISCUSSION**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-170",
"text": "In this section, we present analyses on how the local and global contexts contribute to the description generation task."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-171",
"text": "First, we discuss how the local context helps the models to describe a phrase."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-172",
"text": "Then, we analyze the impact of global context under the situation where local context is unreliable."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-174",
"text": "**HOW DO THE MODELS UTILIZE LOCAL CONTEXTS?**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-175",
"text": "Local context helps us (1) disambiguate polysemous words and (2) infer the meanings of unknown expressions."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-176",
"text": "Can machines also utilize the local context?"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-177",
"text": "In this section, we discuss the two roles of local context in description generation."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-178",
"text": "Considering that pre-trained word embeddings are obtained from word-level co-occurrences in massive text, the more senses a word has, the more information is mixed up into its single vector."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-179",
"text": "While Gadetsky et al. (2018) designed the I-Attention model to filter out unrelated meanings in the global context given the local context, they did not discuss the impact that the number of senses has on the performance of definition generation."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-180",
"text": "To understand the influence of the ambiguity of the phrases to be defined on generation performance, we analyzed our Wikipedia dataset."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-181",
"text": "Figure 3(a) shows that the description generation task becomes harder as the phrases to be described become more ambiguous."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-182",
"text": "In particular, when a phrase has an extremely large number of senses (i.e., #senses \u2265 4), the performance of the Global model drops significantly."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-183",
"text": "This result indicates that the local context is necessary to disambiguate the meanings in the global context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-184",
"text": "As shown in Table 2, a large proportion of the phrases in our Wikipedia dataset include unknown words (i.e., only 26.79% of the words in the phrases have pre-trained embeddings)."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-185",
"text": "This fact indicates that the global context in this dataset is not fully reliable."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-186",
"text": "Our next question is then: how does the lack of information from the global context affect the performance of phrase description? Figure 3(b) shows the impact of unknown words in the phrases to be described on the performance."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-187",
"text": "As we can see from the result, the advantage of the LOG-CaD and Local models over the Global and I-Attention models becomes larger as the number of unknown words increases."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-188",
"text": "This result suggests that we need to fully utilize local contexts especially in practical applications where the phrases to be defined have many unknown words."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-189",
"text": "Here, Figure 3(b) also shows a counterintuitive phenomenon: BLEU scores increase as the ratio of unknown words in a phrase increases."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-190",
"text": "This is mainly because unknown phrases tend to be person names such as writers, actors, or movie directors."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-191",
"text": "Since these entities have fewer ambiguities in categories, they can be described in extremely short sentences that are easy for all four models to decode (e.g., \"finnish writer\" or \"american television producer\")."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-193",
"text": "**HOW DO THE MODELS UTILIZE GLOBAL CONTEXTS?**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-195",
"text": "As discussed earlier, local contexts are important to describe unknown expressions, but how about global contexts?"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-196",
"text": "Assuming a situation where we cannot obtain much information from the local context (e.g., inferring the meaning of \"boswellia\" from the short local context \"Here is a boswellia\"), global contexts should be essential to understand the meaning."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-197",
"text": "To confirm this hypothesis, we analyzed the impact of the length of local contexts on BLEU scores."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-198",
"text": "Figure 3(c) shows that when the length of local context is extremely short (l \u2264 10), the LOG-CaD model becomes much stronger than the Local model."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-199",
"text": "This result indicates that not only local context but also global context help models describe the meanings of phrases."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-200",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-201",
"text": "**RELATED WORK**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-204",
"text": "Our task is closely related to word sense disambiguation (WSD) (Navigli, 2009) , which identifies a pre-defined sense for the target word with its context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-205",
"text": "Although we can use it to solve our task by retrieving the definition sentence for the sense identified by WSD, it requires a substantial amount of training data to handle a different set of meanings of each word, and cannot handle words (or senses) which are not registered in the dictionary."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-206",
"text": "Although some studies have attempted to detect novel senses of words for given contexts (Erk, 2006; Lau et al., 2014) , they do not provide definition sentences."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-207",
"text": "Our task avoids these difficulties in WSD by directly generating descriptions for phrases or words."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-208",
"text": "It also allows us to flexibly tailor a fine-grained definition for the specific context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-209",
"text": "Paraphrasing (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010) (or text simplification (Siddharthan, 2014) ) can be used to rephrase words with unknown senses."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-210",
"text": "However, the targets of paraphrase acquisition are words/phrases with no specified context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-211",
"text": "Although a few studies (Connor and Roth, 2007; Max, 2009; Max et al., 2012) consider subsentential (context-sensitive) paraphrases, they do not intend to obtain a definition-like description as a paraphrase of a word."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-212",
"text": "Recently, Noraset et al. (2017) introduced a task of generating a definition sentence of a word from its pre-trained embedding."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-213",
"text": "Since their task does not take local contexts of words as inputs, their method cannot generate an appropriate definition for a polysemous word for a specific context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-214",
"text": "To cope with this problem, Gadetsky et al. (2018) proposed a definition generation method that works with polysemous words in dictionaries."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-215",
"text": "They presented a model that utilizes local context to filter out the unrelated meanings from a pre-trained word embedding in a specific context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-216",
"text": "While their method uses the local context to disambiguate the meanings that are mixed up in word embeddings, the information from local contexts cannot be utilized if the pre-trained embeddings are unavailable or unreliable."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-217",
"text": "On the other hand, our method can fully utilize the local context through an attentional mechanism, even if reliable word embeddings are unavailable."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-218",
"text": "The most related work to this paper is Ni and Wang (2017) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-219",
"text": "Focusing on non-standard English phrases, they proposed a model to generate the explanations solely from local context."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-220",
"text": "They followed the strict assumption that the target phrase was newly emerged and there was only a single local context available, which made the task of generating an accurate and coherent definition difficult."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-221",
"text": "Our proposed task and model are more general and practical than those of Ni and Wang (2017): (1) we use Wikipedia, which includes expressions from various domains, and (2) our model takes advantage of global contexts if available."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-222",
"text": "Our task of describing phrases with their contexts is a generalization of these three tasks (Noraset et al., 2017; Ni and Wang, 2017; Gadetsky et al., 2018), and the proposed method utilizes both the local and global contexts of the expression in question."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-223",
"text": "----------------------------------"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-224",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-225",
"text": "This paper sets up a task of generating a natural language description for an unknown phrase with a specific context, aiming to help us acquire unknown word senses when reading text."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-226",
"text": "We approached this task by using a variant of encoder-decoder models that captures the given local context with the encoder and global contexts with the decoder, initialized by the target phrase's embedding induced from massive text."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-227",
"text": "We performed experiments on three existing datasets and one newly built from Wikipedia and Wikidata."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-228",
"text": "The experimental results confirmed that the local and global contexts complement one another and are both essential; global contexts are crucial when local contexts are short and vague, while the local context is important when the target phrase is polysemous, rare, or unseen."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-229",
"text": "As future work, we plan to modify our model to use multiple contexts in text to improve the quality of descriptions, considering the \"one sense per discourse\" hypothesis (Gale et al., 1992) ."
},
{
"sent_id": "5203c1037fe57bd1b813c0bf1ff5c4-C001-230",
"text": "We will release the newly built Wikipedia dataset and the experimental code for the academic and industrial communities at https://github.com/shonosuke/ishiwatari-naacl2019 to facilitate the reproducibility of our results and their use in various application contexts."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-27"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-212"
]
],
"cite_sentences": [
"5203c1037fe57bd1b813c0bf1ff5c4-C001-27",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-212"
]
},
"@USE@": {
"gold_contexts": [
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-34"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-77"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-78"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-105"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-112"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-123"
]
],
"cite_sentences": [
"5203c1037fe57bd1b813c0bf1ff5c4-C001-34",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-77",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-78",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-105",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-112",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-123"
]
},
"@SIM@": {
"gold_contexts": [
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-57"
],
[
"5203c1037fe57bd1b813c0bf1ff5c4-C001-81"
]
],
"cite_sentences": [
"5203c1037fe57bd1b813c0bf1ff5c4-C001-57",
"5203c1037fe57bd1b813c0bf1ff5c4-C001-81"
]
}
}
},
"ABC_edfce6b99a4804c0908b39ea38d707_4": {
"x": [
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-63",
"text": "**COMPARISON OF PA AND GPA**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-2",
"text": "Most recent approaches to bilingual dictionary induction find a linear alignment between the word vector spaces of two languages."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-3",
"text": "We show that projecting the two languages onto a third, latent space, rather than directly onto each other, while equivalent in terms of expressivity, makes it easier to learn approximate alignments."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-4",
"text": "Our modified approach also allows for supporting languages to be included in the alignment process, to obtain an even better performance in low resource settings."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-7",
"text": "Several papers recently demonstrated the potential of very weakly supervised or entirely unsupervised approaches to bilingual dictionary induction (BDI) (Barone, 2016; Artetxe et al., 2017; Zhang et al., 2017; Conneau et al., 2018), the task of identifying translational equivalents across two languages."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-8",
"text": "These approaches cast BDI as a problem of aligning monolingual word embeddings."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-9",
"text": "Pairs of monolingual word vector spaces can be aligned without any explicit crosslingual supervision, solely based on their distributional properties (for an adversarial approach, see Conneau et al. (2018) )."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-10",
"text": "Alternatively, weak supervision can be provided in the form of numerals (Artetxe et al., 2017) or identically spelled words ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-11",
"text": "Successful unsupervised or weakly supervised alignment of word vector spaces would remove much of the data bottleneck for machine translation and push horizons for cross-lingual learning ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-12",
"text": "In addition to an unsupervised approach to aligning monolingual word embedding spaces with adversarial training, Conneau et al. (2018) present a supervised alignment algorithm that assumes a gold-standard seed dictionary and performs Procrustes Analysis (Sch\u00f6nemann, 1966) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-13",
"text": "show that this approach, weakly supervised with a dictionary seed of crosslingual homographs, i.e. words with identical spelling across source and target language, is superior to the completely unsupervised approach."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-14",
"text": "We therefore focus on weakly-supervised Procrustes Analysis (PA) for BDI here."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-15",
"text": "The implementation of PA in Conneau et al. (2018) yields notable improvements over earlier work on BDI, even though it learns a simple linear transform of the source language space into the target language space."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-16",
"text": "Seminal work in supervised alignment of word vector spaces indeed reported superior performance with linear models as compared to non-linear neural approaches (Mikolov et al., 2013) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-17",
"text": "The relative success of the simple linear approach can be explained in terms of isomorphism across monolingual semantic spaces, an idea that receives support from cognitive science (Youn et al., 1999)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-18",
"text": "Word vector spaces are not perfectly isomorphic, however, as shown by , who use a Laplacian graph similarity metric to measure this property."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-19",
"text": "In this work, we show that projecting both source and target vector spaces into a third space (Faruqui and Dyer, 2014) , using a variant of PA known as Generalized Procrustes Analysis (Gower, 1975) , makes it easier to learn the alignment between two word vector spaces, as compared to the single linear transform used in Conneau et al. (2018) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-20",
"text": "Contributions We show that Generalized Procrustes Analysis (GPA) (Gower, 1975) , a method that maps two vector spaces into a third, latent space, is superior to PA for BDI, e.g., improving the state-of-the-art on the widely used EnglishItalian dataset (Dinu et al., 2015) from a P@1 score of 66.2% to 67.6%."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-21",
"text": "We compare GPA to PA (two vector spaces are isomorphic if there is an invertible linear transformation from one to the other)"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-22",
"text": "on aligning English with five languages representing different language families (Arabic, German, Spanish, Finnish, and Russian) , showing that GPA consistently outperforms PA."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-23",
"text": "GPA also allows for the use of additional support languages, aligning three or more languages at a time, which can boost performance even further."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-24",
"text": "We present experiments with multi-source GPA on an additional five low-resource languages from the same language families (Hebrew, Afrikaans, Occitan, Estonian, and Bosnian), using their bigger counterparts as support languages."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-25",
"text": "Our code is publicly available."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-28",
"text": "**PROCRUSTES ANALYSIS**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-29",
"text": "Procrustes Analysis is a graph matching algorithm, used in most mapping-based approaches to BDI ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-30",
"text": "Given two graphs, E and F, Procrustes finds the linear transformation T that minimizes the following objective: arg min_T ||TE - F||,"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-31",
"text": "thus minimizing the distance between each pair of corresponding rows of the transformed space TE and F."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-32",
"text": "We build E and F based on a seed dictionary of N entries, such that each pair of corresponding rows in E and F, (e_n, f_n) for n = 1, ..., N, consists of the embeddings of a translational pair of words."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-33",
"text": "In order to preserve the monolingual quality of the transformed embeddings, it is beneficial to use an orthogonal matrix T for cross-lingual mapping purposes (Xing et al., 2015; Artetxe et al., 2017) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-34",
"text": "Conveniently, the orthogonal Procrustes problem has an analytical solution based on Singular Value Decomposition (SVD): T = UV^T, where U\u03a3V^T = SVD(FE^T)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-36",
"text": "**GENERALIZED PROCRUSTES ANALYSIS**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-37",
"text": "Generalized Procrustes Analysis (Gower, 1975) is a natural extension of PA that aligns k vector spaces at a time."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-38",
"text": "Figure 1: Visualization of the difference between PA, which maps the source space directly onto the target space, and GPA, which aligns both source and target spaces with a third, latent space constructed by averaging over the two language spaces."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-39",
"text": "Given embedding spaces E_1, ..., E_k, GPA minimizes the following objective:"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-40",
"text": "arg min_{T_1, ..., T_k} \u03a3_{i<j} ||T_i E_i - T_j E_j||"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-41",
"text": "For an analytical solution to GPA, we compute the average of the embedding matrices E_{1...k} after transformation by T_{1...k}: G = (1/k) \u03a3_{i=1}^{k} T_i E_i,"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-42",
"text": "thus obtaining a latent space, G, which captures properties of each of E_{1...k}, and potentially additional properties emerging from the combination of the spaces."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-43",
"text": "On the very first iteration, prior to having any estimates of T_{1...k}, we set G = E_i for a random i. The new values of T_{1...k} are then obtained as T_i = U_i V_i^T, where U_i \u03a3_i V_i^T = SVD(G E_i^T)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-44",
"text": "Since G is dependent on T_{1...k} (see Eq. 4), the solution of GPA cannot be obtained in a single step (as is the case with PA), but rather requires that we loop over subsequent updates of G (Eq. 4) and T_{1...k} (Eq. 5) for a fixed number of steps or until satisfactory convergence."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-45",
"text": "We observed little improvement when performing more than 100 updates, so we fixed that as the number of updates."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-46",
"text": "Notice that for k = 2 and with the orthogonality constraint in place, the objective for Generalized Procrustes Analysis (Eq. 3) reduces to that for simple Procrustes (Eq. 1):"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-47",
"text": "arg min_{T_1, T_2} ||T_1 E_1 - T_2 E_2|| = arg min_T ||T E_1 - E_2||, where T = T_2^T T_1."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-48",
"text": "Here T itself is also orthogonal."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-49",
"text": "Yet, the solution found with GPA may differ from the one found with simple Procrustes: the former maps E_1 and E_2 onto a third space, G, which is the average of the two spaces, instead of mapping E_1 directly onto E_2."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-50",
"text": "To understand the consequences of this difference, consider a single step of the GPA algorithm where, after updating G according to Eq. 4, we recompute T_1 using SVD."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-51",
"text": "Due to the fact that G is partly based on E_1, these two spaces are bound to be more similar to each other than E_1 and E_2 are."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-52",
"text": "Finding a good mapping between E_1 and G, i.e. a good setting of T_1, should therefore be easier than finding a good mapping from E_1 to E_2 directly."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-53",
"text": "In this sense, by mapping E_1 onto G, rather than onto E_2 (as PA would do), we are solving an easier problem and reducing the chance of a poor solution."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-55",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-56",
"text": "In our experiments, we generally use the same hyper-parameters as used in Conneau et al. (2018) , unless otherwise stated."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-57",
"text": "When extracting dictionaries for the bootstrapping procedure, we use cross-domain similarity local scaling (CSLS; see Conneau et al. (2018) for details) as a metric for ranking candidate translation pairs, and we only use the ones that rank higher than 15,000."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-58",
"text": "We do not put any restrictions on the initial seed dictionaries, based on cross-lingual homographs: those vary considerably in size, from 17,012 for Hebrew to 85,912 for Spanish."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-59",
"text": "Instead of doing a single training epoch, however, we run PA and GPA with early stopping, until five epochs of no improvement in the validation criterion as used in Conneau et al. (2018) , i.e. the average cosine similarity between the top 10,000 most frequent words in the source language and their candidate translations as induced with CSLS."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-60",
"text": "Our metric is Precision at k\u00d7100 (P@k), i.e. percentage of correct translations retrieved among the k nearest neighbor of the source words in the test set (Conneau et al., 2018) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-61",
"text": "Unless stated otherwise, experiments were carried out using the publicly available pre-trained fastText embeddings, trained on Wikipedia data, 5 and bilingual dictionaries-consisting of 5000 and 1500 unique word pairs for training and testing, respectively-provided by Conneau et al. (2018) 6 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-64",
"text": "High resource setting We first present a direct comparison of PA and GPA on BDI from English to five fairly high-resource languages: Arabic, Finnish, German, Russian, and Spanish."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-65",
"text": "The Wikipedia corpus sizes for these languages are reported in Table 1 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-66",
"text": "Results are listed in Table 2 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-67",
"text": "GPA improves over PA consistently for all five languages."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-68",
"text": "Most notably, for Finnish it scores 2.5% higher than PA."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-69",
"text": "Common benchmarks For a more extensive comparison with previous work, we include results on English-{Finnish, German, Italian} dictionaries used in Conneau et al. (2018) and Artetxe et al. (2018) -the second best approach to BDI known to us, which also uses Procrustes Analysis."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-70",
"text": "We conduct experiments using three forms of supervision: gold-standard seed dictionaries of 5000 word pairs, cross-lingual homographs, and numerals."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-71",
"text": "We use train and test bilingual dictionaries from Dinu et al. (2015) for English-Italian and from Artetxe et al. (2017) for English-{Finnish, German}. Following Conneau et al. (2018) , we report results with a set of CBOW embeddings trained on the WaCky corpus (Barone, 2016) , and with Wikipedia embeddings."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-72",
"text": "Results are reported in Table 3 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-73",
"text": "We observe that GPA outperforms PA consistently on Italian and German with the WaCky embeddings, and on all languages with the Wikipedia embeddings."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-74",
"text": "Notice that once more, Finnish benefits the most from a switch to GPA in the Wikipedia embeddings setting, but it is also the only language to suffer from that switch in the WaCky setup."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-75",
"text": "Interestingly, PA fails to learn a good alignment for Italian and Finnish when supervised with numerals, while GPA performs comparably with numerals as with other forms of supervision."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-76",
"text": "Conneau et al. (2018) point out that improvement from subsequent iterations of PA is generally negligible, which we also found to be the case."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-77",
"text": "We also found that while PA learned a slightly poorer alignment than GPA, it did so faster."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-78",
"text": "With our criterion for early stopping, PA converged in 5 to 10 epochs, while GPA did so within 10 to 15 epochs 7 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-79",
"text": "In the case of Italian and Finnish alignment supervised by numerals, PA converged in 8 and 5 epochs, respectively, but clearly got stuck in local minima."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-80",
"text": "GPA took considerably longer to converge: 27 and 74 epochs, respectively, but also managed to find a reasonable alignment between the language spaces."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-81",
"text": "This points to an important difference in the learning properties of PA and GPA-7 Notice that one epoch with both PA and GPA takes less than half a minute, so the slower convergence of GPA is in no way prohibitive."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-82",
"text": "unlike PA, GPA has a two-fold objective of opposing forces: it is simultaneously aligning each embedding space to two others, thus pulling it in different directions."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-83",
"text": "This characteristic helps GPA avoid particularly adverse local minima."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-85",
"text": "**MULTI-SUPPORT GPA**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-86",
"text": "In these experiments, we perform GPA with k = 3, including a third, linguistically-related supporting language in the alignment process."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-87",
"text": "To best evaluate the benefits of the multi-support setup, we use as targets five low-resource languages: Afrikaans, Bosnian, Estonian, Hebrew and Occitan (see statistics in Table 1 ) 8 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-88",
"text": "Three-way dictionaries, both the initial one (consisting of crosslingual homographs) and subsequent ones, are obtained by assuming transitivity between two-way dictionaries: if two pairs of words, e m -e n and e me l , are deemed translational pairs, then we consider e n -e m -e l a translational triple."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-89",
"text": "We report results in Table 4 with multi-support GPA in two settings: a three-way alignment trained for 10 epochs (MGPA), and a three-way alignment trained for 10 epochs, followed by 5 8 Occitan dictionaries were not available from the MUSE project, so we extracted a test dictionary of 911 unique word pairs from an English-Occitan lexicon available at http://www.occitania.online."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-90",
"text": "fr/aqui.comenca.occitania/en-oc.html."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-91",
"text": "epochs of two-way fine-tuning (MGPA+)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-92",
"text": "We observe that at least one of our new methods always improves over PA."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-93",
"text": "GPA always outperforms PA and it also outperforms the multi-support settings on three out of five languages."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-94",
"text": "Yet, results for Hebrew and especially for Occitan, are best in a multi-support setting-we thus mostly focus on these two languages in the following subsections."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-95",
"text": "MGPA has variable performance: for four languages precision suffers from the addition of a third language, e.g. compare 38.93 for Hebrew with GPA to 37.53 with MGPA; for Occitan, however, the most challenging target language in our experiments, MGPA beats all other approaches by a large margin: 17.12 with GPA versus 23.81 with MGPA."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-96",
"text": "This pattern relates to the effect a supporting language has on the size of the induced seed dictionary."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-97",
"text": "Figure 2 visualizes the progression of dictionary size during training with and without a supporting language for Occitan and Hebrew."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-98",
"text": "The portion of the purple curves to the left of the dotted line corresponds to MGPA: notice how the curves are swapped between the two plots."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-99",
"text": "Spanish actually provides support for the English-Occitan alignment, by contributing to an increasingly larger seed dictionary-this provides better anchoring for the learned alignment."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-100",
"text": "Having Arabic as support for English-Hebrew alignment, on the other hand, causes a considerable reduction in the size of the seed dictionaries, giving GPA less anchor points and thus damaging the learned alignment."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-101",
"text": "The variable effect of a supporting language on dictionary size, and consequently on alignment precision, relates to the quality of alignment of the support language with English and with the target language: referring back to Table 2 , English-Spanish, for example, scores at 81.93, while English-Arabic precision is 35.33."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-102",
"text": "Notice that despite our linguisticallymotivated choice to pair related low-and highresource languages for multi-support training, it is not necessarily the case that those should align especially well, as that would also depend on practical factors, such as embeddings quality and training corpora similarity ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-103",
"text": "MGPA+ applies two-way fine-tuning on top of MGPA."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-104",
"text": "This leads to a drop in precision for Occitan, due to the removed support of Spanish and the consequent reduction in size of the induced dictionary (observe the fall of the purple curve after the dotted line in Figure 2 (a) )."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-105",
"text": "Meanwhile, precision for Hebrew is highest with MGPA+ out of all methods included."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-106",
"text": "While Arabic itself is not a good support language, its presence in the threeway MGPA alignment seems to have resulted in a good initialization for the English-Hebrew twoway fine-tuning, thus helping the model reach an even better minimum along the loss curve."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-107",
"text": "If word vector spaces were completely isomorphic, the introduction of a third (or fourth) space, and the application of GPA, would lead to the same alignment as the alignment learned by PA, projecting the source language E into the target space F ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-108",
"text": "This follows from the transitivity of isomorphism: if E is isomorphic to G and G is isomorphic to F , then E is isomorphic to F , via the isomorphism obtained by composing the isomorphisms from E to G and from G to F ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-109",
"text": "So why do we observe improvements? have shown that word vector spaces are often relatively far from being isomorphic, and approximate isomorphism is not transitive. What we observe therefore appears to be an instance of the Poincar\u00e9 Paradox (Poincar\u00e9, 1902) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-110",
"text": "While GPA is not more expressive than PA, it may still be easier to align each monolingual space to an intermediate space, as the latter constitutes a more similar target (albeit a nonisomorphic one); for example, the loss landscape of aligning a source and target language word embedding with an average of the two may be much smoother than when aligning source directly with target."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-111",
"text": "Our work is in this way similar in spirit to Raiko et al. (2012) , who use simple linear transforms to make learning of non-linear problems easier."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-112",
"text": "Table 5 lists example translational pairs as induced from alignments between English and Bosnian, learned with PA, GPA and MGPA+."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-113",
"text": "For interpretability, we query the system with words in Bosnian and seek their nearest neighbors in the English embedding space."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-114",
"text": "P@1 over the BosnianEnglish test set of Conneau et al. (2018) is 31.33, 34.80, and 34.47 for PA, GPA and MGPA+, respectively."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-115",
"text": "The examples are grouped in three blocks, based on success and failure of PA and GPA alignments to retrieve a valid translation."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-117",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-118",
"text": "It appears that a lot of the difference in performance between PA and GPA concerns morphologically related words, e.g. campaign v. campaigning, dialogue v. dialogues, merger v. merging etc."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-119",
"text": "These word pairs are naturally confusing to a BDI system, due to their related meaning and possibly identical syntactic properties (e.g. merger and merging can both be nouns)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-120",
"text": "Another common mistake we observed in mismatches between PA and GPA predictions, was the wrong choice between two antonyms, e.g. stable v. unstable and visible v. unnoticeable."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-121",
"text": "Distributional word representations are known to suffer from limitations with respect to capturing opposition of meaning (Mohammad et al., 2013) , so it is not surprising that both PA-and GPA-learned alignments can fail in making this distinction."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-122",
"text": "While it is not the case that GPA always outperforms PA on a queryto-query basis in these rather challenging cases, on average GPA appears to learn an alignment more robust to subtle morphological and semantic differences between neighboring words."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-123",
"text": "Still, there are cases where PA and GPA both choose the wrong morphological variant of an otherwise correctly identified target word, e.g. transformation v. transformations."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-124",
"text": "Notice that many of the queries for which both algorithms fail, do result in a nearly synonymous word being predicted, e.g. participant for attendee, earns for gets, footage for video, etc."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-125",
"text": "This serves to show that the learned alignments are generally good, but they are not sufficiently precise."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-126",
"text": "This issue can have two sources: a suboptimal method for learning the alignment and/or a ceiling effect on how good of an alignment can be obtained, within the space of orthogonal linear transformations."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-128",
"text": "**PROCRUSTES FIT**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-129",
"text": "To explore the latter issue and to further compare the capabilities of PA and GPA, we perform a Procrustes fit test, where we learn alignments in a fully supervised fashion, using the test dictionaries of Conneau et al. (2018) 9 for both training and evaluation 10 ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-130",
"text": "In the ideal case, i.e. if the subspaces defined by the words in the seed dictionaries are perfectly alignable, this setup should result in precision of 100%."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-131",
"text": "We found the difference between the fit with PA and GPA to be negligible, 0.20 on average across all 10 languages (5 low-resource and 5 high-source languages)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-132",
"text": "It is not surprising that PA and GPA results in almost equivalent fits-the two algorithms both rely on linear transformations, i.e. they are equal in expressivity."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-133",
"text": "As pointed out earlier, the superiority of GPA over PA stems from its Tables 2 and 4."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-134",
"text": "more robust learning procedure, not from higher expressivity."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-135",
"text": "Figure 3 thus only visualizes the Procrustes fit as obtained with GPA."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-136",
"text": "The Procrustes fit of all languages is indeed lower than 100%, showing that there is a ceiling on the linear alignability between the source and target spaces."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-137",
"text": "We attribute this ceiling effect to variable degrees of linguistic difference between source and target language and possibly to differences in the contents of cross-lingual Wikipedias (recall that the embeddings we use are trained on Wikipedia corpora)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-138",
"text": "An apparent correlation emerges between the Procrustes fit and precision scores for weakly-supervised GPA, i.e. between the circles and the xs in the plot."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-139",
"text": "The only language that does not conform here is Occitan, which has the highest Procrustes fit and the lowest GPA precision out of all languages, but this result has an important caveat: our dictionary for Occitan comes from a different source and is much smaller than all the other dictionaries."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-140",
"text": "For some of the high-resource languages, weakly-supervised GPA takes us rather close to the best possible fit: e.g. for Spanish GPA scores 81.93%, and the Procrustes fit is 90.07%."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-141",
"text": "While low-resource languages do not necessarily have lower Procrustes fits than high-resource ones (compare Estonian and Finnish, for example), the gap between the Procrustes fit and GPA precision is on average much higher within low-resource languages than within high-resource ones (52.46 11 compared to 25.47, respectively)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-142",
"text": "This finding is in line with the common understanding that the quality of distributional word vectors depends on the amount of data available-we can infer from these results that suboptimal embeddings results in suboptimal cross-lingual alignments."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-143",
"text": "11 Even if we leave Occitan out as an outlier, this number is still rather high: 47.10."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-145",
"text": "**MULTILINGUALITY**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-146",
"text": "Finally, we note that there may be specific advantages to including support languages for which large monolingual corpora exist, as those should, theoretically, be easier to align with English (also a high-resource language): variance in vector directionality, as studied in Mimno and Thompson (2017) , increases with corpus size, so we would expect embedding spaces learned from corpora comparable in size, to also be more similar in shape."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-148",
"text": "**RELATED WORK**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-149",
"text": "Bilingual embeddings Many diverse crosslingual word embedding models have been proposed ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-150",
"text": "The most popular kind learns a linear transformation from source to target language space (Mikolov et al., 2013) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-151",
"text": "In most recent work, this mapping is constrained to be orthogonal and solved using Procrustes Analysis (Xing et al., 2015; Artetxe et al., 2017 Artetxe et al., , 2018 Conneau et al., 2018; Lu et al., 2015) ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-152",
"text": "The approach most similar to ours, Faruqui and Dyer (2014) , uses canonical correlation analysis (CCA) to project both source and target language spaces into a third, joint space."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-153",
"text": "In this setup, similarly to GPA, the third space is iteratively updated, such that at timestep t, it is a product of the two language spaces as transformed by the mapping learned at timestep t \u2212 1."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-154",
"text": "The objective that drives the updates of the mapping matrices is to maximize the correlation between the projected embeddings of translational equivalents (where the latter are taken from a gold-standard seed dictionary)."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-155",
"text": "In their analysis of the transformed embedding spaces, Faruqui and Dyer (2014) focus on the improved quality of monolingual embedding spaces themselves and do not perform evaluation of the task of BDI."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-156",
"text": "They find that the transformed monolingual spaces better encode the difference between synonyms and antonyms: in the original monolingual English space, synonyms and antonyms of beautiful are all mapped close to each other in a mixed fashion; in the transformed space the synonyms of beautiful are mapped in a cluster around the query word and its antonyms are mapped in a separate cluster."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-157",
"text": "This finding is in line with our observation that GPA-learned alignments are more precise in distinguishing between synonyms and antonyms."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-158",
"text": "Multilingual embeddings Several approaches extend existing methods to space alignments between more than two languages (Ammar et al., 2016; ."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-159",
"text": "Smith et al. (2017) project all vocabularies into the English space."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-160",
"text": "In some cases, multilingual training has been shown to lead to improvements over bilingually trained embedding spaces (Vuli\u0107 et al., 2017) , similar to our findings."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-162",
"text": "**CONCLUSION**"
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-163",
"text": "Generalized Procrustes Analysis yields benefits over simple Procrustes Analysis for Bilingual Dictionary Induction, due to its smoother loss landscape."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-164",
"text": "In line with earlier research, benefits from the introduction of a common latent space seem to relate to a better distinction of synonyms and antonyms, and of syntactically-related words."
},
{
"sent_id": "edfce6b99a4804c0908b39ea38d707-C001-165",
"text": "GPA also offers the possibility to include multilingual support for inducing a larger seed dictionary during training, which better anchors the English to target language alignment in low-resource scenarios."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"edfce6b99a4804c0908b39ea38d707-C001-7"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-12"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-15"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-151"
]
],
"cite_sentences": [
"edfce6b99a4804c0908b39ea38d707-C001-7",
"edfce6b99a4804c0908b39ea38d707-C001-12",
"edfce6b99a4804c0908b39ea38d707-C001-15",
"edfce6b99a4804c0908b39ea38d707-C001-151"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"edfce6b99a4804c0908b39ea38d707-C001-9"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-114"
]
],
"cite_sentences": [
"edfce6b99a4804c0908b39ea38d707-C001-9",
"edfce6b99a4804c0908b39ea38d707-C001-114"
]
},
"@DIF@": {
"gold_contexts": [
[
"edfce6b99a4804c0908b39ea38d707-C001-19"
]
],
"cite_sentences": [
"edfce6b99a4804c0908b39ea38d707-C001-19"
]
},
"@USE@": {
"gold_contexts": [
[
"edfce6b99a4804c0908b39ea38d707-C001-56"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-57"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-59"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-60"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-61"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-69"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-71"
],
[
"edfce6b99a4804c0908b39ea38d707-C001-129"
]
],
"cite_sentences": [
"edfce6b99a4804c0908b39ea38d707-C001-56",
"edfce6b99a4804c0908b39ea38d707-C001-57",
"edfce6b99a4804c0908b39ea38d707-C001-59",
"edfce6b99a4804c0908b39ea38d707-C001-60",
"edfce6b99a4804c0908b39ea38d707-C001-61",
"edfce6b99a4804c0908b39ea38d707-C001-69",
"edfce6b99a4804c0908b39ea38d707-C001-71",
"edfce6b99a4804c0908b39ea38d707-C001-129"
]
},
"@SIM@": {
"gold_contexts": [
[
"edfce6b99a4804c0908b39ea38d707-C001-56"
]
],
"cite_sentences": [
"edfce6b99a4804c0908b39ea38d707-C001-56"
]
}
}
},
"ABC_163770df02c1110edc60e7cac90ad2_4": {
"x": [
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-182",
"text": "We average this score over all images."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-2",
"text": "Abstract."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-3",
"text": "Humans refer to objects in their environments all the time, especially in dialogue with other people."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-4",
"text": "We explore generating and comprehending natural language referring expressions for objects in images."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-5",
"text": "In particular, we focus on incorporating better measures of visual context into referring expression models and find that visual comparison to other objects within an image helps improve performance significantly."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-6",
"text": "We also develop methods to tie the language generation process together, so that we generate expressions for all objects of a particular category jointly."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-7",
"text": "Evaluation on three recent datasets -RefCOCO, RefCOCO+, and RefCOCOg 1 , shows the advantages of our methods for both referring expression generation and comprehension."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-10",
"text": "In this paper, we look at the dual-tasks of generating and comprehending natural language expressions referring to particular objects within an image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-11",
"text": "Referring to objects is a natural and common experience."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-33",
"text": "Similar to these recent methods, we also take a deep learning approach to referring expression generation and comprehension."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-58",
"text": "We still use such work to motivate the architecture of our pipeline."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-12",
"text": "For example, one often uses referring expressions in everyday speech to indicate a particular person or object to a co-observer, e.g., \"the man in the red hat\" or \"the book on the table\"."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-13",
"text": "Computational models to generate and comprehend such expressions would have applicability to human-computer interactions, especially for agents such as robots, interacting with humans in the physical world."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-14",
"text": "Successful models will have to connect both recognition of visual attributes of objects and effective natural language generation to compose useful expressions for dialogue."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-15",
"text": "A broader version of this latter goal was considered in 1975 by Paul Grice who introduced maxims describing cooperative conversation between people [11] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-16",
"text": "These maxims, called the Gricean Maxims, describe a set of rational principles for natural language dialogue interactions."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-17",
"text": "The 4 maxims are: quality (try to be truthful), quantity (make your contribution as informative as you can, giving as much information as is needed but no more), relevance (be relevant and pertinent to the discussion), and manner (be as clear, brief, and orderly as possible, avoiding obscurity and ambiguity)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-18",
"text": "RefCOCO: 1.)giraffe)on)left 2.)first)giraffe)on)left RefCOCO+: 1.)giraffe)with)lowered)head 2.)giraffe)head)down RefCOCOg: 1.)an)adult)giraffe)scratching)its) back)with)its)horn 2.)giraffe)hugging)another)giraffe Fig. 1 . Example referring expressions for the giraffe outlined in green from three referring expression datasets (described in Sec 4)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-19",
"text": "For the purpose of referring to objects in complex real world scenes these maxims suggest that a well formed expression should be informative, succinct, and unambiguous."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-20",
"text": "The last point is especially necessary for referring to objects in the real world since we often find multiple objects of a particular category situated together in a scene."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-21",
"text": "For example, consider the image in Fig. 1 which contains three giraffes."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-22",
"text": "We should not refer to the target (outlined in green) as \"the spotted giraffe\" since all of the giraffes are spotted and this would create an ambiguous reference."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-23",
"text": "More reasonably we should refer to the target as \"the giraffe with lowered head\" to differentiate this giraffe from the other two."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-24",
"text": "The task of referring expression generation (REG) has been studied since the 1970s [40, 22, 30, 7] , with most work focused on studying particular aspects of the problem in some relatively constrained datasets."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-25",
"text": "Recent approaches have pushed this work toword more realistic scenarios."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-26",
"text": "Kazemzadeh et al [19] introduced the first large-scale dataset of referring expressions for objects in real-world natural images, collected in a two-player game."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-27",
"text": "This dataset was originally collected on top of the 20,000 image ImageCleft dataset, but has recently been extended to images from the MSCOCO collection."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-28",
"text": "We make use of the RefCOCO and RefCOCO+ datasets in our work along with another recently collected referring expression dataset, released by Google, denoted in our paper as RefCOCOg [26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-29",
"text": "The most relevant work to ours is Mao et al [26] which introduced the first deep learning approach to REG."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-30",
"text": "In this model, the authors use a Convolutional Neural Network (CNN) [36] model pre-trained on ImageNet [34] to extract visual features from a bounding box around the target object and from the entire image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-31",
"text": "They use these features plus 5 features encoding the target object location and size as input to a Long Short-term Memory (LSTM) [10] network that generates expressions."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-32",
"text": "Additionally, they apply the same model to the inverse problem of referring expression comprehension where the input is a natural language expression and the goal is to localize the referred object in the image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-34",
"text": "However, while they use a generic model for object context -CNN features for the entire image containing the target object -we take a more focused approach to encode object comparisons."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-35",
"text": "These object comparisons are critical for producing an unambiguous referring expression since one must consider visual characteristics of similar objects during generation in order to select the most distinct aspects for description."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-36",
"text": "This mimics the process that a human would use to compose a good referring expression for an object, e.g. look at the object, look at other relevant objects, and generate an expression that could be used by a co-observer to unambiguously pick out the target object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-37",
"text": "In addition, for the referring expression generation task, we introduce a method to tie the language generation process together for all depicted objects of the same type."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-38",
"text": "This helps generate a good set of expressions such that the expressions differentiate between objects but are also complementary."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-39",
"text": "For example, we never want to generate the exact same expression for two objects in an image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-40",
"text": "Alternatively, if we call one object \"the red ball\" then we may desire the expression for the other object to follow the same generation pattern, i.e., \"the blue ball\"."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-41",
"text": "Our experimental evaluations show that these visual and linguistic comparisons improve performance over previous state of the art."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-42",
"text": "In the rest of our paper, we first describe related work (Sec 2)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-43",
"text": "We then describe our improvements to models for referring expression generation and comprehension (Sec 3), describe 3 referring expression datasets (Sec 4), and perform experimental evaluations on several model variations (Sec 5)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-44",
"text": "Finally we present our conclusions (Sec 6)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-46",
"text": "**RELATED WORK**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-47",
"text": "Referring expressions are closely related to the more general problem of modeling the connection between images and descriptive language."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-48",
"text": "In recent years, this has been studied in the image captioning task [6, 37, 13, 31, 23] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-49",
"text": "There, the aim is to condition the generation of language on the visual information from an image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-50",
"text": "The wide range of aspects of an image that could be described, and the variety of words that could be chosen for a particular description complicate studying image captioning."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-51",
"text": "Our study of referring expressions is partially motivated by focusing on description for a specific, and more easily evaluated, communication goal."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-52",
"text": "Although our task is somewhat different, we borrow machinery from state of the art caption generation [3, 39, 27, 5, 18, 21, 41] using LSTM to generate captions based on CNN features computed on an input image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-53",
"text": "Three recent approaches for referring expression generation [26] and comprehension [14, 33] also take a deep learning approach."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-54",
"text": "However, we add visual object comparisons and tie together language generation for multiple objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-55",
"text": "Referring expression generation has been studied for many years [40, 22, 30] in linguistics and natural language processing."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-56",
"text": "These works were limited by data collection and insufficient computer vision algorithms."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-57",
"text": "Together Amazon Mechanical Turk and CNNs have somewhat mitigated these limitations, allowing us to revisit these ideas on large-scale datasets."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-59",
"text": "For instance, Mitchell and Jordan et al [30, 16] show the importance of using attributes, Funakoshi et al [8] show the importance of relative relations between objects in the same perceptual group, and Kelleher et al [20] show the importance of spatial relationships."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-60",
"text": "These provide motivation for our modeling choices: when considering a referring expression for an object, the model takes into account the relative spatial location of other objects of the same type and visual comparisons to objects in the same perceptual group."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-61",
"text": "The REG datasets of the past were sometimes limited to using computer generated images [38] , or relatively small collections of natural objects [29, 28, 7] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-62",
"text": "Recently, a large-scale referring expression dataset was collected by Kazemzadeh et al [19] featuring natural objects in the real world."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-63",
"text": "Since then, another three REG datasets based on the object labels in MSCOCO have been collected [19, 26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-64",
"text": "The availability of large-scale referring expression datasets allows us to train deep learning models."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-65",
"text": "Additionally, our analysis of these datasets motivates our incorporation of visual comparisons between same-type objects, and the need to tie together choices for referring expression generation between objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-67",
"text": "**MODELS**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-68",
"text": "We implement several model variations for referring expression generation and comprehension."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-69",
"text": "The first set of models are recent state of the art deep learning approaches from Mao et al [26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-70",
"text": "We use these as our baselines (Sec 3.1)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-71",
"text": "Next, we investigate incorporating better visual context features into the models (Sec 3.2)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-72",
"text": "Finally, we explore methods to jointly produce an entire set of referring expressions for all depicted objects of the same category (Sec 3.3)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-74",
"text": "**BASELINES**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-75",
"text": "For comparison, we implement both the baseline and strong model of Mao et al [26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-76",
"text": "Both models utilize a pre-trained CNN network to model the target object and its context within the image, and then use a LSTM for generation."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-77",
"text": "In particular, object and context are modeled as features from a CNN trained to recognize 1,000 object categories [36] from ImageNet [34] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-78",
"text": "Specifically, the visual representation is composed of:"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-79",
"text": "-Target object representation, o i ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-80",
"text": "The object is modeled as features extracted from the VGG-fc7 layer by forwarding its bounding box through the network."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-81",
"text": "-Global context representation, g i ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-82",
"text": "Context is modeled as features extracted from the VGG-fc7 layer for the entire image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-83",
"text": "-Location/size representation, l i , for the target object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-84",
"text": "Location and size are modeled as a 5-d vector encoding the x and y locations of the top left and bottom right corners of the target object bounding box, as well as the bounding box size with respect to the image, i.e., l i = ["
},
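The 5-d location/size feature described above can be sketched as follows; `location_feature` is an illustrative name, and the normalization by image width W and height H is an assumption based on the text's description rather than the paper's exact code:

```python
# Sketch of the 5-d location/size feature l_i: normalized corner coordinates
# plus the box-to-image area ratio. Assumes boxes in pixel coordinates.
def location_feature(box, img_w, img_h):
    """box = (x_tl, y_tl, x_br, y_br) in pixels."""
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return [x_tl / img_w, y_tl / img_h,
            x_br / img_w, y_br / img_h,
            (w * h) / (img_w * img_h)]
```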
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-85",
"text": "Language generation is handled by a long short-term memory network (LSTM) [10] where inputs are the above visual features and the network is trained to generate natural language referring expressions."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-86",
"text": "In Mao et al's baseline [26] , the model uses maximum likelihood training and outputs the most likely referring expression given the target object, context, and location/size features."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-87",
"text": "In addition, they also propose a stronger model that uses maximum mutual information (MMI) training to consider whether a listener would interpret a referring expression unambiguously."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-88",
"text": "They impose this by penalizing the model if a generated referring expression could also be generated by some other object within the image."
},
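A minimal sketch of such a penalty, assuming a max-margin form (the excerpt does not specify the exact MMI variant Mao et al use; `mmi_loss`, `margin`, and `lam` are illustrative names):

```python
# Hedged sketch of an MMI-style objective: reward the expression's likelihood
# given the target object while penalizing cases where another object in the
# image would explain the expression almost as well.
def mmi_loss(logp_target, logp_others, margin=1.0, lam=1.0):
    nll = -logp_target                      # usual maximum-likelihood term
    hardest = max(logp_others)              # most confusable other object
    penalty = max(0.0, margin - (logp_target - hardest))
    return nll + lam * penalty
```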
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-89",
"text": "We implement both their original model and MMI model in our experiments."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-90",
"text": "We subsequently refer to these two models as Baseline and MMI, respectively."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-92",
"text": "**VISUAL COMPARISON**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-93",
"text": "Previous works [2, 30] have shown that objects in an image, of the same type as the target object, are most important for influencing what attributes people use to describe the target."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-94",
"text": "One drawback of considering a general feature over the entire image to encode context (as in the baseline models) is that it may not specifically focus on visual comparisons to the most relevant objects -the other objects of the same object category within the image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-95",
"text": "In this paper, we propose a more explicit encoding of the visual difference between objects of the same category within an image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-96",
"text": "This helps for generating referring expressions which best discriminate the target object from the surrounding objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-97",
"text": "For example, in an image with three cars, two blue and one red, visual appearance comparisons could help generate \"the red car\" as an expression for the latter object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-98",
"text": "Given the referred object and its surrounding objects, we compute two types of features for visual comparison."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-99",
"text": "The first type encodes the similarities and differences in visual appearance between the target object and other objects of the same cateogry depicted in the image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-100",
"text": "Inspired by Sadeghi et al [35] , we compute the difference in visual CNN features as our representation of relative appearance."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-101",
"text": "Because there may be many surrounding objects of the same type in the image, and not every object will provide useful information about how to describe the target object, we need to first select which objects to compare and aggregate their visual differences."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-102",
"text": "In Section 5, we experiment with selecting different subsets of comparison objects: objects of the same category, objects of different category, or all other depicted objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-103",
"text": "For each selected comparison object, we compute the appearance difference as the subtraction of the target object and comparison object CNN representations."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-104",
"text": "We experiment with three different strategies for computing an aggregate vector to represent the visual difference between the target object and the surrounding objects: minimum, maximum, and average over each feature dimension."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-105",
"text": "In our experiments, pooling the average difference between the target object and surrounding objects seems to work best."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-106",
"text": "Therefore, we use this pooling in all experiments."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-107",
"text": "-Visual appearance difference representation, \u03b4v i = 1 n j =i oi\u2212oj oi\u2212oj , where n is the number of objects chosen for comparisons and we use average pooling to aggregate the differences."
},
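The appearance-difference aggregation described above (normalized feature differences, average-pooled) might look like the following; `appearance_difference` is an illustrative name and the zero-norm guard is an added assumption:

```python
import numpy as np

# Sketch of the visual appearance difference feature: average of L2-normalized
# differences between the target's CNN feature and each comparison feature
# (average pooling, which the text reports works best among min/max/average).
def appearance_difference(target_feat, other_feats):
    diffs = []
    for o in other_feats:
        d = target_feat - o
        norm = np.linalg.norm(d)
        diffs.append(d / norm if norm > 0 else d)  # guard identical features
    return np.mean(diffs, axis=0)
```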
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-108",
"text": "The second type of comparison feature encodes the relative location and size differences between the target object and surrounding objects of the same object category."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-109",
"text": "People often use comparative size or location terms in referring expressions, e.g. \"the second giraffe from the left\" or \"the smaller monkey\" [38] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-110",
"text": "To address the dynamic number of nearby objects, we choose up to five comparison objects of the same category as the target object, sorted by distance to the target."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-111",
"text": "When fewer than five objects of the same category are depicted, this 25-d vector (5-d x 5 surrounding objects) is padded with zeros."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-112",
"text": "-Location difference representation, \u03b4l i , where each 5-d difference is computed as \u03b4l ij = ["
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-113",
"text": "wihi ]."
},
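Assembling the padded 25-d location-difference feature can be sketched as below, assuming candidates are sorted by center-to-center distance; function and variable names are illustrative:

```python
import math

# Sketch of the 25-d location-difference feature: up to five same-category
# comparison boxes sorted by distance to the target, each contributing a 5-d
# normalized difference, zero-padded when fewer than five are present.
def location_difference(target, others, max_objs=5):
    tx, ty, tx2, ty2 = target
    tw, th = tx2 - tx, ty2 - ty
    cx, cy = (tx + tx2) / 2, (ty + ty2) / 2

    def dist(b):  # center-to-center distance to the target box
        ox, oy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
        return math.hypot(ox - cx, oy - cy)

    feat = []
    for b in sorted(others, key=dist)[:max_objs]:
        ow, oh = b[2] - b[0], b[3] - b[1]
        feat += [(b[0] - tx) / tw, (b[1] - ty) / th,
                 (b[2] - tx2) / tw, (b[3] - ty2) / th,
                 (ow * oh) / (tw * th)]
    feat += [0.0] * (5 * max_objs - len(feat))  # pad to 25-d
    return feat
```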
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-114",
"text": "In summary, our final visual representation for a target object is:"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-115",
"text": "where o i , g i , l i are the target object, global context, and location/size features from the baseline model, \u03b4v i and \u03b4l i encodes visual appearance difference and location difference."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-116",
"text": "W m and b m project the concatenation of the five types of features to be the final representation."
},
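The final projection can be sketched as follows; the weights passed in here are placeholders, since in the model W_m and b_m are learned jointly with the LSTM, and the feature dimensions are illustrative:

```python
import numpy as np

# Sketch of the final visual representation: concatenate the five feature
# types (target, global context, location/size, appearance difference,
# location difference) and apply a linear projection W_m x + b_m.
def final_representation(o_i, g_i, l_i, dv_i, dl_i, W_m, b_m):
    x = np.concatenate([o_i, g_i, l_i, dv_i, dl_i])
    return W_m @ x + b_m
```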
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-118",
"text": "**JOINT LANGUAGE GENERATION**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-119",
"text": "For the referring expression generation task, rather than generating sentences for each object in an image separately [15] [26], we consider tying the generation process together into a single task to jointly generate expressions for all objects of the same object category depicted in an image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-120",
"text": "This makes sense intuitively -when a person attempts to generate a referring expression for an object in an image they inherently compose that expression while keeping in mind expressions for the other objects in the picture."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-121",
"text": "This can be observed in the fact that the expressions people generate for objects in an image tend to share similar patterns of expression."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-122",
"text": "If you say \"the man on the left\" for one object then you tend to say \"the man on the right\" for the other object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-123",
"text": "We would like our algorithms to mimic these behaviors."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-124",
"text": "Additionally, the algorithm should also be able to push generated expressions away from each other to create less ambiguous references."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-125",
"text": "For example, if we use the word \"red\" to describe one object, then we probably shouldn't use the same word to describe another object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-126",
"text": "To model this joint generation process, we model generation using an LSTM model where in addition to the usual connections between time steps within an expression we also add connections between expressions for different objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-127",
"text": "This architecture is illustrated in Fig 2."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-128",
"text": "Specifically, we use LSTM to generate multiple referring expressions, {r i }, given depicted objects of the same type, {o j }."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-129",
"text": "where w it are words at time t, v i visual representations, and h jt is the hidden output of j-th object at time step t that encodes the visual and sentence information for the j-th object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-130",
"text": "As visual comparison, we aggregate the difference of hidden outputs to push away ambiguous information."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-131",
"text": ". There, n is the the number of other objects of the same type."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-180",
"text": "If the IOU is larger than 0.5 we count it as a true positive."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-132",
"text": "The hidden difference is jointly embedded with the target object's hidden output, and forwarded to the softmax layer for predicting the word."
},
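The per-step feature used for word prediction can be sketched as below, assuming the hidden differences are aggregated the same way as the visual ones (average of normalized differences); names are illustrative and the learned softmax layer is omitted:

```python
import numpy as np

# Sketch of the joint-generation step: the target's hidden state is compared
# with the other same-type objects' hidden states, and the aggregated hidden
# difference is concatenated with the target's hidden output before the
# word-prediction (softmax) layer.
def joint_step_features(h_target, h_others):
    diffs = []
    for h in h_others:
        d = h_target - h
        n = np.linalg.norm(d)
        diffs.append(d / n if n > 0 else d)
    dh = np.mean(diffs, axis=0) if diffs else np.zeros_like(h_target)
    return np.concatenate([h_target, dh])
```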
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-134",
"text": "**DATA**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-135",
"text": "We make use of 3 referring expression datasets in our work, all collected on top of the Microsoft COCO image collection [24] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-136",
"text": "One dataset, RefCOCOg [26] is collected in a non-interactive setting, while the other two datasets, RefCOCO and RefCOCO+, are collected interactively in a two-player game [19] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-137",
"text": "In the following, we describe each dataset and provide some analysis of their similarities and differences, and then discuss splits of the datasets used in our experiments ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-139",
"text": "**DATASETS & ANALYSIS**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-140",
"text": "Images for each dataset were selected to contain multiple objects of the same category (object categories depicted cover the 80 common objects from MSCOCO with ground-truth segmentation)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-141",
"text": "These images provide useful cases for referring expression generation since the referrer needs to compose a referring expression that uniquely singles out one object from other relevant objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-142",
"text": "RefCOCOg: This dataset was collected on Amazon Mechanical Turk in a non-interactive setting."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-143",
"text": "One set of workers were asked to write natural language referring expressions for objects in MSCOCO images then another set of workers were asked to click on the indicated object given a referring expression."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-144",
"text": "If the click overlapped with the correct object then the referring expression was considered valid and added to the dataset."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-145",
"text": "If not, another referring expression was collected for the object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-146",
"text": "This dataset consists of 85,474 referring expressions for 54,822 objects in 26,711 images."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-147",
"text": "Images were selected to contain between 2 and 4 objects of the same object category."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-148",
"text": "RefCOCO & RefCOCO+: These datasets were collected using the ReferitGame [19] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-149",
"text": "In this two-player game, the first player is shown an image with a segmented target object and asked to write a natural language expression referring to the target object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-150",
"text": "The second player is shown only the image and the referring expression and asked to click on the corresponding object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-151",
"text": "If the players do their job correctly, they receive points and swap roles."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-152",
"text": "If not, they are presented with a new object and image for description."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-153",
"text": "Images in these collections were selected to contain two or more objects of the same object category."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-154",
"text": "In the RefCOCO dataset, no restrictions are placed on the type of language used in the referring expressions while in the RefCOCO+ dataset players are disallowed from using location words in their referring expressions by adding \"taboo\" words to the ReferItGame."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-155",
"text": "This dataset was collected to obtain a referring expression dataset focsed on purely appearance based description, e.g., \"the man in the yellow polka-dotted shirt\" rather than \"the second man from the left\", which tend to be more interesting from a computer vision based perspective and are independent of viewer perspective."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-181",
"text": "Otherwise, we count it as a false positive."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-156",
"text": "RefCOCO consists of 142,209 refer expressions for 50,000 objects in 19,994 images, and RefCOCO+ has 141,564 expressions for 49,856 objects in 19,992 images."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-157",
"text": "Dataset Comparisons: As shown in Fig. 1 , the languages used in RefCOCO and RefCOCO+ datasets tend to be more concise and less flowery than the languages used in the RefCOCOg."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-158",
"text": "RefCOCO expressions have an average length of 3.61 while RefCOCO+ have an average length of 3.53, and RefCOCOg contain an average of 8.43 words."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-159",
"text": "This is most likely due to the differences in collection strategy."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-160",
"text": "RefCOCO and RefCOCO+ were collected in a game scenario where players are trying to efficiently provide enough information to indicate the correct object to the other player."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-161",
"text": "RefCOCOg was collected in independent rounds of Mechanical Turk without any interactive time constraints and therefore tend to provide more complex expressions, often entire sentences rather than phrases."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-162",
"text": "In addition, RefCOCO and RefCOCO+ do not limit the number of objects of the same type to 4 and thus contain some images with many objects of the same type."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-163",
"text": "Both RefCOCO and RefCOCO+ contain an average of 3.9 sametype objects per image, while RefCOCOg contains an average of 1.63 sametype objects per image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-164",
"text": "The large number of same-type objects per image in RefCOCO and RefCOCO+ suggests that incorporating visual comparisons to same-type objecs will be useful."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-165",
"text": "Dataset Splits: There are two types of splits of the data into train/test sets: a per-object split and a people-vs-objects split."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-166",
"text": "The first type is per-object split."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-167",
"text": "In this split, the dataset is divided by randomly partitioning objects into training and testing sets."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-168",
"text": "This means that each object will only appear either in training or testing set, but that one object from an image may appear in the training set while another object from the same image may appear in the test set."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-169",
"text": "We use this split for RefCOCOg since same division was used in the previous state-of-the-art approach [26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-170",
"text": "The second type is people-vs-objects splits."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-171",
"text": "One thing we observe from analyzing the datasets is that about half of the referred objects are people."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-172",
"text": "Therefore, we create a split for RefCOCO and RefCOCO+ datasets that evaluates images containing multiple people (testA) vs images containing multiple instances of all other objects (testB)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-173",
"text": "In this split all objects from an image will appear either in the training or testing sets, but not both."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-174",
"text": "This split creates a more meaningfully separated division between training and testing, allowing us to evaluate the usefulness of context more fairly."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-176",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-177",
"text": "We first perform some experiments to analyze the use of context in referring expressions (Sec 5.1)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-178",
"text": "Given these findings, we then perform experiments evaluating the usefulness of our proposed visual and language innovations on the comprehension (Sec 5.2) and generation tasks (Sec 5.3)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-179",
"text": "In experiments for the referring expression comprehension task, we use the same evaluation as Mao et al [26] , namely we first predict the region referred by the given expression, then we compute the intersection over union (IOU) ratio between the true and predicted bounding box."
},
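The IOU evaluation can be sketched as follows, with boxes as (x1, y1, x2, y2); the 0.5 true-positive threshold follows the evaluation protocol stated in the text, and the function names are illustrative:

```python
# Intersection-over-union between predicted and ground-truth boxes, with the
# 0.5 threshold used to count a prediction as a true positive.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def is_true_positive(pred, gt, thresh=0.5):
    return iou(pred, gt) > thresh
```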
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-183",
"text": "For the referring expression generation task we use automatic evaluation metrics, BLEU, ROUGE, and METEOR developed for evaluating machine translation results, commonly used to evaluate language generation results [41, 18, 5, 27, 39, 23] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-184",
"text": "We further perform human evaluations, and propose a new metric evaluating the duplicate rate of generated expressions."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-185",
"text": "For both tasks, we compare our models with \"Baseline\" and \"MMI\" [26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-186",
"text": "Specifically, Table 1 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-187",
"text": "Expression Comprehension accuracies on RefCOCO and RefCOCO+ of the Baseline model with differenct context source."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-188",
"text": "Scale n indicates the size of the cropped window centered by the target object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-189",
"text": "we denote \"visdif\" as our visual comparison model, and \"tie\" as the LSTM tying model."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-190",
"text": "We also perform an ablation study, evaluating the combinations."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-192",
"text": "**ANALYSIS EXPERIMENTS**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-193",
"text": "Context Representation As previously discussed, we suggest that the approaches proposed in recent referring expression works [26, 14] make use of relatively weak contextual information, by only considering a single global image context for all objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-194",
"text": "To verify this intuition, we implemented both the baseline and strong MMI models from Mao et al [26] , and compare the results for referring expression comprehension task with and without global context on RefCOCO and Refcoco+ in Table 1 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-195",
"text": "Surprisingly, we find that the global context does not improve the performance of the model."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-196",
"text": "In fact, adding context even decreases performance slightly."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-197",
"text": "This may be due to the fact that the global context for each object in an image would be the same, introducing some ambiguity into the referring expression comprehension task."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-198",
"text": "Given these findings, we implemented a simple modification to the global context, computing the same visual representation, but on a somewhat scaled window centered around the target object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-199",
"text": "We found this to improve performance, suggesting room for improving the visual context feature."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-200",
"text": "This motivates our development of a better context feature."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-201",
"text": "Visual Comparison For our visual comparison model, there could be several choices regarding which objects from the image should be compared to the target object."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-202",
"text": "We experiment with three sets of reference objects on the RefCOCO and RefCOCO+ datasets: a) objects of the same category in the image, b) objects of a different category in the image, and c) all objects appearing in the image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-203",
"text": "We use our \"visdif\" model for this experiment."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-204",
"text": "The results are shown in Figure 3 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-205",
"text": "It is clear that visual comparisons to same-category objects are the most useful for the referring expression comprehension task."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-206",
"text": "This mimics how humans refer to objects: we tend to point out the differences between the target object and the other same-category objects within the same image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-207",
"text": "Fig. 3 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-208",
"text": "Comprehension accuracies on RefCOCO and RefCOCO+ datasets."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-260",
"text": "In addition, for the referring expression generation task, we explore methods for joint generation over all relevant objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-209",
"text": "We compare the performance of the \"visdif\" model without visual comparison against visual comparison with different-category objects, with all objects, and with same-category objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-210",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-211",
"text": "**REFERRING EXPRESSION COMPREHENSION**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-212",
"text": "We evaluate performance on the referring expression comprehension task on RefCOCO, RefCOCO+ and RefCOCOg datasets."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-213",
"text": "For RefCOCO and RefCOCO+, we evaluate on the two subsets of people (testA) and all other objects (testB)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-214",
"text": "For RefCOCOg, we evaluate on the per-object split as in previous work [26] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-215",
"text": "Since the authors have not released their test set, we report performance on their validation set only, using the hyper-parameters optimized on RefCOCO."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-216",
"text": "Table 2 shows the comprehension accuracies."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-217",
"text": "We observe that our implementation of Mao et al. [26] achieves performance comparable to the numbers reported in their paper."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-218",
"text": "We also find that adding visual comparison features to the Baseline model improves performance across all datasets and splits."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-219",
"text": "Similar improvements are also observed on top of the MMI model."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-220",
"text": "In order to make a fully automatic referring system, we also train a Fast-RCNN [9] detector and build our system on top of the detections."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-221",
"text": "We train Fast-RCNN on the MSCOCO validation portion only, as RefCOCO and RefCOCO+ were collected from MSCOCO training data."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-222",
"text": "For RefCOCOg, we use the detection results provided by [26] , which were trained using Multibox [4] ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-223",
"text": "Results are shown in the bottom half of Table 2 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-224",
"text": "Although all comprehension accuracies drop due to imperfect detections, the improvements of our models over Baseline and MMI are still observed."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-225",
"text": "One weakness of our automatic system is that it highly depends on detection performance, especially for general objects (testB)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-226",
"text": "However, considering our detector was trained on MSCOCO validation data only, we believe this weakness may be alleviated with more training data and stronger detection techniques, e.g., [12] . Table 2 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-227",
"text": "Referring Expression comprehension results on the RefCOCO, RefCOCO+, and RefCOCOg datasets."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-228",
"text": "Rows labeled \"method(det)\" are the results of the automatic system built on Fast-RCNN [9] and Multibox [4] detections."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-229",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-230",
"text": "**REFERRING EXPRESSION GENERATION**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-231",
"text": "For the referring expression generation task, we evaluate the usefulness of our visual comparison features as well as our joint language generation model."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-232",
"text": "These serve to tie the generation process together so that the model considers other objects of the same type both visually and linguistically during generation."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-233",
"text": "On the visual side, comparisons are used to judge similarity of the target object to other objects of the same type in terms of appearance, size and location."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-234",
"text": "On the language side, the joint LSTM model serves to both differentiate and mimic language patterns in the referring expressions for the entire set of depicted objects."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-235",
"text": "Fig 5 shows comparisons between our model and other methods."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-236",
"text": "Our full results are shown in Table 3 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-237",
"text": "We find that incorporating our visual comparison features into the Baseline model improves generation quality (compare row \"Baseline\" to row \"visdif\")."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-238",
"text": "It also improves the performance of MMI model (compare row \"MMI\" to row \"visdif+MMI\")."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-239",
"text": "We also observe that tying the language generation together across all objects consistently improves the performance (compare the bottom three \"+tie\" rows with the above)."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-240",
"text": "In particular, \"visdif+tie\" achieves the highest score under almost every metric."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-241",
"text": "We do not perform language tying on RefCOCOg, since some objects from an image may appear in training while others appear in testing."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-242",
"text": "We observe in Table 3 that models incorporating \"+MMI\" score worse than those without \"+MMI\" under the automatic scoring metrics."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-243",
"text": "To verify whether these metrics really reflect performance, we performed human evaluations on the expression generation task."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-244",
"text": "Three Turkers were asked to click on the referred object given the image and the generated expression."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-245",
"text": "If more than two clicked on the true target object, we consider this expression to be correct."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-246",
"text": "Table 4 shows the human evaluation results, indicating that models with \"+MMI\" consistently achieve higher performance."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-247",
"text": "We also find \"+tie\" methods perform the best, indicating that tying language together is able to produce less ambiguous referring expressions."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-248",
"text": "Some referring expression generation examples using different methods are shown in Fig 5."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-249",
"text": "In addition, we show more examples of tied generation using the \"visdif+MMI+tie\" model in Fig 6."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-250",
"text": "Finally, we introduce another evaluation metric which measures the fraction of images for which an algorithm produces the same generated referring expression for multiple objects within the image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-251",
"text": "Obviously, a good referring expression generator should never produce the same expression for two objects within the same image."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-252",
"text": "Thus we would like this number to be as small as possible."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-253",
"text": "The evaluation results under such metric are shown in Table 5 ."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-254",
"text": "We find that \"+MMI\" produces a smaller number of duplicate expressions on both RefCOCO and RefCOCO+, while \"+tie\" helps generate even more distinct expressions."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-255",
"text": "Our combined model \"visdif+MMI+tie\" performs the best under this metric."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-256",
"text": "----------------------------------"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-257",
"text": "**CONCLUSION**"
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-258",
"text": "In this paper, we have developed a new model for incorporating detailed context into referring expression models."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-259",
"text": "With this visual comparison based context we have improved performance over previous state of the art for referring expression generation and comprehension."
},
{
"sent_id": "163770df02c1110edc60e7cac90ad2-C001-261",
"text": "Experiments verify that this joint generation improves results over previous attempts to reduce ambiguity during generation."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"163770df02c1110edc60e7cac90ad2-C001-28"
],
[
"163770df02c1110edc60e7cac90ad2-C001-69",
"163770df02c1110edc60e7cac90ad2-C001-70"
],
[
"163770df02c1110edc60e7cac90ad2-C001-75"
],
[
"163770df02c1110edc60e7cac90ad2-C001-135",
"163770df02c1110edc60e7cac90ad2-C001-136"
],
[
"163770df02c1110edc60e7cac90ad2-C001-169"
],
[
"163770df02c1110edc60e7cac90ad2-C001-179"
],
[
"163770df02c1110edc60e7cac90ad2-C001-194"
],
[
"163770df02c1110edc60e7cac90ad2-C001-214"
],
[
"163770df02c1110edc60e7cac90ad2-C001-217"
],
[
"163770df02c1110edc60e7cac90ad2-C001-222"
]
],
"cite_sentences": [
"163770df02c1110edc60e7cac90ad2-C001-28",
"163770df02c1110edc60e7cac90ad2-C001-69",
"163770df02c1110edc60e7cac90ad2-C001-75",
"163770df02c1110edc60e7cac90ad2-C001-136",
"163770df02c1110edc60e7cac90ad2-C001-169",
"163770df02c1110edc60e7cac90ad2-C001-179",
"163770df02c1110edc60e7cac90ad2-C001-194",
"163770df02c1110edc60e7cac90ad2-C001-214",
"163770df02c1110edc60e7cac90ad2-C001-217",
"163770df02c1110edc60e7cac90ad2-C001-222"
]
},
"@BACK@": {
"gold_contexts": [
[
"163770df02c1110edc60e7cac90ad2-C001-29"
],
[
"163770df02c1110edc60e7cac90ad2-C001-63"
],
[
"163770df02c1110edc60e7cac90ad2-C001-86"
],
[
"163770df02c1110edc60e7cac90ad2-C001-135",
"163770df02c1110edc60e7cac90ad2-C001-136"
],
[
"163770df02c1110edc60e7cac90ad2-C001-193"
]
],
"cite_sentences": [
"163770df02c1110edc60e7cac90ad2-C001-29",
"163770df02c1110edc60e7cac90ad2-C001-63",
"163770df02c1110edc60e7cac90ad2-C001-86",
"163770df02c1110edc60e7cac90ad2-C001-136",
"163770df02c1110edc60e7cac90ad2-C001-193"
]
},
"@DIF@": {
"gold_contexts": [
[
"163770df02c1110edc60e7cac90ad2-C001-53",
"163770df02c1110edc60e7cac90ad2-C001-54"
],
[
"163770df02c1110edc60e7cac90ad2-C001-119"
]
],
"cite_sentences": [
"163770df02c1110edc60e7cac90ad2-C001-53",
"163770df02c1110edc60e7cac90ad2-C001-119"
]
},
"@MOT@": {
"gold_contexts": [
[
"163770df02c1110edc60e7cac90ad2-C001-75"
]
],
"cite_sentences": [
"163770df02c1110edc60e7cac90ad2-C001-75"
]
},
"@SIM@": {
"gold_contexts": [
[
"163770df02c1110edc60e7cac90ad2-C001-169"
],
[
"163770df02c1110edc60e7cac90ad2-C001-179"
],
[
"163770df02c1110edc60e7cac90ad2-C001-214"
],
[
"163770df02c1110edc60e7cac90ad2-C001-217"
]
],
"cite_sentences": [
"163770df02c1110edc60e7cac90ad2-C001-169",
"163770df02c1110edc60e7cac90ad2-C001-179",
"163770df02c1110edc60e7cac90ad2-C001-214",
"163770df02c1110edc60e7cac90ad2-C001-217"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"163770df02c1110edc60e7cac90ad2-C001-185"
]
],
"cite_sentences": [
"163770df02c1110edc60e7cac90ad2-C001-185"
]
}
}
},
"ABC_a774b918013dbf60eb8cc0ad1de2f9_4": {
"x": [
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-63",
"text": "The reconstruction of an input x is given by r(x) = g(h(x))."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-2",
"text": "We introduce a neural network model that marries together ideas from two prominent strands of research on domain adaptation through representation learning: structural correspondence learning (SCL, (Blitzer et al., 2006)) and autoencoder neural networks (NNs)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-3",
"text": "Our model is a three-layer NN that learns to encode the non-pivot features of an input example into a low-dimensional representation, so that the existence of pivot features (features that are prominent in both domains and convey useful information for the NLP task) in the example can be decoded from that representation."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-4",
"text": "The low-dimensional representation is then employed in a learning algorithm for the task."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-5",
"text": "Moreover, we show how to inject pre-trained word embeddings into our model in order to improve generalization across examples with similar pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-6",
"text": "We experiment with the task of cross-domain sentiment classification on 16 domain pairs and show substantial improvements over strong baselines."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-7",
"text": "1"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-10",
"text": "Many state-of-the-art algorithms for Natural Language Processing (NLP) tasks require labeled data."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-11",
"text": "Unfortunately, annotating sufficient amounts of such data is often costly and labor intensive."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-12",
"text": "Consequently, for many NLP applications even resource-rich languages like English have labeled data in only a handful of domains."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-13",
"text": "Domain adaptation (Daum\u00e9 III, 2007; Ben-David et al., 2010) , training an algorithm on labeled data taken from one domain so that it can perform properly on data from other domains, is therefore recognized as a fundamental challenge in NLP."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-14",
"text": "Indeed, over the last decade domain adaptation methods have been proposed for tasks such as sentiment classification (Bollegala et al., 2011b) , POS tagging (Schnabel and Sch\u00fctze, 2013) , syntactic parsing (Reichart and Rappoport, 2007; McClosky et al., 2010; Rush et al., 2012) and relation extraction (Jiang and Zhai, 2007; Bollegala et al., 2011a) , if to name just a handful of applications and works."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-15",
"text": "Leading recent approaches to domain adaptation in NLP are based on Neural Networks (NNs), and particularly on autoencoders (Glorot et al., 2011; Chen et al., 2012) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-16",
"text": "These models are believed to extract features that are robust to cross-domain variations."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-17",
"text": "However, while excelling on benchmark domain adaptation tasks such as cross-domain product sentiment classification (Blitzer et al., 2007) , the reasons to this success are not entirely understood."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-18",
"text": "In the pre-NN era, a prominent approach to domain adaptation in NLP, and particularly in sentiment classification, has been structural correspondence learning (SCL) (Blitzer et al., 2006, 2007) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-19",
"text": "Following the auxiliary problems approach to semi-supervised learning (Ando and Zhang, 2005) , this method identifies correspondences among features from different domains by modeling their correlations with pivot features: features that are frequent in both domains and are important for the NLP task."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-20",
"text": "Non-pivot features from different domains which are correlated with many of the same pivot features are assumed to correspond, providing a bridge between the domains."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-21",
"text": "Elegant and well motivated as it may be, SCL has not been state-of-the-art since neural approaches took over."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-22",
"text": "In this paper we marry these approaches, proposing NN models inspired by ideas from both."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-23",
"text": "Particularly, our basic model receives the non-pivot features of an input example, encodes them into a hidden layer and then, instead of decoding the input layer as an autoencoder would, it aims to decode the pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-24",
"text": "Our more advanced model is identical to the basic one except that the decoding matrix is not learned but is rather replaced with a fixed matrix consisting of pre-trained embeddings of the pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-25",
"text": "Under this model the probability of the i-th pivot feature to appear in an example is a (non-linear) function of the dot product of the feature's embedding vector and the network's hidden layer vector."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-26",
"text": "As explained in Section 3, this approach encourages the model to learn similar hidden layers for documents that have different pivot features as long as these features have similar meaning."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-27",
"text": "In sentiment classification, for example, although one positive review may use the unigram pivot feature excellent while another positive review uses the pivot great, as long as the embeddings of pivot features with similar meaning are similar (as expected from high quality embeddings) the hidden layers learned for both documents are biased to be similar."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-28",
"text": "We experiment with the task of cross-domain product sentiment classification of (Blitzer et al., 2007) , consisting of 4 domains (12 domain pairs) and further add an additional target domain, consisting of sentences extracted from social media blogs (total of 16 domain pairs)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-29",
"text": "For pivot feature embedding in our advanced model, we employ the word2vec algorithm (Mikolov et al., 2013) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-30",
"text": "Our models substantially outperform strong baselines: the SCL algorithm, the marginalized stacked denoising autoencoder (MSDA) model (Chen et al., 2012) and the MSDA-DAN model (Ganin et al., 2016) that combines the power of MSDA with a domain adversarial network (DAN)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-32",
"text": "**BACKGROUND AND CONTRIBUTION**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-33",
"text": "Domain adaptation is a fundamental, long standing problem in NLP (e.g. (Roark and Bacchiani, 2003; Chelba and Acero, 2004; Daume III and Marcu, 2006) )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-34",
"text": "The challenge stems from the fact that data in the source and the target domains are often distributed differently, making it hard for a model trained in the source domain to make valuable predictions in the target domain."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-35",
"text": "Domain adaptation has various setups, differing with respect to the amounts of labeled and unlabeled data available in the source and target domains."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-36",
"text": "The setup we address, commonly referred to as unsupervised domain adaptation is where both domains have ample unlabeled data, but only the source domain has labeled training data."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-37",
"text": "There are several approaches to domain adaptation in the machine learning literature, including instance reweighting (Huang et al., 2007; Mansour et al., 2009 ), sub-sampling from both domains (Chen et al., 2011) and learning joint target and source feature representations (Blitzer et al., 2006; Daum\u00e9 III, 2007; Xue et al., 2008; Glorot et al., 2011; Chen et al., 2012) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-38",
"text": "Here, we discuss works that, like us, take the representation learning path."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-39",
"text": "Most works under this approach follow a two steps protocol: First, the representation learning method (be it SCL, an autoencoder network, our proposed network model or any other model) is trained on unlabeled data from both the source and the target domains; Then, a classifier for the supervised task (e.g. sentiment classification) is trained in the source domain and this trained classifier is applied to test examples from the target domain."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-40",
"text": "Each input example of the task classifier, at both training and test, is first run through the representation model of the first step and the induced representation is fed to the classifier."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-41",
"text": "Recently, end-to-end models that jointly learn to represent the data and to perform the classification task have also been proposed."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-42",
"text": "We compare our models to one such method (MSDA-DAN, (Ganin et al., 2016) )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-43",
"text": "Below, we first discuss two prominent ideas in feature representation learning: pivot features and autoencoder neural networks."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-44",
"text": "We then summarize our contribution in light of these approaches."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-45",
"text": "Pivot and Non-Pivot Features The definitions of this approach are given in Blitzer et al. (2006, 2007) , where SCL is presented in the context of POS tagging and sentiment classification, respectively."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-46",
"text": "Fundamentally, the method divides the shared feature space of both the source and the target domains into the set of pivot features that are frequent in both domains and are prominent in the NLP task, and a complementary set of non-pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-47",
"text": "In this section we abstract away from the actual feature space and its division to pivot and non-pivot subsets."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-48",
"text": "In Section 4 we discuss this issue in the context of sentiment classification."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-49",
"text": "For representation learning, SCL employs the pivot features in order to learn mappings from the original feature space of both domains to a shared, low-dimensional, real-valued feature space."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-50",
"text": "This is done by training classifiers whose input consists of the non-pivot features of an input example and their binary classification task (the auxiliary task) is predicting, every classifier for one pivot feature, whether the pivot associated with the classifier appears in the input example or not."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-51",
"text": "These classifiers are trained on unlabeled data from both the target and the source domains: the training supervision naturally occurs in the data, no human annotation is required."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-52",
"text": "The matrix consisting of the weight vectors of these classifiers is then post-processed with singular value decomposition (SVD), to facilitate final compact representations."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-53",
"text": "The SVD derived matrix serves as a transformation matrix which maps feature vectors in the original space into a low-dimensional real-valued feature space."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-54",
"text": "Numerous works have employed the SCL method in particular and the concept of pivot features for domain adaptation in general."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-55",
"text": "A prominent method is spectral feature alignment (SFA, (Pan et al., 2010) )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-56",
"text": "This method aims to align domain-specific (non-pivot) features from different domains into unified clusters, with the help of domain-independent (pivot) features as a bridge."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-57",
"text": "Recently, Gouws et al. (2012) and Bollegala et al. (2015) implemented ideas related to those described here within an NN for cross-domain sentiment classification."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-58",
"text": "For example, the latter work trained a word embedding model so that for every document, regardless of its domain, pivots are good predictors of nonpivots, and the pivots' embeddings are similar across domains."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-59",
"text": "Yu and Jiang (2016) presented a convolutional NN that learns sentence embeddings using two auxiliary tasks (whether the sentence contains a positive or a negative domain independent sentiment word), purposely avoiding prediction with respect to a large set of pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-60",
"text": "In contrast to these works our model can learn useful cross-domain representations for any type of input example and in our cross-domain sentiment classification experiments it learns document level embeddings."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-61",
"text": "That is, unlike Bollegala et al. (2015) we do not learn word embeddings and unlike Yu and Jiang (2016) we are not restricted to input sentences."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-62",
"text": "Autoencoder NNs An autoencoder is comprised of an encoder function h and a decoder function g, typically with the dimension of h smaller than that of its argument."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-64",
"text": "Autoencoders are typically trained to minimize a reconstruction error loss(x, r(x))."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-65",
"text": "Example loss functions are the squared error, the Kullback-Leibler (KL) divergence and the cross entropy of elements of x and elements of r(x)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-66",
"text": "The last two loss functions are appropriate options when the elements of x or r(x) can be interpreted as probabilities of a discrete event."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-67",
"text": "In Section 3 we get back to this point when defining the cross-entropy loss function of our model."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-68",
"text": "Once an autoencoder has been trained, one can stack another autoencoder on top of it, by training a second model which sees the output of the first as its training data (Bengio et al., 2007) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-69",
"text": "The parameters of the stack of autoencoders describe multiple representation levels for x and can feed a classifier, to facilitate domain adaptation."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-70",
"text": "Recent prominent models for domain adaptation for sentiment classification are based on a variant of the autoencoder called Stacked Denoising Autoencoders (SDA, (Vincent et al., 2008) )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-71",
"text": "In a denoising autoencoder (DEA) the input vector x is stochastically corrupted into a vector x̃, and the model is trained to minimize a denoising reconstruction error loss(x, r(x̃))."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-72",
"text": "SDA for cross-domain sentiment classification was implemented by Glorot et al. (2011) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-73",
"text": "Later, Chen et al. (2012) proposed the marginalized SDA (MSDA) model that is more computationally efficient and scalable to high-dimensional feature spaces than SDA."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-74",
"text": "Marginalization of denoising autoencoders has gained interest since MSDA was presented."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-75",
"text": "Yang and Eisenstein (2014) showed how to improve efficiency further by exploiting noising functions designed for structured feature spaces, which are common in NLP."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-76",
"text": "More recently, Clinchant et al. (2016) proposed an unsupervised regularization method for MSDA based on the work of Ganin and Lempitsky (2015) and Ganin et al. (2016) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-77",
"text": "There is a recent interest in models based on variational autoencoders (Kingma and Welling, 2014; Rezende et al., 2014) , for example the variational fair autoencoder model (Louizos et al., 2016) , for domain adaptation."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-78",
"text": "However, these models are still not competitive with MSDA on the tasks we consider here."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-79",
"text": "Our Contribution We propose an approach that marries the above lines of work."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-80",
"text": "Our model is similar in structure to an autoencoder."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-81",
"text": "However, instead of reconstructing the input x from the hidden layer h(x), its reconstruction function r receives a low dimensional representation of the non-pivot features of the input (h(x np ), where x np is the non-pivot representation of x (Section 3)) and predicts whether each of the pivot features appears in this example or not."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-82",
"text": "As far as we know, we are the first to exploit the mutual strengths of pivot-based methods and autoencoders for domain adaptation."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-84",
"text": "**NEURAL SCL MODELS**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-85",
"text": "We propose two models: the basic Autoencoder SCL 3.2) ), that directly integrates ideas from autoencoders and SCL, and the elaborated Autoencoder SCL with Similarity Regularization 3.3) , where pre-trained word embeddings are integrated into the basic model."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-87",
"text": "**DEFINITIONS**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-88",
"text": "We denote the feature set in our problem with f , the subset of pivot features with f p \u2286 {1, . . . , |f |} and the subset of non-pivot features with f np \u2286 {1, . . . , |f |} such that f p \u222a f np = {1, . . . , |f |} and f p \u2229 f np = \u2205. We further denote the feature representation of an input example X with x. Following this notation, the vector of pivot features of X is denoted with x p while the vector of non-pivot features is denoted with x np ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-89",
"text": "In order to learn a robust and compact feature representation for X we will aim to learn a nonlinear prediction function from x np to x p ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-90",
"text": "As discussed in Section 4 the task we experiment with is cross-domain sentiment classification."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-91",
"text": "Following previous work (e.g. (Blitzer et al., 2006 (Blitzer et al., , 2007 Chen et al., 2012) our feature representation consists of binary indicators for the occurrence of word unigrams and bigrams in the represented document."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-92",
"text": "In what follows we hence assume that the feature representation x of an example X is a binary vector, and hence so are x p and x np ."
},
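As an illustration of these definitions, the binary vectors x, x p and x np can be built from a document's word-occurrence indicators. This is a minimal sketch; the toy vocabulary and the choice of which features are pivots are invented for the example, not taken from the paper:

```python
# Toy vocabulary of unigram features; the pivot set is an assumption
# made up purely for this illustration.
vocab = ["great", "excellent", "terrible", "book", "read", "plot"]
pivots = {"great", "excellent", "terrible"}            # f_p
doc_words = {"great", "book", "plot"}                  # words in the example X

x = [1 if w in doc_words else 0 for w in vocab]        # full binary vector x
x_p = [xi for w, xi in zip(vocab, x) if w in pivots]       # pivot part x_p
x_np = [xi for w, xi in zip(vocab, x) if w not in pivots]  # non-pivot part x_np
```

Here x_p and x_np partition the coordinates of x exactly as f p and f np partition the feature set.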
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-94",
"text": "**AUTOENCODER SCL (AE-SCL)**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-95",
"text": "In order to solve the prediction problem, we present an NN architecture inspired by autoencoders (Figure 1 )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-96",
"text": "Given an input example X with a feature representation x, our fundamental idea is to start from a non-pivot feature representation, x np , encode x np into an intermediate representation h w h (x np ), and, finally, predict with a function r w r (h w h (x np )) the occurrences of pivot features,"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-97",
"text": "As is standard in NN modeling, we introduce non-linearity to the model through a non-linear activation function denoted with \u03c3 (the sigmoid function in our models)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-98",
"text": "Consequently we get:"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-99",
"text": "In what follows we denote the output of the model with o = r w r (h w h (x np ))."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-100",
"text": "Since the sigmoid function outputs values in the [0, 1] interval, o can be interpreted as a vector of probabilities with the i-th coordinate reflecting the probability of the i-th pivot feature to appear in the input example."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-101",
"text": "Cross-entropy is hence a natural loss function to jointly reason about all pivots:"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-102",
"text": "As x p is a binary vector, for each pivot feature, x p i , only one of the two members of the sum that take this feature into account gets a non-zero value."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-103",
"text": "The higher the probability of the correct event is (whether or not x p i appears in the input example), the lower is the loss."
},
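The AE-SCL forward pass and the cross-entropy loss described above can be sketched in NumPy. This is a minimal illustration under assumed toy dimensions and random weight initialization; it is not the paper's actual training code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_scl_forward(x_np, w_h, w_r):
    """Encode non-pivot features, then predict pivot occurrence probabilities."""
    h = sigmoid(w_h @ x_np)   # intermediate representation h_{w_h}(x_np)
    o = sigmoid(w_r @ h)      # o_i in [0,1]: probability that pivot i appears
    return h, o

def cross_entropy(x_p, o, eps=1e-12):
    """Binary cross-entropy summed over all pivots; x_p is a 0/1 vector."""
    return -np.sum(x_p * np.log(o + eps) + (1 - x_p) * np.log(1 - o + eps))

rng = np.random.default_rng(0)
n_nonpivot, n_pivot, dim_h = 20, 5, 8   # toy sizes (assumptions)
w_h = rng.normal(scale=0.1, size=(dim_h, n_nonpivot))
w_r = rng.normal(scale=0.1, size=(n_pivot, dim_h))
x_np = rng.integers(0, 2, size=n_nonpivot).astype(float)
x_p = rng.integers(0, 2, size=n_pivot).astype(float)

h, o = ae_scl_forward(x_np, w_h, w_r)
loss = cross_entropy(x_p, o)
```

Because x_p is binary, exactly one of the two terms inside the sum is active for each pivot, matching the discussion above.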
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-105",
"text": "**AUTOENCODER SCL WITH SIMILARITY REGULARIZATION (AE-SCL-SR)**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-106",
"text": "An important observation of Blitzer et al. (2007) , is that some pivot features are similar to each other to the level that they indicate the same information with respect to the classification task."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-107",
"text": "For example, in sentiment classification with word unigram features, the words (unigrams) great and excellent are likely to serve as pivot features, as the meaning of each of them is preserved across domains."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-108",
"text": "At the same time, both features convey very similar (positive) sentiment information to the level that a sentiment classifier should treat them as equals."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-109",
"text": "The AE-SCL-SR model is based on two crucial observations."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-110",
"text": "First, in many NLP tasks the pivot features can be pre-embeded into a vector space where pivots with similar meaning have similar vectors."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-111",
"text": "Second, the set f p X i of pivot features that appear in an example X i is typically much smaller than the setf p X i of pivot features that do not appear in it."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-112",
"text": "Hence, if the pivot features of X 1 and X 2 convey the same information about the NLP task (e.g. that the sentiment of both X 1 and X 2 is positive), then even if f p X 1 and f p X 2 are not identical, the intersection between the larger setsf p X 1 andf p X 2 is typically much larger than the symmetric difference between f p X 1 and f p X 2 ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-113",
"text": "For instance, consider two examples, X 1 with the single pivot feature f 1 = great, and X 2 , with the single pivot feature f 2 = excellent."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-114",
"text": "Crucially, even though X 1 and X 2 differ with respect to the existence of f 1 and f 2 , due to the similar meaning of these pivot features, we expect both X 1 and X 2 not to contain many other pivot features, such as terrible, awful and mediocre, whose meanings conflict with that of f 1 and f 2 ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-115",
"text": "To exploit these observations, in AE-SCL-SR the reconstruction matrix w r is pre-trained with a word embedding model and is kept fixed during the training and prediction phases of the neural network."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-116",
"text": "Particularly, the i-th row of w r is set to be the vector representation of the i-th pivot feature as learned by the word embedding model."
},
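The single change AE-SCL-SR makes, fixing the reconstruction matrix w r to pre-trained pivot embeddings, can be sketched as follows. The embedding values here are random stand-ins for word2vec vectors, and the dimensions are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n_pivot, dim_h, n_nonpivot = 4, 6, 10   # toy sizes (assumptions)

# Stand-in for word2vec vectors of the pivot features (one row per pivot);
# in the real model these come from embeddings trained on unlabeled data.
pivot_embeddings = rng.normal(size=(n_pivot, dim_h))

w_r = pivot_embeddings.copy()   # fixed: never updated during training
w_h = rng.normal(scale=0.1, size=(dim_h, n_nonpivot))  # trained as in AE-SCL

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_np = rng.integers(0, 2, size=n_nonpivot).astype(float)
h = sigmoid(w_h @ x_np)
o = sigmoid(w_r @ h)   # row i of w_r is the embedding of pivot i
```

Note that the hidden dimensionality must equal the embedding dimensionality, as the appendix of the paper also states.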
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-117",
"text": "Except from this change, the AE-SCL-SR model is identical to the AE-SCL model described above."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-118",
"text": "Now, denoting the encoding layer for X 1 with h 1 and the encoding layer for X 2 with h 2 , we expect both \u03c3(w r k i \u00b7 h 1 ) and \u03c3(w r k i \u00b7 h 2 ) to get low values (i.e. values close to 0), for those k i conflicting pivot features: pivots whose meanings conflict with that of f p X 1 and f p X 2 ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-119",
"text": "By fixing the representations of similar conflicting features to similar vectors, AE-SCL-SR provides a strong bias for h 1 and h 2 to be similar, as its only way to bias the predictions with respect to these features to be low is by pushing h 1 and h 2 to be similar."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-120",
"text": "Consequently, under AE-SCL-SR the vectors that encode the non-pivot features of documents with similar pivot features are biased to be similar to each other."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-121",
"text": "As mentioned in Section 4 the vectorh = \u03c3 \u22121 (h) forms the feature representation that is fed to the sentiment classifier to facilitate domain adaptation."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-122",
"text": "By definition, when h 1 and h 2 are similar so are theirh 1 andh 2 counterparts."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-124",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-125",
"text": "In this section we describe our experiments."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-126",
"text": "To facilitate clarity, some details are not given here and instead are provided in the appendices."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-127",
"text": "Cross-domain Sentiment Classification To demonstrate the power of our models for domain adaptation we experiment with the task of crossdomain sentiment classification (Blitzer et al., 2007) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-128",
"text": "The data for this task consist of Amazon product reviews from four product domains: Books (B), DVDs (D), Electronic items (E) and Kitchen appliances (K)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-129",
"text": "For each domain 2000 labeled reviews are provided: 1000 are classified as positive and 1000 as negative, and these are augmented with unlabeled reviews: 6000 (B), 34741 (D), 13153 (E) and 16785 (K)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-130",
"text": "We also consider an additional target domain, denoted with Blog: the University of Michigan sentence level sentiment dataset, consisting of sentences taken from social media blogs."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-131",
"text": "2 The dataset for the original task consists of a labeled training set (3995 positive and 3091 negative) and a 33052 sentences test set for which sentiment labels are not provided."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-132",
"text": "We hence used the original test set as our target domain unlabeled set and the original training set as our target domain test set."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-133",
"text": "Baselines Cross-domain sentiment classification has been studied in a large number of papers."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-134",
"text": "However, the difference in preprocessing methods, dataset splits to train/dev/test subsets and the different sentiment classifiers make it hard to directly compare between the numbers reported in past."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-135",
"text": "We hence compare our models to three strong baselines, running all models under the same conditions."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-136",
"text": "We aim to select baselines that represent the state-of-the-art in cross-domain sentiment classification in general, and in the two lines of work we focus at: pivot based and autoencoder based representation learning, in particular."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-137",
"text": "The first baseline is SCL with pivot features selected using the mutual information criterion (SCL-MI, (Blitzer et al., 2007) )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-138",
"text": "This is the SCL method where pivot features are frequent in the unlabeled data of both the source and the target do-mains, and among those features are the ones with the highest mutual information with the task (sentiment) label in the source domain labeled data."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-139",
"text": "We implemented this method."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-140",
"text": "In our implementation unigrams and bigrams should appear at least 10 times in both domains to be considered frequent."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-141",
"text": "For non-pivot features we consider unigrams and bigrams that appear at least 10 times in their domain."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-142",
"text": "The same pivot and non-pivot selection criteria are employed for our AE-SCL and AE-SCL-SR models."
},
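The pivot selection criterion described above (a frequency threshold in both domains, then ranking by mutual information with the source label) can be sketched as follows. The counting scheme, function names and toy data are assumptions for illustration, not the paper's implementation:

```python
from collections import Counter
import math

def mutual_information(docs, labels, feature):
    """MI between a binary feature indicator and a binary label.

    docs: list of sets of features; labels: list of 0/1 labels."""
    n = len(labels)
    joint = Counter((feature in d, y) for d, y in zip(docs, labels))
    mi = 0.0
    for f_val in (True, False):
        p_f = sum(joint[(f_val, y)] for y in (0, 1)) / n
        for y in (0, 1):
            p_y = sum(joint[(fv, y)] for fv in (True, False)) / n
            p_fy = joint[(f_val, y)] / n
            if p_fy > 0:
                mi += p_fy * math.log(p_fy / (p_f * p_y))
    return mi

def select_pivots(src_unlab, tgt_unlab, src_docs, src_labels,
                  num_pivots, min_count=10):
    # Keep features frequent in the unlabeled data of BOTH domains.
    counts_src = Counter(w for d in src_unlab for w in set(d))
    counts_tgt = Counter(w for d in tgt_unlab for w in set(d))
    frequent = [w for w in counts_src
                if counts_src[w] >= min_count and counts_tgt[w] >= min_count]
    # Rank the frequent features by MI with the source-domain label.
    ranked = sorted(frequent,
                    key=lambda w: mutual_information(src_docs, src_labels, w),
                    reverse=True)
    return ranked[:num_pivots]

src_unlab = [{"great", "terrible", "book"}] * 10
tgt_unlab = [{"great", "terrible", "book"}] * 10
src_docs = [{"great"}, {"great"}, {"terrible"}, {"book"}]
src_labels = [1, 1, 0, 0]
top = select_pivots(src_unlab, tgt_unlab, src_docs, src_labels,
                    num_pivots=1, min_count=10)
```

In the toy data, "great" perfectly predicts the positive label, so it gets the highest mutual information and is selected first.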
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-143",
"text": "Among autoencoder models, SDA has shown by Glorot et al. (2011) to outperform SFA and SCL on cross-domain sentiment classification and later on Chen et al. (2012) demonstrated superior performance for MSDA over SDA and SCL on the same task."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-144",
"text": "Our second baseline is hence the MSDA method (Chen et al., 2012) , with code taken from the authors' web page."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-145",
"text": "3 To consider a regularization scheme on top of MSDA representations we also experiment with the MSDA-DAN model (Ganin et al., 2016) which employs a domain adversarial network (DAN) with the MSDA vectors as input."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-146",
"text": "In Ganin et al. (2016) MSDA-DAN has shown to substantially outperform the DAN model when DAN is randomly initialized."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-147",
"text": "The DAN code is taken from the authors' repository."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-148",
"text": "4 For reference we compare to the No-DA case where the sentiment classifier is trained in the source domain and applied to the target domain without adaptation."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-149",
"text": "The sentiment classifier we employ, in this case as well as with our methods and with the SCL-MI and MSDA baselines, is a standard logistic regression classifier."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-151",
"text": "Experimental Protocol Following the unsupervised domain adaptation setup (Section 2), we have access to unlabeled data from both the source and the target domains, which we use to train the representation learning models."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-152",
"text": "However, only the source domain has labeled training data for sentiment classification."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-153",
"text": "The original feature set we start from consists of word unigrams and bigrams."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-154",
"text": "All methods (baselines and ours), except from MSDA-DAN, follow a two-step protocol at both training and test time."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-155",
"text": "In the first step, the input example is run through the representation model which generates a new feature vector for this example."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-156",
"text": "Then, in the second step, this vector is concatenated with the original feature vector of the example and the resulting vector is fed into the sentiment classifier (this concatenation is a standard convention in the baseline methods)."
},
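The two-step protocol (induce a representation, concatenate it with the original feature vector, feed the result to the classifier) can be sketched as follows. The dimensions and the slice used as the non-pivot part are assumptions for the example:

```python
import numpy as np

def induce_features(x_np, w_h):
    """Step 1: the induced representation h~ = w_h x_np (pre-activation)."""
    return w_h @ x_np

def classifier_input(x, x_np, w_h):
    """Step 2: concatenate the original features with the induced vector."""
    return np.concatenate([x, induce_features(x_np, w_h)])

rng = np.random.default_rng(2)
w_h = rng.normal(size=(5, 8))                    # toy encoder weights
x = rng.integers(0, 2, size=12).astype(float)    # original binary features
x_np = x[:8]                                     # toy non-pivot slice (assumption)
features = classifier_input(x, x_np, w_h)        # fed to logistic regression
```

The resulting vector has the original dimensionality plus the induced dimensionality, mirroring the concatenation convention shared by all the compared methods.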
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-157",
"text": "For MSDA-DAN all the above holds, except from one exception."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-158",
"text": "MSDA-DAN gets an input representation that consists of a concatenation of the original and the MSDA-induced feature sets."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-159",
"text": "As this is an end-to-end model that predicts the sentiment class jointly with the new feature representation, we do not employ any additional sentiment classifier."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-160",
"text": "As in the other models, MSDA-DAN utilizes source domain labeled data as well as unlabeled data from both the source and the target domains at training time."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-161",
"text": "We experiment with a 5-fold cross-validation on the source domain (Blitzer et al., 2007) : 1600 reviews for training and 400 reviews for development."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-162",
"text": "The test set for each target domain of Blitzer et al. (2007) consists of all 2000 labeled reviews of that domain, and for the Blog domain it consists of the 7086 labeled sentences provided with the task dataset."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-163",
"text": "In all five folds half of the training examples and half of the development examples are randomly selected from the positive reviews and the other halves from the negative reviews."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-164",
"text": "We report average results across these five folds, employing the same folds for all models."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-165",
"text": "Hyper-parameter Tuning The details of the hyper-parameter tuning process for all models (including data splits to training, development and test sets) are described in the appendices."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-166",
"text": "Here we provide a summary."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-168",
"text": "**AE-SCL AND AE-SCL-SR:**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-169",
"text": "For the stochastic gradient descent (SGD) training algorithm we set the learning rate to 0.1, momentum to 0.9 and weightdecay regularization to 10 \u22125 ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-170",
"text": "The number of pivots was chosen among {100, 200, . . . , 500} and the dimensionality of h among {100, 300, 500}. For the features induced by these models we take their w h x np vector."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-171",
"text": "For AE-SCL-SR, embeddings for the unigram and bigram features were learned with word2vec (Mikolov et al., 2013) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-172",
"text": "Details about the software and the way we learn bigram representations are in the appendices."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-173",
"text": "Baselines: For SCL-MI, following (Blitzer et al., 2007) we tuned the number of pivot features (Gillick and Cox, 1989; Blitzer et al., 2006) between 500 and 1000 and the SVD dimensions among 50,100 and 150."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-174",
"text": "For MSDA we tuned the number of reconstructed features among {500, 1000, 2000, 5000, 10000}, the number of model layers among {1, 3, 5} and the corruption probability among {0.1, 0.2, . . . , 0.5}. For MSDA-DAN, we followed Ganin et al. (2016) : the \u03bb adaptation parameter is chosen among 9 values between 10 \u22122 and 1 on a logarithmic scale, the hidden layer size l is chosen among {50, 100, 200} and the learning rate \u00b5 is 10 \u22123 ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-176",
"text": "**RESULTS**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-177",
"text": "Table 1 presents our results."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-178",
"text": "In the Blitzer et al. (2007) task (top tables), AE-SCL-SR is the best performing model in 9 of 12 setups and on a unified test set consisting of the test sets of all 12 setups (the Test-All column)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-179",
"text": "AE-SCL, MSDA and MSDA-DAN perform best in one setup each."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-180",
"text": "On the unified test set, AE-SCL-SR improves over SCL-MI by 3.8% (error reduction (ER) of 14.8%) and over MSDA-DAN by 2% (ER of 8.4%), while AE-SCL improves over SCL-MI and MSDA-DAN by 2.7% (ER of 10.5%) and 0.9% (ER of 3.8%), respectively."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-181",
"text": "MSDA-DAN and MSDA perform very similarly on the unified test set (0.761 and 0.759, respectively) with generally minor differences in the individual setups."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-182",
"text": "When adapting from the product review domains to the Blog domain (bottom table), AE-SCL-SR performs best in 3 of 4 setups, providing particularly large improvements when training is in the Kitchen (K) domain."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-183",
"text": "The average improvement of AE-SCL-SR over MSDA is 5.2% and over a non-adapted classifier is 11.7%."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-184",
"text": "As before, MSDA-DAN performs similarly to MSDA on the unified test set, although the differences in the individual setups are much higher."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-185",
"text": "The differences between AE-SCL-SR and the other models are statistically significant in most cases."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-187",
"text": "Class Based Analysis cases over AE-SCL (compared to 2.19% of the positive examples where AE-SCL is better) and in 6.40% of the cases over MSDA (compared to 2.80%)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-188",
"text": "While on negative examples the pattern is reversed and AE-SCL and MSDA outperform AE-SCL-SR, this is a weaker effect which only moderates the overall superiority of AE-SCL-SR."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-189",
"text": "8 The unlabeled documents from all four domains are strongly biased to convey positive opinions (Section 4)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-190",
"text": "This is indicated, for example, by the average score given to these reviews by their authors: 4.29 (B), 4.33 (D), 3.96 (E) and 4.16 (K), on a scale of 1 to 5."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-191",
"text": "This analysis suggests that AE-SCL-SR better learns from of its unlabeled data."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-192",
"text": "Similar Pivots Recall that AE-SCL-SR aims to learn more similar representations for documents with similar pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-193",
"text": "Table 2 demonstrates this effect through pairs of test documents from 8 8 The reported numbers are averaged over the 5 folds and rounded to the closest integer, if necessary."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-194",
"text": "The comparison between AE-SCL-SR and MSDA-DAN yields a very similar pattern and is hence excluded from space considerations."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-195",
"text": "product review setups."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-196",
"text": "9 The documents contain pivot features with very similar meaning and indeed they belong to the same sentiment class."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-197",
"text": "Yet, in all cases AE-SCL-SR correctly classifies both documents, while AE-SCL misclassifies one."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-198",
"text": "The rightmost column of the table presents the difference in the ranking of the cosine similarity between the representation vectorsh of the documents in the pair, according to each of the models."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-199",
"text": "Results (in numerical values and percentage) are given with respect to all cosine similarity values between theh vectors of any document pair in the test set."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-200",
"text": "As the documents with the highest similarity are ranked 1, the positive difference between the ranks of AE-SCL and those of AE-SCL-SR indicate that AE-SCL's rank is lower."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-201",
"text": "That is, AE-SCL-SR learns more similar representations for documents with similar pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-202",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-203",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-204",
"text": "We presented a new model for domain adaptation which combines ideas from pivot based and autoencoder based representation learning."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-205",
"text": "We have demonstrated how to encode information from pre-trained word embeddings to improve the generalization of our model across examples with semantically similar pivot features."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-206",
"text": "We demonstrated strong performance on cross-domain sentiment classification tasks with 16 domain pairs and provided initial qualitative analysis that supports the intuition behind our model."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-207",
"text": "Our approach is general and applicable for a large number of NLP tasks (for AE-SCL-SR this holds as long as the pivot features can be embedded in a vector space)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-208",
"text": "In future we would like to adapt our model to more general domain adaptation setups such as where adaptation is performed between sets of source and target domains and where some labeled data from the target domain(s) is available."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-210",
"text": "**A HYPERPARAMETER TUNING**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-211",
"text": "This appendix describes the hyper-parameter tuning process for the models compared in our paper."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-212",
"text": "Some of these details appear in the full paper, but here we provide a detailed description."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-213",
"text": "AE-SCL and AE-SCL-SR We tuned the parameters of both our models in two steps."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-214",
"text": "First, we randomly split the unlabeled data from both the source and the target domains in a 80/20 manner and combine the large subsets together and the small subsets together so that to generate unlabeled training and validation sets."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-215",
"text": "On these training/validation sets we tune the hyperparameters of the stochastic gradient descent (SGD) algorithm we employ to train our networks: learning rate (0.1), momentum (0.9) and weight-decay regularization (10 \u22125 )."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-216",
"text": "Note that these values are tuned on the fully unsupervised task of predicting pivot features occurrence from non-pivot input representation, and are then employed in all the source-traget domain combinations, across all folds."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-217",
"text": "10 After tuning the SGD parameters, in the second step we tuned the model's hyper-parameters for each fold of each source-target setup."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-218",
"text": "The hyperparameters are the number of pivots (100 to 500 in steps 100) and the dimensionality of h (100 to 500 in steps of 200)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-219",
"text": "We select the values that yield the best performing model when training on the training set and evaluating on the training domain development set of each fold."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-220",
"text": "11 We further explored the quality of the various intermediate representations generated by the models as sources of features for the sentiment classifier."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-221",
"text": "The vectors we considered are: w h x np , h = \u03c3(w h x np ), w r h and r = \u03c3(w r h)."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-222",
"text": "We chose the w h x np vector, denoted in the paper in the paper withh."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-223",
"text": "10 Both AE-SCL and AE-SCL-SR converged to the same values."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-224",
"text": "This is probably because for each parameter we consider only a handful of values: learning rate (0.01,0.1,1), momentum (0.1,0.,5,0.9) and weight-decay regularization (10 \u22124 ,10 \u22125 , 10 \u22126 ). 11 When tuning the SGD parameters we experimented with 100 and 500 pivots and dimensionality of 100 and 500 for h."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-225",
"text": "For AE-SCL-SR, embeddings for the unigram and bigram features were learned with word2vec (Mikolov et al., 2013) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-226",
"text": "12 To learn bigram representations, in cases where a bigram pivot (w1,w2) is included in a sentence we generate the triplet w1,w1-w2, w2."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-227",
"text": "For example, the sentence It was a very good book with the bigram pivot very good is re-written as: It was a very very-good good book."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-228",
"text": "The revised corpus is then fed into word2vec."
},
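The bigram rewriting step can be sketched as follows; the tokenization (whitespace split) and the pivot set are simplified for illustration:

```python
def rewrite_with_bigram_pivots(tokens, bigram_pivots):
    """Insert a joined w1-w2 token between the members of each bigram pivot."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok, tokens[i + 1]) in bigram_pivots:
            out.append(tok + "-" + tokens[i + 1])
    return out

sentence = "It was a very good book".split()
rewritten = rewrite_with_bigram_pivots(sentence, {("very", "good")})
# -> ["It", "was", "a", "very", "very-good", "good", "book"]
```

The rewritten corpus then lets word2vec learn a vector for the joined very-good token alongside the unigram vectors.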
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-229",
"text": "The dimension of the hidden layer h of AE-SCL-SR is the dimension of the induced embeddings."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-230",
"text": "In both parameter tuning steps we use the unlabeled validation data for early stopping: the SGD algorithm stops at the first iteration where the validation data error increases rather then when the training error or the loss function are minimized."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-231",
"text": "(Blitzer et al., 2007) we used 1000 pivot features ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-232",
"text": "13 The number of SVD dimensions was tuned on the labeled development data to the best value among 50,100 and 150."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-233",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-234",
"text": "**SCL-MI FOLLOWING**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-235",
"text": "MSDA Using the labeled dev."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-236",
"text": "data we tuned the number of reconstructed features (among 500, 1000, 2000, 5000 and 10000) the number of model layers (among {1, 3, 5}) and the corruption probability (among {0.1, 0.2, . . . , 0.5})."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-237",
"text": "For details on these hyper-parameters see (Chen et al., 2012) ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-238",
"text": "Following Ganin et al. (2016) we tuned the hyperparameters on the labeled development data as follows."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-239",
"text": "The \u03bb adaptation parameter is chosen among 9 values between 10 \u22122 and 1 on a logarithmic scale."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-240",
"text": "The hidden layer size l is chosen among {50, 100, 200} and the learning rate \u00b5 is fixed to 10 \u22123 ."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-241",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-242",
"text": "**MSDA-DAN**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-243",
"text": "----------------------------------"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-244",
"text": "**B EXPERIMENTAL CHOICES**"
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-245",
"text": "Variants of the Product Review Data There are two releases of the datasets of the Blitzer et al. (2007) cross-domain product review task."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-246",
"text": "We use the one from http://www.cs.jhu."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-247",
"text": "edu/\u02dcmdredze/datasets/sentiment/ index2.html where the data is imbalanced, consisting of more positive than negative reviews."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-248",
"text": "We believe that our setup is more realistic as when collecting unlabeled data, it is hard to get a balanced set."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-249",
"text": "Note that Blitzer et al. (2007) used the other release where the unlabeled data consists of the same number of positive and negative reviews."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-250",
"text": "Test Set Size While Blitzer et al. (2007) used only 400 target domain reviews for test, we use the entire set of 2000 reviews."
},
{
"sent_id": "a774b918013dbf60eb8cc0ad1de2f9-C001-251",
"text": "We believe that this decision yields more robust and statistically significant results."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-17"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-18"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-45"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-106"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-162"
]
],
"cite_sentences": [
"a774b918013dbf60eb8cc0ad1de2f9-C001-17",
"a774b918013dbf60eb8cc0ad1de2f9-C001-18",
"a774b918013dbf60eb8cc0ad1de2f9-C001-45",
"a774b918013dbf60eb8cc0ad1de2f9-C001-106",
"a774b918013dbf60eb8cc0ad1de2f9-C001-162"
]
},
"@EXT@": {
"gold_contexts": [
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-28"
]
],
"cite_sentences": [
"a774b918013dbf60eb8cc0ad1de2f9-C001-28"
]
},
"@USE@": {
"gold_contexts": [
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-91"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-127"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-137"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-161"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-173"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-245",
"a774b918013dbf60eb8cc0ad1de2f9-C001-246"
]
],
"cite_sentences": [
"a774b918013dbf60eb8cc0ad1de2f9-C001-91",
"a774b918013dbf60eb8cc0ad1de2f9-C001-127",
"a774b918013dbf60eb8cc0ad1de2f9-C001-137",
"a774b918013dbf60eb8cc0ad1de2f9-C001-161",
"a774b918013dbf60eb8cc0ad1de2f9-C001-173",
"a774b918013dbf60eb8cc0ad1de2f9-C001-245"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-178"
]
],
"cite_sentences": [
"a774b918013dbf60eb8cc0ad1de2f9-C001-178"
]
},
"@DIF@": {
"gold_contexts": [
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-248",
"a774b918013dbf60eb8cc0ad1de2f9-C001-249"
],
[
"a774b918013dbf60eb8cc0ad1de2f9-C001-250"
]
],
"cite_sentences": [
"a774b918013dbf60eb8cc0ad1de2f9-C001-249",
"a774b918013dbf60eb8cc0ad1de2f9-C001-250"
]
}
}
},
"ABC_b2392c74f17fb2c0b6a0f19d16bc99_4": {
"x": [
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-32",
"text": "The rest of the paper is organized as follows."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-2",
"text": "Word vectors are at the core of many natural language processing tasks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-3",
"text": "Recently, there has been interest in postprocessing word vectors to enrich their semantic information."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-4",
"text": "In this paper, we introduce a novel word vector postprocessing technique based on matrix conceptors (Jaeger 2014), a family of regularized identity maps."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-5",
"text": "More concretely, we propose to use conceptors to suppress those latent features of word vectors having high variances."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-6",
"text": "The proposed method is purely unsupervised: it does not rely on any corpus or external linguistic database."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-7",
"text": "We evaluate the post-processed word vectors on a battery of intrinsic lexical evaluation tasks, showing that the proposed method consistently outperforms existing state-of-the-art alternatives."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-8",
"text": "We also show that post-processed word vectors can be used for the downstream natural language processing task of dialogue state tracking, yielding improved results in different dialogue domains."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-11",
"text": "Distributional representations of words, better known as word vectors, are a cornerstone of practical natural language processing (NLP)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-12",
"text": "Examples of word vectors include Word2Vec (Mikolov et al. 2013) , GloVe (Pennington, Socher, and Manning 2014) , Eigenwords (Dhillon, Foster, and Ungar 2015) , and Fasttext (Bojanowski et al. 2017) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-13",
"text": "These word vectors are usually referred to as distributional word vectors, as their training methods rely on the distributional hypothesis of semantics (Firth 1957) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-14",
"text": "Recently, there has been interest in post-processing distributional word vectors to enrich their semantic content."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-15",
"text": "The post-process procedures are usually performed in a lightweight fashion, i.e., without re-training word vectors on a text corpus."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-16",
"text": "In one line of study, researchers used supervised methods to enforce linguistic constraints (e.g., synonym relations) on word vectors (Faruqui et al. 2015; Mrksic et al. 2016; 2017) , where the linguistic constraints are extracted from an external linguistic knowledge base such as WordNet (Miller 1995) and PPDB (Pavlick et al. 2015) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-17",
"text": "In another line of study, researchers devised unsupervised methods to post-process word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-18",
"text": "Spectral-decomposition methods such as singular value decomposition (SVD) and principal component analysis (PCA) are usually used in this line of research (Caron 2001; Bullinaria and Levy 2012; Turney 2012; Levy and Goldberg 2014; Levy, Goldberg, and Dagan 2015; Mu and Viswanath 2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-19",
"text": "The current paper is in line with the second, unsupervised, research direction."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-20",
"text": "Among different unsupervised word vector postprocessing schemes, the all-but-the-top approach (Mu and Viswanath 2018 ) is a prominent example."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-21",
"text": "Empirically studying the latent features encoded by principal components (PCs) of distributional word vectors, Mu and Viswanath (2018) found that the variances explained by the leading PCs \"encode the frequency of the word to a significant degree\"."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-22",
"text": "Since word frequencies are arguably unrelated to lexical semantics, they recommend removing such leading PCs from word vectors using a PCA reconstruction."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-23",
"text": "The current work advances the findings of Mu and Viswanath (2018) and improves their post-processing scheme."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-24",
"text": "Instead of discarding a fixed number of PCs, we softly filter word vectors using matrix conceptors (Jaeger 2014; 2017) , which characterize the linear space of those word vector features having high variances -the features most contaminated by word frequencies according to Mu and Viswanath (2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-25",
"text": "The proposed approach is mathematically simple and computationally efficient, as it is founded on elementary linear algebra."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-26",
"text": "Besides these traits, it is also practically effective: using a standard set of lexical-level intrinsic evaluation tasks and a deep neural network-based dialogue state tracking task, we show that conceptor-based post-processing considerably enhances linguistic regularities captured by word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-27",
"text": "A more detailed list of our contributions are:"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-28",
"text": "1. We propose an unsupervised algorithm that leverages Boolean operations of conceptors to post-process word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-29",
"text": "The resulting word vectors achieve up to 18.86% and 28.34% improvement on the SimLex-999 and SimVerb-3500 dataset relative to the original word representations."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-30",
"text": "2. A closer look at the proposed algorithm reveals commonalities across several existing post-processing techniques for neural-based word vectors and pointwise mutual information (PMI) matrix based word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-31",
"text": "Unlike the existing alternatives, the proposed approach is flexible enough to remove lexically-unrelated noise, while general-purpose enough to handle word vectors induced by different learning algorithms."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-33",
"text": "We first briefly review the principal component nulling approach for unsupervised word vector post-processing introduced in (Mu and Viswanath 2018) , upon which our work is based."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-34",
"text": "We then introduce our proposed approach, Conceptor Negation (CN)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-35",
"text": "Analytically, we reveal the links and differences between the CN approach and the existing alternatives."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-36",
"text": "Finally, we showcase the effectiveness of the CN method with numerical experiments 1 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-38",
"text": "**NOTATION**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-39",
"text": "We assume a collection of words w \u2208 V , where V is a vocabulary set."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-40",
"text": "Each word w \u2208 V is embedded as a n dimensional real valued vector v w \u2208 R n ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-41",
"text": "An identity matrix will be denoted by I. For a vector v, we denote diag(v) as the diagonal matrix with v on its diagonal."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-42",
"text": "We write [n] = {1, 2, \u00b7 \u00b7 \u00b7 , n} for a positive integer n."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-44",
"text": "**POST-PROCESSING WORD VECTORS BY PC REMOVAL**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-45",
"text": "This section is an overview of the all-but-the-top (ABTT) word vector post-processing approach introduced by Mu and Viswanath (2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-46",
"text": "In brief, the ABTT approach is based on two key observations of distributional word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-47",
"text": "First, using a PCA, Mu and Viswanath (2018) revealed that word vectors are strongly influenced by a few leading principal components (PCs)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-48",
"text": "Second, they provided an interpretation of such leading PCs: they empirically demonstrated a correlation between the variances explained by the leading PCs and word frequencies."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-49",
"text": "Since word frequencies are arguably unrelated to lexical semantics, they recommend eliminating top PCs from word vectors via a PCA reconstruction."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-50",
"text": "This method is described in Algorithm 1."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-51",
"text": "Algorithm 1: The all-but-the-top (ABTT) algorithm for word vector post-processing."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-52",
"text": "Input : (i) {v w \u2208 R n : w \u2208 V }: word vectors with a vocabulary V ; (ii) d: the number of PCs to be removed."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-53",
"text": "1 Center the word vectors: Letv w := v w \u2212 \u00b5 for all w \u2208 V , where \u00b5 is the mean of the input word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-54",
"text": "In practice, Mu and Viswanath (2018) found that the improvements yielded by ABTT are particularly impressive for word similarity tasks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-55",
"text": "Here, we provide a straightforward interpretation of the effects."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-56",
"text": "Concretely, consider two arbitrary words w 1 and w 2 with word vectors v w1 and v w2 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-57",
"text": "Without loss of generality, we assume v w1 and v w2 are normalized, i.e., v w1 2 = v w2 2 = 1."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-58",
"text": "Given PCs {u 1 , \u00b7 \u00b7 \u00b7 , u n } of the word vectors {v w : w \u2208 V }, we re-write v w1 and v w2 via linear combinations with respect to the basis {u 1 , \u00b7 \u00b7 \u00b7 , u n }: v w1 := n i=1 \u03b2 i u i and v w2 := n i=1 \u03b2 i u i , for some \u03b2 i , \u03b2 i \u2208 R and for all i \u2208 [n]."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-59",
"text": "We see"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-60",
"text": "where ( * ) holds because the word vectors were assumed to be normalized and ( * * ) holds because {u 1 , \u00b7 \u00b7 \u00b7 , u n } is an orthonormal basis of R n ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-61",
"text": "Via Equation 2, the similarity between word w 1 and w 2 can be seen as the overall \"compatibility\" of their measurements \u03b2 i and \u03b2 i with respect to each latent feature u i ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-62",
"text": "If leading PCs encode the word frequencies, removing the leading PCs, in theory, help the word vectors capture semantic similarities, and consequently improve the experiment results of word similarity tasks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-64",
"text": "**POST-PROCESSING WORD VECTORS VIA CONCEPTOR NEGATION**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-65",
"text": "Removing the leading PCs of word vectors using the ABTT algorithm described above is effective in practice, as seen in the elaborate experiments conducted by Mu and Viswanath (2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-66",
"text": "However, the method comes with a potential limitation: for each latent feature taking form as a PC of the word vectors, ABTT either completely removes the feature or keeps it intact."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-67",
"text": "For this reason, Khodak et al. (2018) argued that ABTT is liable either to not remove enough noise or to cause too much information loss."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-68",
"text": "The objective of this paper is to address the limitations of ABTT."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-69",
"text": "More concretely, we propose to use matrix conceptors (Jaeger 2017) to gate away variances explained by the leading PCs of word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-70",
"text": "As will be seen later, the proposed Conceptor Negation method removes noise in a \"softer\" manner when compared to ABTT."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-71",
"text": "We show that it shares the spirit of an eigenvalue weighting approach for PMI-based word vector post-processing."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-72",
"text": "We proceed by providing the technical background of conceptors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-74",
"text": "**CONCEPTORS**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-75",
"text": "Conceptors are a family of regularized identity maps introduced by Jaeger (2014) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-76",
"text": "We present a sketch of conceptors by heavily re-using (Jaeger 2014; He and Jaeger 2018) sometimes verbatim."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-77",
"text": "In brief, a matrix conceptor C for some vector-valued random variable x taking values in R N is defined as a linear transformation that minimizes the following loss function."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-78",
"text": "where \u03b1 is a control parameter called aperture, \u00b7 2 is the 2 norm, and \u00b7 F is the Frobenius norm."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-79",
"text": "This optimization problem has a closed-form solution"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-80",
"text": "where R = E[xx ] and I are N \u00d7 N matrices."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-81",
"text": "If R = \u03a8T \u03a8 is the SVD of R, then the SVD of C is given as \u03a8S\u03a8 , where the singular values s i of C can be written in terms of the singular values t i of R:"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-82",
"text": "In intuitive terms, C is a soft projection matrix on the linear subspace where the samples of x lie, such that for a vector y in this subspace, C acts like the identity: Cy \u2248 y, and when some orthogonal to the subspace is added to y, C reconstructs y: C(y + ) \u2248 y."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-83",
"text": "Moreover, operations that satisfy most laws of Boolean logic such as NOT \u00ac, OR \u2228, and AND \u2227, can be defined on matrix conceptors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-84",
"text": "These operations all have interpretation on the data level, i.e., on the distribution of the random variable x (details in (Jaeger 2014, Section 3.9) )."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-85",
"text": "Among these operations, the negation operation NOT \u00ac is relevant for the current paper:"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-86",
"text": "Intuitively, the negated conceptor, \u00acC, softly projects the data onto a linear subspace that can be roughly understood as the orthogonal complement of the subspace characterized by C."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-88",
"text": "**POST-PROCESSING WORD VECTORS WITH CONCEPTOR NEGATION**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-89",
"text": "This subsection explains how conceptors can be used to post-process word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-90",
"text": "The intuition behind our approach is simple."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-91",
"text": "Consider a random variable x taking values on word vectors {v w \u2208 R n : w \u2208 V }."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-92",
"text": "We can estimate a conceptor C that describes the distribution of x using Equation 4."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-93",
"text": "Recall that (Mu and Viswanath 2018) found that the directions with which x has the highest variances encode word frequencies, which are unrelated to word semantics."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-94",
"text": "To suppress such word-frequency related features, we can simply pass all word vectors through the negated conceptor \u00acC, so that \u00acC dampens the directions with which x has the highest variances."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-95",
"text": "This simple method is summarized in Algorithm 2."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-96",
"text": "Algorithm 2: The conceptor negation (CN) algorithm for word vector post-processing."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-97",
"text": "Input : (i) {v w \u2208 R n : w \u2208 V }: word vectors of a vocabulary V ; (ii) \u03b1 \u2208 R: a hyper-parameter 1 Compute the conceptor C from word vectors:"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-98",
"text": "The hyper-parameter \u03b1 of Algorithm 2 governs the \"sharpness\" of the suppressing effects on word vectors employed by \u00acC. Although in this work we are mostly interested in \u03b1 \u2208 (0, \u221e), it is nonetheless illustrative to consider the extreme cases where \u03b1 = 0 or \u221e: for \u03b1 = 0, \u00acC will be an identity matrix, meaning that word vectors will be kept intact; for \u03b1 = \u221e, \u00acC will be a zero matrix, meaning that all word vectors will be nulled to zero vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-99",
"text": "The computational costs of the Algorithm 2 are dominated by its step 1: one needs to calculate the matrix product"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-100",
"text": "n\u00d7|V | being the matrix whose columns are word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-101",
"text": "Since modern word vectors usually come with a vocabulary of some millions of words (e.g., Google News Word2Vec contains 3 million tokens), performing a matrix product on such large matrices [v w ] w\u2208V is computationally laborious. But considering that there are many uninteresting words in the vast vocabulary, we find it is empirically beneficial to only use a subset of the vocabulary, whose words are not too peculiar 2 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-102",
"text": "Specifically, borrowing the word list provided by Arora, Liang, and Ma (2017) 3 , we use the words that appear at least 200 times in a Wikipedia dump 2015 to estimate R. This greatly boosts the computation speed."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-148",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-103",
"text": "Somewhat surprisingly, the trick also improves the performance of Algorithm 2."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-104",
"text": "This might due to the higher quality of word vectors of common words compared with infrequent ones."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-106",
"text": "**ANALYTIC COMPARISON WITH OTHER METHODS**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-107",
"text": "Since most of the existing unsupervised word vector postprocessing methods are ultimately based on linear data transformations, we hypothesize that there should be commonalities between the methods."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-108",
"text": "In this section, we show CN resembles ABTT in that both methods can be interpreted as \"spectral encode-decode processes\"; when applied to word vectors induced by a pointwise mutual information (PMI) matrix, CN shares the spirit with the eigenvalue weighting (EW) post-processing (Caron 2001; Levy, Goldberg, and Dagan 2015) : they both assign weights on singular vectors of a PMI matrix."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-109",
"text": "A key distinction of CN is that it does soft noise removal (unlike ABTT) and that it is not restricted to post-processing PMI-matrix induced word vectors (unlike EW)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-111",
"text": "**RELATION TO ABTT**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-112",
"text": "In this subsection, we reveal the connection between CN and ABTT."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-113",
"text": "To do this, we will re-write the last step of both algorithms into different formats."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-114",
"text": "For the convenience of comparison, throughout this section, we will assume that the word vectors {v w } v\u2208V in Algorithm 1 and Algorithm 2 possess a zero mean, although this is not a necessary requirement in general."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-115",
"text": "We first re-write the equation in step 3 of Algorithm 1."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-116",
"text": "We let U be the matrix whose columns are the PCs estimated from the word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-117",
"text": "Let U :,1:d be the first d columns of U ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-118",
"text": "It is clear that step 2 of Algorithm 1, under the assumption that word vectors possess zero mean, can be re-written as"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-119",
"text": "Next, we re-write step 3 of the Conceptor Negation (CN) method of algorithm 2. Note that for word vectors with zero mean, the estimation for R is a (sample) covariance matrix of a random variable taking values as word vectors, and therefore the singular vectors of R are PCs of word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-120",
"text": "Letting R = U \u03a3U be the SVD of R, the equation in step 3 of Algorithm 2 can be re-written via elementary linear algebraic operations:"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-121",
"text": "where \u03c3 1 , \u00b7 \u00b7 \u00b7 , \u03c3 n are diagonal entries of \u03a3."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-122",
"text": "Examining Equations 6 and 7, we see ABTT and CN share some similarities."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-123",
"text": "In particular, they both can be unified into \"spectral encode-decode processes,\" which contain the following three steps: \u03c3n+\u03b1 \u22122 ]) respectively for ABTT and CN."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-124",
"text": "3."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-125",
"text": "PC decoding."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-126",
"text": "Transform the data back to the usual coordinates using the matrix U ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-127",
"text": "With the above encode-decode interpretation, we see CN differ from ABTT is its variance gating step."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-128",
"text": "In particular, ABTT does a hard gating, in the sense that the diagonal entries of the variance gating matrix (call them variance gating coefficients) take values in the set {0, 1}. The CN approach, on the other hand, does a softer gating as the entries take values in (0, 1):"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-129",
"text": "for all 1 \u2264 i < j \u2264 n and \u03b1 \u2208 (0, \u221e)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-130",
"text": "To illustrate the gating effects, we plot the variance gating coefficients for ABTT and CN for Word2Vec in Figure 1 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-132",
"text": "**RELATION WITH EIGENVALUE WEIGHTING**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-133",
"text": "We relate the conceptor approach to the eigenvalue weighting approach for post-processing PMI-based word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-134",
"text": "This effort is in line with the ongoing research in the NLP community that envisages a connection between \"neural word embedding\" and PMI-matrix factorization based word embedding (Levy and Goldberg 2014; Pennington, Socher, and Manning 2014; Levy, Goldberg, and Dagan 2015) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-135",
"text": "In the PMI approach for word association modeling, for each word w and each context (i.e., sequences of words) q, the PMI matrix M assigns a value for the pair (w, q): M (w, q) = log P(w,q) P(w)P(q) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-136",
"text": "In practical NLP tasks, the sets of words and contexts tend to be large, and therefore, directly working with M is inconvenient."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-137",
"text": "To lift the problem, one way is to perform a truncated SVD on M , factorizing M into the product of three smaller matrices M \u2248 \u0398 :,1:n D 1:n,1:n \u0393 :,1:n , where \u0398 :,1:n is the first n left singular vectors of the matrix M , D 1:n,1:n is the diagonal matrix containing n leading singular values of M , and \u0393 :,1:n are the first n right singular vectors of the matrix M ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-138",
"text": "A generic way to induce word vectors from M is to let E := \u0398 :,1:n D 1:n,1:n \u2208 R |V |\u00d7n , which is a matrix containing word vectors as rows."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-139",
"text": "Coined by Levy, Goldberg, and Dagan (2015) , the term eigenvalue weighting 4 (EW) refers to a post-processing technique for PMI-matrix-induced word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-140",
"text": "This technique has its root in Latent Semantic Analysis (LSA): Caron (2001) first propose to define the post-processed version of E as E EW := \u0398 :,1:n D p 1:n,1:n , where p is the weighting exponent determining the relative weights assigned to each singular vector of \u0398 :,1:n ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-141",
"text": "While an optimal p depends on specific task demands, previous research suggests that p < 1 is generally preferred, i.e., the contributions of the initial singular vectors of M should be suppressed."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-142",
"text": "For instance, p = 0, 0.25, and 0.5 are recommended in (Caron 2001; Bullinaria and Levy 2012; Levy, Goldberg, and Dagan 2015) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-143",
"text": "Bullinaria and Levy (2012) argue that the initial singular vectors of M tend to be contaminated most by aspects other than lexical semantics."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-144",
"text": "We now show that applying CN on the PMI-matrix-based word embedding E := \u0398 :,1:n D 1:n,1:n has a tantamount effect with \"suppressing initial singular vectors\" of EW."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-145",
"text": "Acting the negated \u00acC on word vectors of E (i.e., rows of E), we get the post-processed word vectors as rows of the\u1ebc CN :"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-146",
"text": "for all 1 \u2264 i < j \u2264 n and \u03b1 \u2208 (0, \u221e), these weights suppress the contribution of the initial singular vectors, similar to what has been done in EW."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-149",
"text": "We evaluate the post-processed word vectors on a variety of lexical-level intrinsic tasks and a down-stream deep learning task."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-150",
"text": "We use the publicly available pre-trained Google News Word2Vec (Mikolov et al. 2013 ) 5 and Common Crawl GloVe 6 (Pennington, Socher, and Manning 2014) to perform lexical-level experiments."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-151",
"text": "For CN, we fix \u03b1 = 2 for Word2Vec and GloVe throughout the experiments 7 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-152",
"text": "For ABTT, we set d = 3 for Word2Vec and d = 2 for GloVe, as what has been suggested by Mu and Viswanath (2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-153",
"text": "Word similarity We test the performance of CN on seven benchmarks that have been widely used to measure word similarity: the RG65 (Rubenstein and Goodenough 1965) , the WordSim-353 (WS) (Finkelstein et al. 2002) , the rarewords (RW) (Luong, Socher, and Manning 2013) , the MEN dataset (Bruni, Tran, and Baroni 2014) , the MTurk (Radinsky et al. 2011), the SimLex-999 (SimLex) (Hill, Reichart, and Korhonen 2015) , and the SimVerb-3500 (Gerz et al. 2016) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-154",
"text": "To evaluate the word similarity, we calculate the cosine distance between vectors of two words using Equation 1."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-155",
"text": "We report the Spearman's rank correlation coefficient (Myers and Well 1995) of the estimated rankings against the rankings by humans in Table 1 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-156",
"text": "We see that the proposed CN method consistently outperforms the original word embedding (orig.) and the post-processed word embedding by ABTT for most of the benchmarks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-157",
"text": "Table 1 : Post-processing results (Spearman's rank correlation coefficient \u00d7 100) under seven word similarity benchmarks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-158",
"text": "The baseline results (orig. and ABTT) are collected from (Mu and Viswanath 2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-159",
"text": "The improvement of results by CN are particularly impressive for two \"modern\" word similarity benchmarks SimLex and SimVerb -these two benchmarks carefully distinguish genuine word similarity from conceptual association (Hill, Reichart, and Korhonen 2015) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-160",
"text": "For instance, coffee is associated with cup but by no means similar to cup, a confusion often made by earlier benchmarks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-161",
"text": "In particular, SimLex has been heavily used to evaluate word vectors yielded by supervised word vector fine-tuning algorithms, which perform gradient descent on word vectors with respect to linguistic constraints such as synonym and antonym relationships of words extracted from WordNet and/or PPDB."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-162",
"text": "When compared to a recent supervised approach of counter-fitting."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-163",
"text": "Our results on SimLex are comparable to those reported by Mrksic et al. (2016) , as shown in (Mrksic et al. 2016 , Table 2 ) and (Mrksic et al. 2017 , Table 3) )."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-164",
"text": "The linguistic constraints for Counter-Fitting are synonym (syn.) and/or antonym (ant.) relationships extracted from English PPDB."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-166",
"text": "**SEMANTIC TEXTUAL SIMILARITY**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-167",
"text": "In this subsection, we showcase the effectiveness of the proposed post-processing method using semantic textual similarity (STS) benchmarks, which are designed to test the semantic similarities of sentences."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-168",
"text": "We use 2012-2015 SemEval STS tasks (Agirre et al. 2012; 2014; and the 2012 SemEval Semantic Related task (SICK) (Marelli et al. 2014) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-169",
"text": "Concretely, for each pair of sentences, s 1 and s 2 , we computed v s1 and v s2 by averaging their constituent word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-170",
"text": "We then calculated the cosine distance between two sentence vectors v s1 and v s2 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-171",
"text": "This naive method has been shown to be a strong baseline for STS tasks (Wieting et al. 2016) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-172",
"text": "As in Agirre et al. (2012) , we used Pearson correlation of the estimated rankings of sentence similarity against the rankings by humans to assess model performance."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-173",
"text": "In Table 7 , we report the average result for the STS tasks each year (detailed results are in the supplemental material)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-174",
"text": "Again, our CN method consistently outperforms the alternatives."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-175",
"text": "Table 3 : Post-processing results (\u00d7100) on the semantic textual similarity tasks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-176",
"text": "The baseline results (orig. and ABTT) are collected from (Mu and Viswanath 2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-177",
"text": "Concept Categorization In the concept categorization task, we used k-means to cluster words into concept categories based on their vector representations (for example, \"bear\" and \"cat\" belong to the concept category of animals)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-178",
"text": "We use three standard datasets: (i) a rather small dataset ESSLLI 2008 (Baroni, Evert, and Lenci 2008) that contains 44 concepts in 9 categories; (ii) the Almuhareb-Poesio (AP) (Poesio and Almuhareb 2005) , which contains 402 concepts divided into 21 categories; and (iii) the BM dataset (Battig and Montague 1969) that 5321 concepts divided into 56 categories."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-179",
"text": "Note that the datasets of ESSLLI, AP, and BM are increasingly challenging for clustering algorithms, due to the increasing numbers of words and categories."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-180",
"text": "Following (Baroni, Dinu, and Kruszewski 2014; Schnabel et al. 2015; Mu and Viswanath 2018) , we used \"purity\" of clusters (Manning, Raghavan, and Sch\u00fctze 2008, Section 16.4) as the evaluation criterion."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-181",
"text": "That the results of k-means heavily depend on two hyper-parameters: (i) the number of clusters and (ii) the initial centroids of clusters."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-182",
"text": "We follow previous research (Baroni, Dinu, and Kruszewski 2014; Schnabel et al. 2015; Mu and Viswanath 2018) to set k as the ground-truth number of categories."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-183",
"text": "The settings of the initial centroids of clusters, however, are less well-documented in previous work -it is not clear how many initial centroids have been sampled, or if different centroids have been sampled at all."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-184",
"text": "To avoid the influences of initial centroids in k-means (which are particularly undesirable for this case because word vectors live in R 300 ), in this work, we simply fixed the initial centroids as the average of original, ABTT-processed, and CN-processed word vectors respectively from ground-truth categories."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-185",
"text": "This initialization is fair because all post-processing methods make use of the ground-truth information equally, similar to the usage of the ground-truth numbers of clusters."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-186",
"text": "We report the experiment results in Table 4 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-187",
"text": "The performance of the proposed methods and the baseline methods performed equally well for the smallest dataset Table 4 : Purity (\u00d7 100) of the clusters in concept categorization task with fixed centroids."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-188",
"text": "ESSLLI."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-189",
"text": "As the dataset got larger, the results differed and the proposed CN approach outperformed the baselines."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-190",
"text": "A Downstream NLP task: Neural Belief Tracker The experiments we have reported so far are all intrinsic lexical evaluation benchmarks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-191",
"text": "Only evaluating the post-processed word vectors using these benchmarks, however, invites an obvious critique: the success of intrinsic evaluation tasks may not transfer to downstream NLP tasks, as suggested by previous research (Schnabel et al. 2015) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-192",
"text": "Indeed, when supervised learning tasks are performed, the post-processing methods such as ABTT and CN can in principle be absorbed into a classifier such as a neural network."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-193",
"text": "Nevertheless, good initialization for classifiers is crucial."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-194",
"text": "We hypothesize that the post-processed word vectors serve as a good initialization for those downstream NLP tasks that semantic knowledge contained in word vectors is needed."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-195",
"text": "To validate this hypothesis, we conducted an experiment using Neural Belief Tracker (NBT), a deep neural network based dialogue state tracking (DST) model (Mrksic et al. 2017; Mrk\u0161i\u0107 and Vuli\u0107 2018) ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-196",
"text": "As a concrete example to illustrate the purpose of the task, consider a dialogue system designed to help users find restaurants."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-197",
"text": "When a user wants to find a Sushi restaurant, the system is expected to know that Japanese restaurants have a higher probability to be a good recommendation than Italian restaurants or Thai restaurants."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-198",
"text": "Word vectors are important for this task because NBT needs to absorb useful semantic knowledge from word vectors using a neural network."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-199",
"text": "In our experiment with NBT, we used the model specified in (Mrk\u0161i\u0107 and Vuli\u0107 2018) with default hyper-parameter settings 8 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-200",
"text": "We report the goal accuracy, a default DST performance measure, defined as the proportion of dialogue turns where all the user's search goal constraints match with the model predictions."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-201",
"text": "The test data was Wizard-of-Oz (WOZ) 2.0 (Wen et al. 2017) , where the goal constraints of users were divided into three domains: food, price range, and area."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-202",
"text": "The experiment results are reported in Table 5 ."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-203",
"text": "Further discussions Besides the NBT task, we have also tested ABTT and CN methods on other downstream NLP tasks such as text classification (not reported)."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-204",
"text": "We found that ABTT and CN yield equivalent results in such tasks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-205",
"text": "One explanation is that the ABTT and CN post-processed word vectors are different only up to a small perturbation."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-206",
"text": "With a sufficient amount of training data and an appropriate regularization method, a neural network should generalize over such a perturbation."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-207",
"text": "With a relatively small training data (e.g., the 600 dialogues for training NBT task), however, we found that word vectors as initializations matters, and in such cases, CN post-processed word vectors yield favorable results."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-208",
"text": "Another interesting finding is that having tested ABTT and CN on Fasttext (Bojanowski et al. 2017) , we found that neither post-processing method provides visible gain."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-209",
"text": "We hypothesize that this might be because Fasttext includes subword (character-level) information in its word representation during training, which suppresses the word frequency features contained in word vectors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-210",
"text": "It remains for future work to validate this hypothesis."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-211",
"text": "----------------------------------"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-212",
"text": "**CONCLUSION**"
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-213",
"text": "We propose a simple yet effective method for postprocessing word vectors via the negation operation of conceptors."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-214",
"text": "With a battery of intrinsic evaluation tasks and a down-stream deep-learning empowered dialogue state tracking task, the proposed method enhances linguistic regularities captured by word vectors and consistently improves performance over existing alternatives."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-215",
"text": "There are several possibilities for future work."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-216",
"text": "We envisage that the logical operations and abstract ordering admitted by conceptors can be used in other NLP tasks."
},
{
"sent_id": "b2392c74f17fb2c0b6a0f19d16bc99-C001-217",
"text": "As concrete examples, the AND \u2227 operation can be potentially applied to induce and fine-tune bi-lingual word vectors, by mapping word representations of individual languages into a shared linear space; the OR \u2228 together with NOT \u00ac operation can be used to study the vector representations of polysemous words, by joining and deleting sense-specific vector representations of words; the abstraction ordering \u2264 is a natural tool to study graded lexical entailment of words."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-18"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-20"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-21"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-24"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-45"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-47"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-54"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-65"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-93"
]
],
"cite_sentences": [
"b2392c74f17fb2c0b6a0f19d16bc99-C001-18",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-20",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-21",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-24",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-45",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-47",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-54",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-65",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-93"
]
},
"@EXT@": {
"gold_contexts": [
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-23"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-33"
]
],
"cite_sentences": [
"b2392c74f17fb2c0b6a0f19d16bc99-C001-23",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-33"
]
},
"@USE@": {
"gold_contexts": [
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-152"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-158"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-176"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-180"
],
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-182"
]
],
"cite_sentences": [
"b2392c74f17fb2c0b6a0f19d16bc99-C001-152",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-158",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-176",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-180",
"b2392c74f17fb2c0b6a0f19d16bc99-C001-182"
]
},
"@SIM@": {
"gold_contexts": [
[
"b2392c74f17fb2c0b6a0f19d16bc99-C001-152"
]
],
"cite_sentences": [
"b2392c74f17fb2c0b6a0f19d16bc99-C001-152"
]
}
}
},
"ABC_7c8f54479ce1f9d81b49839425f58e_4": {
"x": [
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-135",
"text": "**MULTINLI**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-20",
"text": "In this paper we focus on the sentence embedding approach."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-2",
"text": "Recurrent neural networks have proven to be very effective for natural language inference tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-3",
"text": "We build on top of one such model, namely BiLSTM with max pooling, and show that adding a hierarchy of BiLSTM and max pooling layers yields state of the art results for the SNLI sentence encoding-based models and the SciTail dataset, as well as provides strong results for the MultiNLI dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-4",
"text": "We also show that our sentence embeddings can be utilized in a wide variety of transfer learning tasks, outperforming InferSent on 7 out of 10 and SkipThought on 8 out of 9 SentEval sentence embedding evaluation tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-5",
"text": "Furthermore, our model beats the InferSent model in 8 out of 10 recently published SentEval probing tasks designed to evaluate sentence embeddings' ability to capture some of the important linguistic properties of sentences."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-8",
"text": "Neural networks have been shown to provide a powerful tool for building representations of natural language on multiple levels of abstraction."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-9",
"text": "Perhaps the most widely used representations in natural language processing are word embeddings (e.g. Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-10",
"text": "Recently there has been a growing interest in models for sentence-level representations using neural networks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-11",
"text": "Sentence embeddings are distributed representations of natural language sentences with the intention to encode the meaning of the sentences in a neural network representation."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-12",
"text": "Sentence embeddings have been generated using unsupervised learning approaches (e.g. Hill et al., 2016) , and supervised learning (e.g. Bowman et al., 2016; Conneau et al., 2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-13",
"text": "Sentence-level representations have shown promise in multiple different NLP tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-14",
"text": "One prominent example is natural language inference."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-15",
"text": "Natural language inference (NLI) is the task of determining the inferential relationship between two or more sentences."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-16",
"text": "That is, given two sentences, the premise p and the hypothesis h, the task is to determine whether h is entailed by p, whether the sentences are in contradiction with each other or whether they are neutral."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-17",
"text": "There are two main approaches to NLI utilizing neural networks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-18",
"text": "Some approaches focus on building sentence embeddings for the premises and the hypothesis separately and then combine those using a classifier (e.g. Bowman et al., 2015 Bowman et al., , 2016 Conneau et al., 2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-19",
"text": "Other approaches do not treat the two sentences separately but utilize e.g. crosssentence attention (Tay et al., 2018; Chen et al., 2017a) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-21",
"text": "Motivated by the success of the architecture of InferSent (Conneau et al., 2017) , we build a hierarchical architecture utilizing bidirectional LSTM (BiLSTM) layers and max pooling."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-22",
"text": "All in all, our model in on par with the state of the art for Stanford Natural Language Inference corpus (SNLI) (Bowman et al., 2015) sentence encoding-based models and improves the previous state of the art for SciTail (Khot et al., 2018) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-23",
"text": "We also achieve strong results for the Multi-Genre Natural Language Inference corpus (MultiNLI) (Williams et al., 2018) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-24",
"text": "We also test our model on a number of transfer learning tasks using the SentEval testing library (Conneau et al., 2017) , and show that our model outperforms the InferSent model on 7 out of 10 and SkipThought on 8 out of 9 tasks, comparing to the scores reported by Conneau et al. (2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-25",
"text": "Moreover, our model outperforms the InferSent model in 8 out of 10 recently published SentEval probing tasks designed to evaluate sentence embeddings' ability to capture some of the important linguistic properties of sentences (Conneau et al., 2018) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-26",
"text": "This highlights the generalization capability of our proposed model, confirming that the proposed architecture is able to produce sentence embeddings with strong performance across a wide variety of different NLP tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-28",
"text": "**RELATED WORK**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-29",
"text": "Sentence embeddings have been utilized in a wide variety of approaches to natural language inference."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-30",
"text": "Bowman et al. (2015 Bowman et al. ( , 2016 explore RNN and LSTM architectures, Mou et al. (2016) convolutional neural networks and Vendrov et al. (2015) GRUs, to name a few."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-31",
"text": "The basic idea behind these approaches is to encode the premise and hypothesis sentences separately and then combine those using a neural network classifier."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-32",
"text": "Conneau et al. (2017) explore multiple different sentence embedding architectures ranging from LSTM, BiLSTM and intra-attention to convolution neural networks and the performance of these architectures on NLI tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-33",
"text": "They show that out of these models BiLSTM with max pooling achieves the strongest results in NLI."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-34",
"text": "They also show that their model trained on NLI data achieves strong performance on various transfer learning tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-35",
"text": "Although sentence embedding approaches have shown their effectiveness in NLI, there are multiple studies showing that treating the hypothesis and premise sentences together and focusing on the relationship between those sentences yields better results (e.g. Tay et al., 2018; Chen et al., 2017a) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-36",
"text": "However, as these methods are focused on the inference relations rather than the internal semantics of the sentences, they cannot as straightforwardly be used outside of the NLI context and do not offer similar insights about the sentence level semantics, as sentence embeddings do."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-37",
"text": "By choosing a sentence embedding-based architecture we can more easily use the models in a wide variety of NLP tasks requiring sentence-level semantic information."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-39",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-40",
"text": "Our proposed architecture follows a sentence embedding-based approach for NLI introduced by Bowman et al. (2015) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-41",
"text": "The model illustrated in Figure 1 contains sentence embeddings for the two input sentences, where the output of the sentence embeddings are combined using a heuristic introduced by (Mou et al., 2015) , putting together the concatenation (u, v), absolute element-wise difference |u \u2212 v|, and element-wise product u * v. The combined vector is then passed on to a 3layered multi-layer perceptron (MLP) with a 3way softmax classifier."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-42",
"text": "The first two layers of the MLP both utilize dropout and a ReLU activation function."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-43",
"text": "We use a variant of ReLU called Leaky ReLU (Maas et al., 2013) , defined by:"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-44",
"text": "where we set y = 0.01 as the negative slope for x < 0."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-45",
"text": "This prevents the gradient from dying when x < 0."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-46",
"text": "For the sentence representations we first embed the individual words with pre-trained word embedding."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-47",
"text": "The sequence of the embedded words is then passed on to the sentence encoder which utilizes BiLSTM with max pooling."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-48",
"text": "Given a sequence T of words (w 1 . . . , w T ), the output of the bi-directional LSTM is a set of vec-"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-49",
"text": "of a forward and backward LSTMs"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-50",
"text": "The max pooling layer produces a vector of the same dimensionality as h t , returning, for each dimension, its maximum value over the hidden units (h 1 , . . . , h T )."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-51",
"text": "Motivated by the strong results of the BiLSTM max pooling network by Conneau et al. (2017) , we experimented with combining BiLSTM max pooling networks as a hierarchical structure."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-52",
"text": "1 To improve the BiLSTM layers' ability to remember the input words, we let each layer of the stack re-read the input sentence."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-53",
"text": "In our baseline model we stack three BiLSTM max pooling networks as a hierarchical structure, where each BiLSTM reads the input sentence as the input."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-54",
"text": "At each BiLSTM layer except the first one, we initialize the initial hidden state and the cell state with the final state of the previous layer."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-55",
"text": "We take the max value over each dimension of the hidden units for each BiLSTM layer."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-56",
"text": "The final output of the sentence embedding is the concatenation of each of these max pooling layers."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-57",
"text": "Our sentence embedding architecture is visualized in Figure 2 ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-59",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-61",
"text": "**DATA**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-62",
"text": "We evaluated the sentence embedding architecture with three different natural language inference datasets, including the Stanford Natural Language Inference (SNLI) corpus, the Multi-Genre Natural Language Inference (MultiNLI) corpus and the SciTail dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-63",
"text": "In all our experiments with the three datasets we used only the training data provided in the respective corpus."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-64",
"text": "For the transfer learning tasks, described in Section 7, we used training data from both the SNLI and the MultiNLI datasets in order to compare to the results by Conneau et al. (2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-66",
"text": "**SNLI**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-67",
"text": "The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) is a dataset of 570k human-written sentence pairs manually labeled with entailment, contradiction, and neutral."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-68",
"text": "The dataset is divided into training (550,152 pairs), development (10,000 pairs) and test sets (10,000 pairs)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-69",
"text": "The premise sentences in SNLI were sourced from image captions taken from the Flickr30k corpus (Young et al., 2014)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-71",
"text": "**MULTINLI**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-72",
"text": "The Multi-Genre Natural Language Inference (MultiNLI) corpus (Williams et al., 2018) is a broad-coverage corpus for natural language inference, consisting of 433k human-written sentence pairs labeled with entailment, contradiction and neutral."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-73",
"text": "Unlike the SNLI corpus, which draws the premise sentence from image captions, MultiNLI consists of sentence pairs from ten distinct genres of both written and spoken English."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-74",
"text": "The dataset is divided into training (392,702 pairs), development (20,000 pairs) and test sets (20,000 pairs)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-75",
"text": "All of the genres are included in the test and development sets, but only five are included in the training set."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-76",
"text": "The development and test datasets have been divided into matched and mismatched, where the former includes only sentences from the same genres as the training data, and the latter includes sentences from the remaining genres not present in the training data."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-77",
"text": "In addition to the training, development and test sets, MultiNLI provides a smaller annotation dataset, which contains approximately 1000 sentence pairs annotated with linguistic properties of the sentences and is split between the matched and mismatched datasets."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-78",
"text": "This annotation dataset provides a simple way to assess what kind of sentence pairs an NLI system is able to predict correctly and where it makes errors."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-80",
"text": "**SCITAIL**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-81",
"text": "The SciTail dataset (Khot et al., 2018) is an NLI dataset created from multiple-choice science exams consisting of 27k sentence pairs."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-82",
"text": "Each question and the correct answer choice have been converted into an assertive statement to form the hypothesis."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-83",
"text": "The dataset is divided into training (23,596 pairs), development (1,304 pairs) and test sets (2,126 pairs)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-84",
"text": "Unlike the SNLI and MultiNLI datasets, SciTail uses only two labels: entailment and neutral."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-86",
"text": "**TRAINING DETAILS**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-87",
"text": "The architecture was implemented using PyTorch."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-88",
"text": "We have published our code on GitHub: https://github.com/Helsinki-NLP/HBMP."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-89",
"text": "For all of our models we used a gradient descent optimization algorithm based on the Adam update rule (Kingma and Ba, 2014), as implemented in PyTorch."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-90",
"text": "We used a learning rate of 5e-4 for all our models."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-91",
"text": "The learning rate was decreased by a factor of 0.2 after each epoch if the model did not improve."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-92",
"text": "We used a batch size of 64."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-93",
"text": "The models were evaluated with the development data after each epoch and training was stopped if the development loss increased for more than 3 epochs."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-94",
"text": "The model with the highest development accuracy was selected for testing."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-95",
"text": "We use pre-trained 300-dimensional GloVe word embeddings (GloVe 840B 300D) (Pennington et al., 2014), which were fine-tuned during training."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-96",
"text": "The sentence embeddings have a hidden dimensionality of 600 per direction (except for the SentEval tests, where we test models with 600D and 1200D per direction), and the 3-layer multi-layer perceptron (MLP) has 600 dimensions per layer."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-97",
"text": "We use a dropout of 0.1 in the BiLSTM layers and between the MLP layers (except just before the final layer)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-98",
"text": "All our models were trained using one NVIDIA Tesla P100 GPU."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-100",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-101",
"text": "The proposed architecture provides strong results on all three datasets."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-102",
"text": "We achieve a new state of the art of 86.0% for SciTail."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-103",
"text": "On SNLI our results are on par with the current state of the art of 86.6%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-104",
"text": "On MultiNLI our model provides strong results on both the matched and mismatched test sets."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-106",
"text": "**SNLI**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-107",
"text": "For the SNLI corpus our model achieves a test accuracy of 86.6% after 4 epochs of training, which is on par with the previously published state of the art for sentence embedding approaches using BiLSTM and generalized pooling (Chen et al.)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-108",
"text": "However, our model requires fewer trainable parameters than the model of Chen et al., as shown in Table 2."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-109",
"text": "Although we did not achieve state of the art results for the MultiNLI dataset, we believe that a systematic study of different hierarchical BiLSTM max pooling structures could reveal an architecture providing the needed improvement."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-111",
"text": "**SCITAIL**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-112",
"text": "On the SciTail dataset we compared our model against non-sentence embedding-based models, as no previously published results are based only on sentence embeddings."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-113",
"text": "For this reason we also implemented the basic LSTM (using the last hidden state as the output) and BiLSTM with max pooling models and tested them on the SciTail dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-114",
"text": "For SciTail we obtain a test score of 86.0% after 4 epochs of training, an absolute improvement of 2.7 percentage points over the previous state of the art by Tay et al. (2018)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-115",
"text": "The results achieved by our proposed model are significantly higher than the previously published results."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-116",
"text": "It has been argued that the lexical similarity of the sentences in SciTail sentence pairs make it a particularly difficult dataset (Khot et al., 2018) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-117",
"text": "We hypothesize that our model is indeed better at identifying entailment relations beyond focusing on the lexical similarity or difference of the sentences."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-119",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-120",
"text": "To understand better what kind of inferential relationships our model is able to identify, we conducted an error analysis for the three datasets."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-121",
"text": "We report the results below."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-122",
"text": "We also conducted a linguistic error analysis and compared our results to the results obtained with the InferSent BiLSTM max pooling model of Conneau et al. (2017) (our implementation)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-123",
"text": "For more detailed error statistics, see the appendix."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-124",
"text": "The scores for our implementation of InferSent are on par with or slightly higher than the scores reported by Conneau et al. (2017) using their training setup."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-126",
"text": "**SNLI**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-127",
"text": "On the SNLI dataset our model makes the fewest errors on sentence pairs labeled with entailment, having an accuracy of 90.5%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-128",
"text": "The model is also effective in predicting contradictions, with an accuracy of 87.7%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-129",
"text": "Our model is less effective at correctly recognizing sentence pairs labeled with neutral, having an accuracy of 81.5%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-130",
"text": "Table 4 summarizes the prediction accuracies for each gold label and compares our results to the InferSent results."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-131",
"text": "We also tested our model against the recently published Breaking NLI test set for the SNLI (Glockner et al., 2018) , designed to test NLI models' ability to recognize inferences requiring lexical knowledge."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-132",
"text": "Our model outperforms InferSent on 7 out of 14 lexical categories and achieves an overall accuracy of 65.1%, compared to InferSent's score of 65.6%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-133",
"text": "A detailed description of our results for the Breaking NLI dataset is included in the appendix."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-136",
"text": "As the labeled test data is not openly available for MultiNLI, we analyzed the error statistics for this dataset based on the development data."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-137",
"text": "For the matched dataset (MultiNLI-m) our model had a development accuracy of 73.2%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-138",
"text": "On this dataset our model makes fewer errors on sentence pairs labeled as entailment, having an accuracy of 75.4%, compared to pairs labeled contradiction, where the accuracy was 73.1%, and neutral, where the accuracy was 70.8%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-139",
"text": "Table 5 lists the prediction accuracies for each gold label and compares the results to the InferSent results."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-140",
"text": "For the mismatched dataset (MultiNLI-mm) our model had a development accuracy of 74.2%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-141",
"text": "On this dataset our model makes significantly fewer errors on sentence pairs labeled as entailment, having an accuracy of 80.3%, compared to pairs labeled contradiction, where the accuracy was 72.7%, and neutral, where the accuracy was just 69.0%."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-142",
"text": "Table 6 summarizes the prediction accuracies for each gold label and compares the results to the InferSent results."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-143",
"text": "We also conducted additional linguistic error analysis using the annotation test set provided for MultiNLI."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-144",
"text": "The results are reported in the appendix and they are mostly inconclusive."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-146",
"text": "**TRANSFER LEARNING**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-147",
"text": "To better understand how well our model generalizes to different tasks, we conducted additional transfer learning tests using the SentEval sentence embedding evaluation library 5 (Conneau et al., 2017) and compared our results to the results published for InferSent and SkipThought ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-148",
"text": "For the transfer learning tasks, we trained our model on NLI data comprising the concatenation of the SNLI and MultiNLI training sets, 942,854 sentence pairs in total."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-149",
"text": "This allows us to compare our results to the InferSent results which were obtained using a model trained on the same data (Conneau et al., 2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-150",
"text": "Conneau et al. (2017) have shown that including all the training data from SNLI and MultiNLI significantly improves model performance on transfer learning tasks, compared to training the model only on SNLI data."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-151",
"text": "For training the model we used the same setup as described above in Section 4."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-152",
"text": "We used the SentEval sentence embedding evaluation library with the default settings recommended on the SentEval website: a logistic regression classifier, the Adam optimizer with a learning rate of 0.001, a batch size of 64 and an epoch size of 4."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-153",
"text": "Table 8 lists the transfer learning results for our models with 1200D and 2400D hidden dimensionality and compares our model to the InferSent and SkipThought scores reported by Conneau et al. (2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-154",
"text": "The downstream datasets included in the tests were MR movie reviews, CR product reviews, SUBJ subjectivity status, MPQA opinion polarity, SST binary sentiment analysis, TREC question-type classification, MRPC paraphrase detection, SICK-Relatedness (SICK-R) semantic textual similarity, SICK-Entailment (SICK-E) natural language inference and STS14 semantic textual similarity."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-155",
"text": "Our 2400D model outperforms the InferSent model on 7 out of 10 tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-156",
"text": "The model achieves a higher score on 8 out of 9 tasks reported for SkipThought, with an equal score on the SUBJ dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-157",
"text": "No MRPC results have been reported for SkipThought."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-158",
"text": "The SentEval test suite for evaluating sentence embeddings is available online at https://github.com/facebookresearch/SentEval."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-159",
"text": "Refer to the SentEval website for details about the different settings: https://github.com/facebookresearch/SentEval."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-161",
"text": "To study the linguistic properties of our proposed model in more detail, we ran the recently published SentEval probing tasks described in (Conneau et al., 2018)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-162",
"text": "Our 2400D HBMP model outperforms the InferSent BiLSTM max pooling model in 8 out of 10 probing tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-163",
"text": "These results further highlight our model's applicability for various NLP tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-164",
"text": "The results are listed in Table 9 ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-166",
"text": "**CONCLUSION**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-167",
"text": "In this paper we have introduced an architecture based on BiLSTM max pooling that achieves results on par with the current state of the art for SNLI sentence encoding-based models and a new state-of-the-art score for SciTail."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-168",
"text": "We furthermore tested our model using the SentEval sentence embedding evaluation library, showing that it achieves strong generalization capability, outperforming InferSent on 7 out of 10 downstream and 8 out of 10 probing tasks, and SkipThought on 8 out of 9 downstream tasks."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-169",
"text": "Our model also achieves a strong result on the Breaking NLI dataset, outperforming InferSent on 11 out of 14 lexical categories."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-170",
"text": "The success of the proposed hierarchical architecture raises a number of additional interesting questions."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-171",
"text": "First, it would be important to understand what kind of semantic information the different layers are able to capture."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-172",
"text": "Second, a detailed and systematic comparison of different hierarchical architecture configurations, combining BiLSTM and max pooling in different ways, could lead to even stronger results, as indicated by the results we obtained on the SciTail dataset with the modified 4-layered model."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-173",
"text": "Also, as the sentence embedding approaches for NLI focus mostly on the sentence encoder, we think that more should be done to study the classifier part of the overall NLI architecture."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-174",
"text": "There is not enough research on classifiers for NLI and we hypothesize that further improvements can be achieved by a systematic study of different classifier architectures, starting from the way the two sentence embeddings are combined before passing on to the classifier."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-175",
"text": "This is also something we intend to undertake as a next step."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-176",
"text": "Finally, there are a number of other NLP tasks, including neural machine translation (NMT), where HBMP models should be evaluated."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-177",
"text": "We plan to evaluate the performance of the HBMP encoder in other NLP tasks, such as encoder-decoder NMT models, in the future."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-179",
"text": "**A DETAILED ERROR STATISTICS FOR THE HBMP MODEL**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-180",
"text": "To see in more detail how our HBMP model is able to classify sentence pairs with different gold labels and what kind of errors it makes, we summarize error statistics as confusion matrices for the different datasets."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-181",
"text": "Figure 3 contains the confusion matrix for the SNLI dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-182",
"text": "Figure 4 contains the confusion matrix for the MultiNLI Matched dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-183",
"text": "Figure 5 contains the confusion matrix for the MultiNLI Mismatched dataset. The confusion matrices highlight the HBMP model's strong performance across all the labels."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-184",
"text": "However, the accuracy for neutral in the SNLI and MultiNLI datasets is clearly lower than for the other two labels."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-185",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-186",
"text": "**B LINGUISTIC ERROR ANALYSIS**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-187",
"text": "We also conducted a linguistic error analysis for the HBMP model using the MultiNLI annotation set."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-188",
"text": "We compare our model and the type of linguistic reasoning it is capable of to the InferSent BiLSTM max pooling model (Conneau et al., 2017), in order to see what benefits are achieved by adding a hierarchical BiLSTM max pooling structure on top of the basic BiLSTM max pooling architecture."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-189",
"text": "We provide a detailed comparison of the prediction accuracies of our HBMP model with InferSent with respect to the type of linguistic properties present in the sentence pairs."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-190",
"text": "Table 10 contains the comparison for MultiNLI-m dataset and Table 11 for MultiNLI-mm dataset."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-191",
"text": "The analysis shows that our HBMP model outperforms InferSent in some of the categories, but at this point the results are mostly inconclusive."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-193",
"text": "**C ADDITIONAL TESTS WITH THE BREAKING NLI DATASET**"
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-194",
"text": "We conducted additional testing of the proposed sentence embedding architecture using the Breaking NLI test set recently published by Glockner et al. (2018) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-195",
"text": "The test set is designed to highlight the lack of lexical reasoning capability of NLI systems."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-196",
"text": "We trained our HBMP model and the InferSent model using the SNLI training data."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-197",
"text": "We compare our results with the results published by Glockner et al. (2018) and to results obtained with InferSent sentence encoder (our implementation)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-198",
"text": "Our HBMP model outperforms the InferSent model in 7 out of 14 categories, receiving an overall score of 65.1% (InferSent: 65.6%)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-199",
"text": "Our model also compares well against the other models, outperforming the Decomposable Attention model (51.90%) (Parikh et al., 2016) and Residual Encoders (62.20%) (Nie and Bansal, 2017b) in the overall score."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-200",
"text": "As these models are not based purely on sentence embeddings, the obtained result highlights that sentence embedding approaches can be competitive when handling inferences requiring lexical information."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-201",
"text": "Our model is still outperformed by ESIM (Chen et al., 2017a) and by KIM, an ESIM model incorporating external knowledge (Chen et al., 2018)."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-202",
"text": "The results of the comparison are summarized in Glockner et al. (2018) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-203",
"text": "InferSent results obtained with our implementation using the architecture and training set-up described in (Conneau et al., 2017) ."
},
{
"sent_id": "7c8f54479ce1f9d81b49839425f58e-C001-204",
"text": "Scores highlighted in bold are the top scores when comparing the InferSent and our HBMP models."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-12"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-18"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-32"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-150"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-12",
"7c8f54479ce1f9d81b49839425f58e-C001-18",
"7c8f54479ce1f9d81b49839425f58e-C001-32",
"7c8f54479ce1f9d81b49839425f58e-C001-150"
]
},
"@MOT@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-21"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-51"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-64"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-21",
"7c8f54479ce1f9d81b49839425f58e-C001-51",
"7c8f54479ce1f9d81b49839425f58e-C001-64"
]
},
"@USE@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-24"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-122"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-147"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-203"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-24",
"7c8f54479ce1f9d81b49839425f58e-C001-122",
"7c8f54479ce1f9d81b49839425f58e-C001-147",
"7c8f54479ce1f9d81b49839425f58e-C001-203"
]
},
"@DIF@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-24"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-24"
]
},
"@EXT@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-51"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-51"
]
},
"@SIM@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-124"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-149"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-124",
"7c8f54479ce1f9d81b49839425f58e-C001-149"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"7c8f54479ce1f9d81b49839425f58e-C001-153"
],
[
"7c8f54479ce1f9d81b49839425f58e-C001-188"
]
],
"cite_sentences": [
"7c8f54479ce1f9d81b49839425f58e-C001-153",
"7c8f54479ce1f9d81b49839425f58e-C001-188"
]
}
}
},
"ABC_57ef27eefdf272bead22212863a8a8_4": {
"x": [
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-2",
"text": "We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-3",
"text": "In contrast to the standard approach of simple ranking by association measure, in our model n-grams are arranged in a lattice structure based on subsumption and overlap relationships, with nodes inhibiting other nodes in their vicinity when they are selected as a lexical item."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-4",
"text": "We show how the configuration of such a lattice can be optimized tractably, and demonstrate using annotations of sampled n-grams that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-7",
"text": "Despite over 25 years of research in computational linguistics aimed at acquiring multiword lexicons using corpora statistics, and growing evidence that speakers process language primarily in terms of memorized sequences (Wray, 2008) , the individual word nonetheless stubbornly remains the de facto standard processing unit for most research in modern NLP."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-8",
"text": "The potential of multiword knowledge to improve both the automatic processing of language as well as offer new understanding of human acquisition and usage of language is the primary motivator of this work."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-9",
"text": "Here, we present an effective, expandable, and tractable new approach to comprehensive multiword lexicon acquisition."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-10",
"text": "Our aim is to find a middle ground between standard MWE acquisition approaches based on association measures (Ramisch, 2014) and more sophisticated statistical models (Newman et al., 2012 ) that do not scale to large corpora, the main source of the distributional information in modern NLP systems."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-11",
"text": "A central challenge in building comprehensive multiword lexicons is paring down the huge space of possibilities without imposing restrictions which disregard a major portion of the multiword vocabulary of a language: allowing for diversity creates significant redundancy among statistically promising candidates."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-12",
"text": "The lattice model proposed here addresses this primarily by having the candidates (contiguous and non-contiguous n-gram types) compete with each other, based on subsumption and overlap relations, to be selected as the best (i.e., most parsimonious) explanation for statistical irregularities."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-13",
"text": "We test this approach across four large corpora in three languages, including two relatively free-word-order languages (Croatian and Japanese), and find that this approach consistently outperforms alternatives, offering scalability and many avenues for future enhancement."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-14",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-15",
"text": "**BACKGROUND AND RELATED WORK**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-16",
"text": "In this paper we will refer to the targets of our lexicon creation efforts as formulaic sequences, following the terminology of Wray (2002; 2008), wherein a formulaic sequence (FS) is defined as \"a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.\" That is, an FS shows signs of being part of a mental lexicon."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-17",
"text": "As noted by Wray (2008), formulaic sequence theory is compatible with other highly multiword, lexicalized approaches to language structure, in particular Pattern Grammar (Hunston and Francis, ) and Construction Grammar (Goldberg, 1995); an important distinction, though, is that these sorts of theories often posit entirely abstract grammatical constructions/patterns/frames which do not fit well into the FS framework."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-18",
"text": "Nevertheless, since many such constructions are composed of sequences of specific words, the FS inventory of a language includes many flexible constructions (e.g., ask * for) along with entirely fixed combinations (e.g., rhetorical question) not typically of interest to grammarians."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-19",
"text": "Note that the FS framework allows for individual morphemes to be part of a formulaic sequence, but for practical reasons we focus primarily on lemmatized words as the unit out of which FS are built."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-20",
"text": "In computational linguistics, the most common term used to describe multiword lexical units is multiword expression (\"MWE\": Sag et al. (2002) , Baldwin and Kim (2010) ), but here we wish to make a principled distinction between at least somewhat non-compositional, strongly lexicalized MWEs and FS, a near superset which includes many MWEs but also compositional linguistic formulas."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-21",
"text": "This distinction is not a new one; it exists, for example, in the original paper of Sag et al. (2002) in the distinction between lexicalized and institutionalized phrases, and also to some extent in the MWE annotation of Schneider et al. (2014b), who distinguish between weak (collocational) and strong (non-compositional) MWEs."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-22",
"text": "It is our contention, however, that separate, precise terminology is useful for research targeted at either class: we need not strain the concept of MWE to include items which do not require special semantics, nor are we inclined to disregard the larger formulaicity of language simply because it is not the dominant focus of MWE research. (Footnote 1: Though by this definition individuals or small groups may have their own FS, here we are only interested in FS that are shared by a recognizable language community.)"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-23",
"text": "(Footnote 2) Here we avoid the term collocation entirely due to confusion with respect to its interpretation."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-24",
"text": "Though some define it similarly to our definition of FS, it can be applied to any words that show a statistical tendency to appear in the vicinity of one another for any reason: for instance, the pair of words doctor/nurse might be considered a collocation (Ramisch, 2014)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-26",
"text": "Many MWE researchers might defensibly balk at including in their MWE lexicons and corpus annotations (English) FS such as there is something going on, it is more important than ever to ..., ... do not know what it is like to ..., there is no shortage of ..., the rise and fall of ..., now is not the time to ..., etc. as well as tens of thousands of other such phrases which, along with less compositional MWEs like be worth ...'s weight in gold, fall under the FS umbrella."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-27",
"text": "Another reason to introduce a different terminology is that there are classes of phrases which are typically considered MWEs that do not fit well into an FS framework, for instance novel compound nouns whose semantics are accessible by analogy (e.g., glass limb, analogous to wooden leg)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-28",
"text": "We also exclude from the definition of both FS and MWE those named entities which refer to people or places which are little-known and/or whose surface form appears derived (e.g., Mrs. Barbara W. Smith or Smith Garden Supplies Ltd)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-29",
"text": "Figure 1 shows the conception of the relationship between FS, (multiword) constructions, MWE, and (multiword) named entities that we assume for this paper."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-30",
"text": "From a practical perspective, the starting point for multiword lexicon creation has typically been lexical association measures (Church and Hanks, 1990; Dunning, 1993; Schone and Jurafsky, 2001; Evert, 2004; Pecina, 2010; Araujo et al., 2011; Kulkarni and Finlayson, 2011; Ramisch, 2014) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-31",
"text": "When these methods are used to build a lexicon, particular binary syntactic patterns are typically chosen."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-32",
"text": "Only some of these measures generalize tractably beyond two words, for example PMI (Church and Hanks, 1990) , i.e., the log ratio of the joint probability to the product of the marginal probabilities of the individual words."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-33",
"text": "Another measure which addresses sequences of longer than two words is the C-value (Frantzi et al., 2000), which weights term frequency by the log length of the n-gram while penalizing n-grams that appear in frequent larger ones."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-34",
"text": "Mutual expectation (Dias et al., 1999) involves deriving a normalized statistic that reflects the extent to which a phrase resists the omission of any constituent word."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-35",
"text": "Similarly, the lexical predictability ratio (LPR) of Brooke et al. (2015) is an association measure applicable to any possible syntactic pattern, which is calculated by discounting syntactic predictability from the overall conditional probability for each word given the other words in the phrase."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-36",
"text": "Though most association measures involve only usage statistics of the phrase and its subparts, the DRUID measure (Riedl and Biemann, 2015) is an exception which uses distributional semantics around the phrase to identify how easily an n-gram could be replaced by a single word."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-37",
"text": "Typically multiword lexicons are created by ranking n-grams according to an association measure and applying a threshold."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-38",
"text": "The algorithm of da Silva and Lopes (1999) is somewhat more sophisticated, in that it identifies the local maxima of association measures across subsuming n-grams within a sentence to identify MWEs of unrestricted length and syntactic composition; its effectiveness beyond noun phrases, however, seems relatively limited (Ramisch et al., 2012) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-39",
"text": "Brooke et al. (2014; 2015) developed a heuristic method intended for general FS extraction in larger corpora, first using conditional probabilities to do an initial (single pass) coarse-grained segmentation of the corpus, followed by a pass through the resulting vocabulary, breaking larger units into smaller ones based on a tradeoff between marginal and conditional statistics."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-40",
"text": "The work of Newman et al. (2012) is an example of an unsupervised approach which does not use association measures: it extends the Bayesian word segmentation approach of Goldwater et al. (2009) to multiword tokenization, applying a generative Dirichlet Process model which jointly constructs a segmentation of the corpus and a corresponding multiword vocabulary."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-41",
"text": "Other research in MWEs has tended to be rather focused on particular syntactic patterns such as verb-noun combinations (Fazly et al., 2009)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-42",
"text": "The system of Schneider et al. (2014a) distinguishes a full range of MWE sequences in the English Web Treebank, including gapped expressions, using a supervised sequence tagging model."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-43",
"text": "Though, in theory, automatic lexical resources could be a useful addition to the Schneider et al. model, which uses only manual lexical resources, attempts to do so have achieved mixed success (Riedl and Biemann, 2016) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-44",
"text": "The motivations for building lexicons of FS naturally overlap with those for MWE: models of distributional semantics, in particular, can benefit from sensitivity to multiword units (Cohen and Widdows, 2009) , as can parsing (Constant and Nivre, 2016) and topic models (Lau et al., 2013) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-45",
"text": "One major motivation for looking beyond MWEs is the ability to carry out broader linguistic analyses."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-46",
"text": "Within corpus linguistics, multiword sequences have been studied in the form of lexical bundles (Biber et al., 2004) , which are simply n-grams that occur above a certain frequency threshold."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-47",
"text": "Like FS, lexical bundles generally involve larger phrasal chunks that would be missed by traditional MWE extraction, and so research in this area has tended to focus on how particular formulaic phrases (e.g., if you look at) are indicative of particular genres (e.g., university lectures)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-48",
"text": "Lexical bundles have been applied, in particular, to learner language: for example, Chen and Baker (2010) show that non-native student writers use a severely restricted range of lexical bundle types, and tend to overuse those types, while Granger and Bestgen (2014) investigate the role of proficiency, demonstrating that intermediate learners underuse lower-frequency bigrams and overuse high-frequency bigrams relative to advanced learners."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-49",
"text": "Sakaguchi et al. (2016) demonstrate that improving fluency (closely linked to the use of linguistic formulas) is more important than improving strict grammaticality with respect to native speaker judgments of non-native productions; Brooke et al. (2015) explicitly argue for FS lexicons as a way to identify, track, and improve learner proficiency."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-51",
"text": "**METHOD**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-52",
"text": "Our approach to FS identification involves optimization of the total explanatory power of a lattice, where each node corresponds to an n-gram type."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-53",
"text": "The explanatory power of the whole lattice is defined simply as a product of the explainedness of the individual nodes."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-54",
"text": "Each node can be considered either \"on\" (is an FS) or \"off\" (is not an FS)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-55",
"text": "The basis of the calculation of explainedness is the syntax-sensitive LPR association measure of Brooke et al. (2015) , but it is calculated differently depending on the on/off status of the node as well as the status of the nodes in its vicinity."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-56",
"text": "Nodes are linked based on n-gram subsumption and corpus overlap relationships (see Figure 2), with \"on\" nodes typically explaining other nodes."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-57",
"text": "Given these relationships, we iterate over the nodes and greedily optimize the on/off choice relative to explainedness in the local neighborhood of each node, until convergence."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-59",
"text": "**COLLECTING STATISTICS**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-60",
"text": "The first step in the process is to derive a set of n-grams and related statistics from a large, unlabeled corpus of text."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-61",
"text": "Since our primary association measure is an adaptation of LPR, our approach in this section mostly follows Brooke et al. (2015) up until the last stage."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-62",
"text": "An initial requirement of any such method is an n-gram frequency threshold, which we set to 1 instance per 10 million words, following Brooke et al. (2015) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-63",
"text": "We include gapped or non-contiguous n-grams in our analysis, in acknowledgment of the fact that many languages have MWEs where the components can be \"separated\", including verb particle constructions in English (Deh\u00e9, 2002), and noun-verb idioms in Japanese."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-64",
"text": "Having said this, there are generally strong syntactic and length restrictions on what can constitute a gap (Wasow, 2002) , which we capture in the form of a language-specific POS-based regular expression (see Section 4 for details)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-65",
"text": "This greatly lowers the number of potentially gapped n-gram types, increasing precision and efficiency for negligible loss of recall."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-66",
"text": "We also exclude punctuation and lemmatize the corpus, and enforce an n-gram count threshold."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-67",
"text": "As long as the count threshold is substantially above 1, efficient extraction of all n-grams can be done iteratively: in iteration i, i-grams are filtered by the frequency threshold, and then pairs of instances of these i-grams with (i \u2212 1) words of overlap are found, which derives a set of (i + 1)-grams which necessarily includes all those over the frequency threshold."
},
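The iterative extraction procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the names (`extract_ngrams`, `max_n`, `min_count`) are ours, it handles contiguous n-grams only, and it omits the gapped n-grams and the POS-based gap filter used in the actual system.

```python
from collections import Counter

def extract_ngrams(tokens, max_n=6, min_count=5):
    """Iteratively extract contiguous n-grams above a count threshold.

    At step i, only (i+1)-grams whose two overlapping i-gram parts both
    passed the threshold are counted; since any (i+1)-gram above the
    threshold has both parts above it too, no qualifying n-gram is missed.
    """
    counts1 = Counter((w,) for w in tokens)
    prev = {g: c for g, c in counts1.items() if c >= min_count}
    surviving = dict(prev)
    for n in range(2, max_n + 1):
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Both overlapping (n-1)-grams must already have survived.
            if gram[:-1] in prev and gram[1:] in prev:
                counts[gram] += 1
        prev = {g: c for g, c in counts.items() if c >= min_count}
        if not prev:
            break
        surviving.update(prev)
    return surviving
```

With a corpus-scale threshold such as the paper's 1 instance per 10 million words, the candidate set shrinks rapidly with n, which is what keeps the iteration tractable.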
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-68",
"text": "Once a set of relevant n-grams is identified and counted, other statistics required to calculate the Lexical Predictability Ratio (\"LPR\") for each word in the n-gram are collected."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-69",
"text": "LPR is a measure of how predictable a word is in a lexical context, as compared to how predictable it is given only syntactic context (over the same span of words)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-70",
"text": "Formally, the LPR for word w_i in the context of a word sequence w_1, ..., w_i, ..., w_n with POS tag sequence t_1, ..., t_n is given by:"
},
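The LPR equation itself appears to have been lost in extraction. A plausible reconstruction, consistent with the surrounding description (predictability given the lexical context, relative to predictability given only the POS context over the same span, taking the best ratio over all context spans containing w_i), is:

```latex
\mathrm{LPR}(w_i) \;=\; \max_{j \le i \le k} \; \frac{P(w_i \mid w_{j,k})}{P(w_i \mid t_{j,k})}
```

With the empty span j = k = i, both conditional probabilities reduce to the marginal P(w_i), making that ratio trivially 1 and giving the stated lower bound of 1. This is a reconstruction from the prose, not necessarily the authors' exact formulation.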
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-71",
"text": "where w_{j,k} denotes the word sequence w_j, ..., w_{i-1}, w_{i+1}, ..., w_k excluding w_i (similarly for t_{j,k})."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-72",
"text": "Note that the lower bound of LPR is 1, since the ratio for a word with no context is trivially 1."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-73",
"text": "We use the same equation for gapped n-grams, with the caveat that quantities involving sequences which include the location where the gap occurs are derived from special gapped n-gram statistics."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-74",
"text": "Note that the identification of the best ratio across all possible choices of context, not just the largest context, is important for longer FS, where the entire POS context alone might uniquely identify the phrase, resulting in the minimum LPR of 1 even for entirely formulaic sequences, an undesirable result."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-75",
"text": "In the segmentation approach of Brooke et al. (2015) , LPR for an entire span is calculated as a product of the individual LPRs, but here we will use the minimum LPR across the words in the sequence:"
},
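The equation referred to here is missing from this extraction, but it is a direct restatement of the preceding sentence: for an n-gram x = w_1, ..., w_n,

```latex
\mathrm{minLPR}(x) \;=\; \min_{1 \le i \le n} \mathrm{LPR}(w_i)
```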
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-76",
"text": "Here, minLPR for a particular n-gram does not reflect the overall degree to which it holds together, but rather focuses on the word which is its weakest link."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-77",
"text": "For example, in the case of be keep * under wraps (Figure 2 ), a general statistical metric might assign it a high score due to the strong association between keep and under or under and wraps, but minLPR is focused on the weaker relationship between be and the rest of the phrase."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-78",
"text": "This makes it particularly suited to use in a lattice model of competing n-grams, where the choice of be keep * under wraps versus keep * under wraps should be based exactly on the extent to which be is an essential part of the phrase; the other affinities are, in effect, irrelevant, because they occur in the smaller n-gram as well."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-80",
"text": "**NODE INTERACTIONS**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-81",
"text": "The n-gram nodes in the lattice are directionally connected to nodes consisting of (n + 1)-grams which subsume them and (n \u2212 1)-grams which they subsume."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-82",
"text": "For example, as detailed in Figure 2 , the (gapped) n-gram keep * under wraps would be connected \"upwards\" to the node keep everything under wraps and connected \"downwards\" to under wraps."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-83",
"text": "These directional relationships allow for two basic interactions between nodes in the lattice when a node is turned on: covering, which inhibits nodes below (subsumed by) a turned-on node (e.g., if keep * under wraps is on, the model will tend not to choose under wraps as an FS); and clearing, which inhibits nodes above a turned-on node (e.g., if keep * under wraps is on, the model would avoid selecting keep everything under wraps as an FS)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-84",
"text": "A third, undirected mechanism is overlapping, where nodes inhibit each other due to overlaps in the corpus (e.g., having both keep * under wraps and be keep * under as FS will be avoided)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-86",
"text": "**COVERING**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-87",
"text": "The most important node interaction is covering, which corresponds to discounting or entirely excluding a node due to a node higher in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-88",
"text": "Our model includes two types of covering: hard and soft."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-89",
"text": "Hard covering is based on the idea that, due to very similar counts, we can reasonably conclude that the presence of an n-gram in our statistics is a direct result of a subsuming (n+i)-gram."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-90",
"text": "In Figure 2 , e.g., if we have 143 counts of keep * under wraps and 152 counts of under wraps, the presence of keep * under wraps almost completely explains under wraps, and we should consider these two n-grams as one."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-91",
"text": "We do this by permanently disabling any hard covered node, and setting the minLPR of the covering node to the maximum minLPR among all the nodes it covers (including itself); this means that longer n-grams with function words (which often have lower minLPR) can benefit from the strong statistical relationships between open-class lexical features in n-grams that they cover."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-92",
"text": "This is done as a preprocessing step, and greatly improves the tractability of the iterative optimization of the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-93",
"text": "Of course, a threshold for hard covering must be chosen: during development we found that a ratio of 2/3 (corresponding to a significant majority of the counts of a lower node corresponding to the higher node) worked well."
},
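The hard-covering preprocessing step can be sketched as below. This is an illustrative reconstruction under stated assumptions: the data layout (`nodes`, `parents`) and function name are ours, and the real system also applies the pronoun rule and handles covering chains across multiple levels.

```python
def hard_cover(nodes, parents, ratio=2.0 / 3.0):
    """Hard-covering preprocessing (sketch; names are illustrative).

    nodes:   dict mapping n-gram tuple -> {"count": int, "minlpr": float}
    parents: dict mapping n-gram tuple -> list of subsuming n-grams
    A node is permanently disabled when a subsuming node accounts for at
    least `ratio` (the paper's 2/3) of its count; the covering node then
    inherits the best minLPR among the nodes it covers.
    """
    disabled = set()
    # Visit shorter n-grams first so minLPR values can propagate upward.
    for gram in sorted(nodes, key=len):
        for parent in parents.get(gram, []):
            if nodes[parent]["count"] >= ratio * nodes[gram]["count"]:
                disabled.add(gram)
                # Covering node takes the max minLPR of what it covers.
                nodes[parent]["minlpr"] = max(nodes[parent]["minlpr"],
                                              nodes[gram]["minlpr"])
                break
    return disabled
```

On the Figure 2 example, 143 counts of keep * under wraps against 152 of under wraps exceeds the 2/3 ratio, so under wraps would be disabled and its minLPR passed up.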
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-94",
"text": "We also use the concept of hard covering to address the issue of pronouns, based on the observation that specific pronouns often have high LPR values due to pragmatic biases (Brooke et al., 2015) ; for instance, private state verbs like feel tend to have first person singular subjects."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-95",
"text": "In the lattice, n-grams with pronouns are considered covered (inactive) unless they cover at least one other node which does not have a pronoun, which allows us to limit FS with pronouns without excluding them entirely: they are included only in cases where they are definitively formulaic."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-96",
"text": "Soft covering is used in cases when a single n-gram does not entirely account for another, but a turned-on n-gram may nevertheless explain some of the statistical irregularity of a node lower in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-261",
"text": "Our initial investigations suggest, however, that it may be difficult to apply this idea without merely amplifying existing undesirable biases in the LPR measure."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-262",
"text": "Bringing in other information such as simple distributional statistics might help the model identify non-compositional semantics, and could, in combination with the existing lattice competition, focus the model on MWEs which could provide a reliable basis for generalization."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-263",
"text": "For all four corpora, the lattice optimization algorithm converged within 10 iterations."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-264",
"text": "Although the optimization of the lattice is several orders of magnitude more complex than the decomposition heuristics of Brooke et al. (2015) , the time needed to build and optimize the lattice is a fraction of the time required to collect the statistics for LPR calculation, and so the end-to-end runtimes of the two methods are comparable."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-265",
"text": "In the BNC, the full lattice method was much faster than LocalMaxs and DP-Seg, though direct runtime comparisons to these methods are of modest value due to differences in both scope and implementation."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-266",
"text": "Finally, though the model was designed specifically for FS extraction, we note that it could be useful for related tasks such as unsupervised learning of morphological lexicons, particularly for agglutinative languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-267",
"text": "Character or phoneme n-grams could compete in an identically structured lattice to be chosen as the best morphemes for the language, with LPR adapted to use phonological predictability (i.e., based on vowel/consonant \"tags\") instead of syntactic predictability."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-268",
"text": "It is likely, though, that further algorithmic modifications would be necessary to target morphological phenomena well, which we leave for future work."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-269",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-270",
"text": "**CONCLUSION**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-271",
"text": "We have presented here a new methodology for acquiring comprehensive multiword lexicons from large corpora, using competition in an n-gram lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-272",
"text": "Our evaluation using annotations of sampled n-grams shows that it consistently outperforms alternatives across several corpora and languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-273",
"text": "A tool which implements the method, as well as the acquired lexicons, annotation guidelines, and test sets, has been made available."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-97",
"text": "For instance, in Figure 2 keep * under is not hard-covered by keep * under wraps (since there are FS such as keep * under surveillance and keep it under your hat), but if keep * under wraps is tagged as an FS, we nevertheless want to discount the portion of the keep * under counts that correspond to occurrences of keep * under wraps, with the idea that these occurrences have already been explained by the longer n-gram."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-98",
"text": "If enough subsuming n-grams are on, then the shorter n-gram will be discounted to the extent that it will be turned off, preventing redundancy."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-99",
"text": "This effect is accomplished by increasing the turned-off explainedness of keep * under (and thus making turning it on less desirable) in the following manner: let c(\u00b7) be the count function, y_i the current FS status for node x_i (0 if off, 1 if on) and ab(x) a function which produces the set of indices of all nodes above node x in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-100",
"text": "Then, the cover(x_t) score for a covered node t is:"
},
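The cover(x_t) equation is missing from this extraction. One form consistent with the surrounding description (discounting the covered node's count by the counts of turned-on covering nodes, yielding a value in [0, 1] that defaults to 1 when nothing covers the node) would be:

```latex
\mathrm{cover}(x_t) \;=\; \max\!\left(0,\; \frac{c(x_t) \;-\; \sum_{i \in ab(x_t)} y_i \, c(x_i)}{c(x_t)}\right)
```

This is a reconstruction from the prose, not necessarily the authors' exact formula; whatever its precise form, it is applied as an exponent on minLPR, as the next sentence explains.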
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-101",
"text": "When applied as an exponent to a minLPR score, it serves as a simple, quick-to-calculate approximation of a new minLPR with the counts corresponding to the covering nodes removed from the calculation."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-102",
"text": "The cover score takes on values in the range 0 to 1, with 1 being the default when no covering occurs."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-104",
"text": "**CLEARING**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-105",
"text": "In general, covering prefers turning on longer, covering n-grams since doing so explains nodes lower in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-106",
"text": "Not surprisingly, it is generally desirable to have a mechanism working in opposition, i.e., one which views shorter FS as helping to explain the presence of longer n-grams which contain them, beyond the FS-neutral syntactic explanation provided by minLPR."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-107",
"text": "Clearing does this by increasing the explainedness of nodes higher in the lattice when a lower node is turned-on."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-108",
"text": "The basic mechanism is similar to covering, except that counts cannot be made use of in the same way: whereas it makes sense to explain covered nodes in proportion to the counts of their covering nodes (since the counts of the covered n-grams can be directly attributed to the covering n-gram), in the reverse direction this logic fails."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-109",
"text": "A simple but effective solution which avoids extra hyperparameters is to make use of the minLPR values of the relevant nodes."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-110",
"text": "In the most common two-node situation, we increase the explainedness of the cleared node based on the ratio of the minLPR of two nodes, though only if the minLPR of the lower node is higher."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-111",
"text": "Generalized to the (rare) case of multiple clearing nodes, we define clear(x_t) as:"
},
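The clear(x_t) equation is missing from this extraction. A form that matches the worked example (a turned-on lower node with minLPR 8 clearing a node with minLPR 4 gives 0.5), the stated [0, 1] range, and the property that low-LPR nodes cannot clear higher-LPR nodes would be:

```latex
\mathrm{clear}(x_t) \;=\; \min\!\left(1,\; \frac{\mathrm{minLPR}(x_t)}{\displaystyle\max_{i \in bl(x_t),\; y_i = 1} \mathrm{minLPR}(x_i)}\right)
```

with clear(x_t) defaulting to 1 when no node below x_t is turned on. This is a reconstruction from the prose and example, not necessarily the authors' exact formula.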
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-112",
"text": "where bl(x_t) produces a set of indices of nodes below x_t in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-113",
"text": "We refer to this mechanism as \"clearing\" because it tends to clear away a variety of trivial uses of common FS which may have higher LPR due to the lexical and syntactic specificity of the FS."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-114",
"text": "For instance, in Figure 2, if the node keep * under wraps is turned on and has a minLPR of 8, then, if the minLPR of a node such as keep * under wraps for is 4, clear(x_t) will be 0.5."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-115",
"text": "Like cover, clear takes on values in the range 0 to 1, with 1 being the default when no clearing occurs."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-116",
"text": "Note that one major advantage of this particular formulation of clearing is that low-LPR nodes will be unable to clear higher-LPR nodes above them in the lattice; otherwise, bad FS like of the might be selected purely to increase the explainedness of the many n-grams they appear in."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-118",
"text": "**OVERLAP**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-119",
"text": "The third mechanism of node interaction involves n-grams which overlap in the corpus."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-120",
"text": "In general, independent FS do not consistently overlap."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-121",
"text": "For example, given that be keep * under and keep * under wraps often appear together (overlapping on the tokens keep * under), we do not want both to be selected as FS, even in the case that both have high minLPR."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-122",
"text": "To address this problem, rather than increasing the explainedness of turned-off nodes, we decrease the explainedness of the overlapping turned-on nodes: a penalty rather than an incentive, one which expresses the model's confusion at having overlapping FS."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-123",
"text": "For non-subsuming nodes x_i and x_j, let oc(x_i, x_j) be the count of instances of x_i which contain at least one non-gap token of a corresponding instance of x_j."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-124",
"text": "For subsuming nodes, though, overlap is treated asymmetrically, with oc(x_i, x_j) equal to c(x_j) (the lower count) if j \u2208 ab(x_i), but zero if j \u2208 bl(x_i)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-125",
"text": "Given this definition of oc, we define overlap(x_t) as:"
},
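The overlap(x_t) definition is missing from this extraction. One hyperbolic form consistent with the stated behavior (equal to 1 with no overlap, diverging as turned-on overlaps approach the node's full count) would be:

```latex
\mathrm{overlap}(x_t) \;=\; \frac{c(x_t)}{c(x_t) \;-\; \sum_{j \ne t} y_j \, oc(x_t, x_j)}
```

This is a reconstruction from the description of its range and effect, not necessarily the authors' exact formula.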
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-126",
"text": "Overlap takes on values in the range 1 to +\u221e, also defaulting to 1 when no overlaps exist."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-127",
"text": "The effect of overlap is hyperbolic: small amounts of overlap have little effect, but nodes with significant overlap will effectively be forced to turn off."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-129",
"text": "**EXPLAINEDNESS**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-130",
"text": "The objective function maximized by the model is then the explainedness (expl) across all the nodes of the lattice X, x_1, ..., x_N, which can be defined in terms of minLPR, the node interaction functions, and the FS status y_i of each node in the lattice:"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-131",
"text": "When a node is off, its explainedness is the inverse of its minLPR, except if there are covering or clearing nodes which explain it by pushing the exponent of minLPR towards zero."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-132",
"text": "When the node is on, its explainedness is the inverse of a fixed cost hyperparameter C, though this cost is increased if it overlaps with other active nodes."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-133",
"text": "All else being equal, when minLPR(t) > C, a node will be selected as an FS, and so, independent of the node interactions, C can be viewed as the threshold for the minLPR association measure under a traditional approach to MWE identification."
},
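Since the expl equation is likewise missing from this extraction, here is a minimal sketch of the behavior described above, with simplified stand-ins for the cover/clear/overlap interaction terms (the `explained` and `overlap` arguments below are our abstractions of those functions, not the paper's exact equation). It does reproduce the stated threshold property: absent interactions, turning a node on beats leaving it off exactly when minLPR > C.

```python
def node_explainedness(min_lpr, on, C=4.0, explained=0.0, overlap=1.0):
    """Sketch of per-node explainedness (not the paper's exact equation).

    min_lpr:   the node's minLPR association score (>= 1).
    on:        whether the node is currently selected as an FS.
    C:         fixed cost hyperparameter for active nodes.
    explained: in [0, 1]; how far covering/clearing neighbors push the
               minLPR exponent of an off node toward zero.
    overlap:   >= 1; penalty from overlapping active nodes.
    """
    if on:
        # Inverse of the fixed cost, inflated by any overlap penalty.
        return 1.0 / (C * overlap)
    # Inverse of minLPR, with the exponent shrunk toward zero
    # when neighbors explain this node.
    return min_lpr ** -(1.0 - explained)

def better_on(min_lpr, C=4.0):
    """All else equal, a node should be an FS iff minLPR > C."""
    return node_explainedness(min_lpr, True, C) > node_explainedness(min_lpr, False, C)
```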
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-135",
"text": "**OPTIMIZATION**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-136",
"text": "The dependence of the explainedness of nodes on their neighbors effectively prohibits a global optimization of the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-137",
"text": "Fortunately, though most of the nodes in the lattice are part of a single connected graph, most of the effects of nodes on each other are relatively local, and effective local optimizations can be made tractable by applying some simple restrictions."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-138",
"text": "The main optimization loop consists of iterations over the lattice until complete convergence (no changes in the final iteration)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-139",
"text": "For each iteration over the main loop, each potentially active node is examined in order to evaluate whether its current status is optimal given the current state of the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-140",
"text": "The order that we perform this has an effect on the result: among the obvious options (LPR, ngram length), in development good results were obtained through ordering nodes by frequency, which gives an implicit advantage to relatively common ngrams."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-141",
"text": "Given the relationships between nodes, it is obviously not sufficient to consider switching only the present node."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-142",
"text": "If, for instance, one or more of be keep * under wraps, under wraps, or be keep * under has been turned on, the covering, clearing, or overlapping effects of these other nodes will likely prevent Algorithm 1 Optimization algorithm."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-143",
"text": "X is an ordered list of the nodes in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-144",
"text": "Nodes (designated by x) contain pointers to the nodes immediately linked to them in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-145",
"text": "States (designated by Y ) indicate whether each node is ON or OFF."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-146",
"text": "Explainedness values are indicated by e. rev = relevant, aff = affected, curr = current function LOCALOPT(Y start , x, X rev , X aff )"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-147",
"text": "a competing node like keep * under wraps from being correctly activated."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-148",
"text": "Instead, the algorithm identifies a small set of \"relevant\" nodes which are the most important to the status of the node under consideration."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-149",
"text": "Since turned-off nodes have no direct effect on each other, only turned-on nodes above, below, or overlapping with the current node in the lattice need be considered."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-150",
"text": "Once the relevant nodes have been identified, all nodes (including turned-off nodes) whose explainedness is affected by one or more of the relevant nodes are identified."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-151",
"text": "Next, a search is carried out for the optimal configuration of the relevant nodes, starting from an 'all-on' state and iteratively considering new states with one relevant node turned off; the search continues as long as there is an improvement in explainedness."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-152",
"text": "Since the node interactions are roughly cumulative in their effects, this approach will generally identify the optimal state without the need for an exhaustive search."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-153",
"text": "See Algorithm 1 for details."
},
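Algorithm 1 itself is only summarized above; the following is a hedged Python sketch of the loop as described (frequency-ordered passes to convergence, a capped set of relevant nodes, and a greedy all-on-then-switch-off search). All names and data shapes are ours, not the authors' pseudocode.

```python
def optimize_lattice(nodes, explainedness, relevant_of, max_relevant=5):
    """Sketch of the main optimization loop over the lattice.

    nodes:         nodes ordered by frequency, as in the paper.
    explainedness: fn({node: bool}) -> total explainedness of a state.
    relevant_of:   fn(node, state) -> turned-on nodes whose status
                   matters for `node` (above/below/overlapping it).
    """
    state = {n: False for n in nodes}
    changed = True
    while changed:  # iterate over the lattice until full convergence
        changed = False
        for node in nodes:  # frequency order favors common n-grams
            relevant = list(relevant_of(node, state))[:max_relevant]
            group = [node] + relevant
            # Search from an all-on state, greedily turning one
            # group node off while explainedness keeps improving.
            best = dict(state, **{n: True for n in group})
            best_score = explainedness(best)
            improved = True
            while improved:
                improved = False
                for n in group:
                    if not best[n]:
                        continue
                    cand = dict(best, **{n: False})
                    score = explainedness(cand)
                    if score > best_score:
                        best, best_score, improved = cand, score, True
            if best != state:
                state, changed = best, True
    return [n for n, on in state.items() if on]
```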
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-154",
"text": "Omitted from Algorithm 1 for clarity are various low-level efficiencies which prevent the algorithm from reconsidering states already checked or from recalculating the explainedness of nodes when unnecessary."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-155",
"text": "We also apply the following efficiency restrictions, which significantly reduce the runtime of the algorithm."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-156",
"text": "In each case, more extreme (less efficient) values were individually tested using a development set and found to provide no benefit in terms of the quality of the output lexicon:"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-157",
"text": "\u2022 We limit the total number of relevant nodes to 5."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-158",
"text": "When there are more than 5 nodes turned on in the vicinity of the target node, the most relevant nodes are selected by ranking candidates by the absolute difference in explainedness across possible configurations of the target and candidate node considered in isolation; \u2022 To avoid having to deal with storing and processing trivial overlaps, we exclude overlaps with a count of less than 5 from our lattice; \u2022 Many nodes have a minLPR which is slightly larger than 1 (the lowest possible value)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-159",
"text": "There is very little chance these nodes will be activated by the algorithm, and so after applying hard covering, we do not consider activating nodes with minLPR < 2."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-161",
"text": "**EVALUATION**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-162",
"text": "We evaluate our approach across three different languages, including evaluation sets derived from four different corpora selected for their size and linguistic diversity."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-163",
"text": "In English, we follow Brooke et al. (2015) in using a 890M token filtered portion of the ICWSM blog corpus (Burton et al., 2009 ) tagged with the Tree Tagger (Schmid, 1995) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-164",
"text": "To facilitate a comparison with Newman et al. (2012) , which does not scale up to a corpus as large as the ICWSM, we also build a lexicon using the 100M token British National Corpus (Burnard, 2000) , using the standard CLAWS-derived POS tags for the corpus."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-165",
"text": "Lemmatization included removing all inflectional marking from both words and POS tags."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-166",
"text": "For English, gaps are identified using the same POS regex used in Brooke et al. (2015) , which includes simple nouns and portions thereof, up to a maximum of 4 words."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-167",
"text": "The other two languages we include in our evaluation are Croatian and Japanese."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-168",
"text": "Relative to English, both languages have freer word order: we were interested in probing the challenges associated with using an n-gram approach to FS identification in such languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-169",
"text": "For Croatian, we used the 1.2-billion-token fhrWaC corpus (\u0160najder et al., 2013) , a filtered version of the Croatian web corpus hrWaC (Ljube\u0161i\u0107 and Klubi\u010dka, 2014) , which is POS-tagged and lemmatized using the tools of ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-170",
"text": "Similar to English, the POS regex for Croatian includes simple nouns, adjectives and pronouns, but also other elements that regularly appear inside FS, including both adverbs and copulas."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-171",
"text": "For Japanese, we used a subset of the 100M-page web corpus of Shinzato et al. (2008) , which was roughly the same token length as the English corpus."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-172",
"text": "We segmented and POS-tagged the corpus with MeCab (Kudo, 2008) using the UNIDIC morphological dictionary (Den et al., 2007) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-173",
"text": "The POS regex for Japanese covers the same basic nominal structures as English, but also includes case markers and adverbials."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-174",
"text": "Though our processing of Japanese includes basic lemmatization related to superficial elements like the choice of writing script and politeness markers, many elements (such as case marking) which are removed by lemmatization in Croatian are segmented into independent morphological units in the MeCab output, making the task somewhat different for the two languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-175",
"text": "Brooke et al. (2015) introduced a method for evaluating FS extraction without a reference lexicon or direct annotation of the output of a model."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-176",
"text": "Instead, n-grams are sampled after applying the frequency threshold and then annotated as being either an FS or not."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-177",
"text": "Benefits of this style of evaluation include replicability, the diversity of FS, and the ability to calculate a true F-score."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-178",
"text": "We use the annotation of 2000 n-grams in the ICWSM corpus from that earlier work, and applied the same annotation methodology to the other three corpora: after training and based on written guidelines derived from the definitions of Wray (2008), three native-speaker, educated annotators judged 500 contiguous n-grams and another 500 gapped n-grams for each corpus."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-179",
"text": "Other than the inclusion of new languages, our test sets differ from Brooke et al. (2015) in two ways."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-180",
"text": "One advantage of a type-based annotation approach, particularly with regards to annotation with a known subjective component, is that it is quite sensible to simply discard borderline cases, improving reliability at the cost of some representativeness."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-181",
"text": "As such we entirely excluded from our test set n-grams which just one annotator marked as FS."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-182",
"text": "Table 1 contains the counts for the four test sets after this filtering step and Fleiss' Kappa scores before (\"Pre\") and after (\"Post\")."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-183",
"text": "The second change is that for the main evaluation we collapsed gapped and contiguous n-grams into a single test set."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-184",
"text": "The rationale is that the number of positive gapped examples is too low to provide a reliable independent F-score."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-185",
"text": "Our primary comparison is with the heuristic LPR model of Brooke et al. (2015) , which is scalable to large corpora and includes gapped n-grams."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-186",
"text": "For the BNC, we also benchmark against the DP-seg model of Newman et al. (2012) with recommended settings, and the LocalMaxs algorithm of da Silva and Lopes (1999) using SCP; neither of these methods scale to the larger corpora."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-187",
"text": "4 Because these other approaches only generate sequential multiword units, we use only the sequential part of the BNC test set for this evaluation."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-188",
"text": "All comparison approaches have themselves been previously compared against a wide range of association measures."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-189",
"text": "As such, we do not repeat all these comparisons here, but we do consider a lexicon built from ranking n-grams according to the measure used in our lattice (minLPR) as well as PMI and raw frequency."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-190",
"text": "For each of these association measures we rank all n-grams above the frequency threshold and build a lexicon equal to the size of the lexicon produced by our model."
},
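The baseline construction described above is a plain sort-and-truncate; a small sketch, with hypothetical data shapes (a dict of association scores and a dict of counts) that are ours, not the authors' code:

```python
def baseline_lexicon(ngram_scores, ngram_counts, freq_threshold, lexicon_size):
    """Rank all n-grams above a frequency threshold by an association
    measure (e.g. minLPR, PMI, or raw count) and keep the top k,
    where k matches the size of the lattice model's lexicon.
    """
    candidates = [(ng, score) for ng, score in ngram_scores.items()
                  if ngram_counts[ng] >= freq_threshold]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [ng for ng, _ in candidates[:lexicon_size]]
```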
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-191",
"text": "We created small development sets for each corpus and used them to do a thorough testing of parameter settings."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-192",
"text": "Although it is generally possible to increase precision by increasing C, we found that across corpora we always obtained near-optimal results with C = 4, so to demonstrate the usefulness of the lattice technique as an entirely off-the-shelf tool, we present the results using identical settings for all four corpora."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-193",
"text": "We treat covering as a fundamental part of the Lattice model, but to investigate the efficacy of other node interactions within the model we present results with overlap and clearing node interactions turned off."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-194",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-195",
"text": "**RESULTS**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-196",
"text": "The main results for FS acquisition across the four corpora are shown in Table 2 ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-197",
"text": "As noted in Section 2, simple statistical association measures like PMI do poorly when faced with syntactically-unrestricted n-grams of variable length: minLPR is clearly a much better statistic for this purpose."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-198",
"text": "The LPRseg method of Brooke et al. (2015) consistently outperforms simple ranking, and the lattice method proposed here does better still, with a margin that is fairly consistent across the languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-199",
"text": "Generally, clearing and overlap node interactions provide a relatively large increase in precision at the cost of a smaller drop in recall, though the change is fairly symmetrical in Croatian."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-200",
"text": "When only covering is used, the results are fairly similar to Brooke et al. (2015) , which is unsurprising given the extent to which decomposition and covering are related."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-201",
"text": "The Japanese and ICWSM corpora have relatively high precision and low recall, whereas both the BNC and Croatian corpora have low precision and high recall."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-202",
"text": "In the contiguous FS test set for the BNC (Ta- Table 2 : Results of FS identification in various test sets: Countrank = ranking with frequency; PMIrank = PMI-based ranking; minLPRrank = ranking with minLPR; LPRseg = the method of Brooke et al. (2015) ; \"\u2212cl\" = no clearing; \"\u2212ovr\" = no penalization of overlaps; \"P\" = Precision; \"R\" = Recall; and \"F\" = F-score."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-203",
"text": "Bold is best in a given column."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-204",
"text": "The performance difference of the Lattice model relative to the best baseline for all test sets considered together is significant at p < 0.01 (based on the permutation test: Yeh (2000))."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-205",
"text": "Table 3 : Results of FS identification in contiguous BNC test set; LocalMaxs = method of da Silva and Lopes (1999); DP-seg = method of Newman et al. (2012) ble 3), we found that both the LocalMaxs algorithm and the DP-seg method of Newman et al. (2012) were able to beat our other baseline methods with roughly similar F-scores, though both are well below our Lattice method."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-206",
"text": "Some of the difference seems attributable to fairly severe precision/recall imbalance, though we were unable to improve the F-score by changing the parameters from recommended settings for either model."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-207",
"text": "----------------------------------"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-208",
"text": "**DISCUSSION**"
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-209",
"text": "Though the results across the four corpora are reasonably similar with respect to overall F-score, there are some discrepancies."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-210",
"text": "By using the standard UNI-DIC morpheme representation as the base unit for Japanese, the model ends up doing an extra layer of FS identification, one which is provided by word boundaries in the other languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-211",
"text": "The result is that there are relatively more FS for Japanese: precision is high, and recall is comparably low."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-212",
"text": "Importantly, the initial n-gram statistics actually reflect that Japanese is different: the number of n-gram types over length 4 is almost twice the number in the ICWSM corpus."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-213",
"text": "One idea for future work is to automatically adapt to the input language/corpus in order to ensure a good balance between precision and recall."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-214",
"text": "At the opposite extreme, the low precision of the BNC is almost certainly due to its relatively small size: whereas the n-gram threshold we used here results in minimum counts of roughly 100 for the other three corpora, the BNC statistics include n-grams with counts of less than 10."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-215",
"text": "At such low counts, LPR is less reliable and more noise gets into the lexicon: the first column of Table 4 shows that the BNC is noticeably larger then the other lexicons, and the higher numbers in columns 2 and 3 (number of POS types and percentage of gapped expressions, resp.) are also indicative of increased noise."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-216",
"text": "This could be resolved by increasing the n-gram threshold."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-217",
"text": "It might also make sense to simply avoid smaller corpora, though for some applications a smaller corpus Table 4 : Statistics for the lexicons created by our lattice method may be unavoidable."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-218",
"text": "One idea we are pursing is modifying the calculation of the LPR metric to use a more conservative probability estimate than maximum likelihood in the case of low counts."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-219",
"text": "We were interested in Croatian and Japanese in part because of their relatively free word order, and whether the handling of gaps would help with identifying FS in these languages."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-220",
"text": "We discovered, however, that free word order actually results in more of a tendency towards contiguous FS, not less, a fact that is reflected in our test sets (Table 1) as well as the lexicons themselves (Table 4) ."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-221",
"text": "Strikingly rare in Croatian, in particular, are expressions where the content of a gap is an argument which must be filled to syntactically complete an expression: it is English whose fixed-word-order constraints often keep elements of an FS distant from each other."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-222",
"text": "The gaps that do happen in Croatian are mostly prosodydriven insertions of other elements into already complete FS."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-223",
"text": "This phenomena highlights a problem with the current model, in that gapped and contiguous versions of the same n-gram sequence (e.g., take away and take * away) are, at present, considered entirely independently."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-224",
"text": "Alternatives for dealing with this include collapsing statistics to create a single node in the lattice, creating a promoting link between contiguous and gapped versions of the same n-grams sequence in the lattice model, or switching to a dependency representation (which, we note, requires very little change to the basic model presented here, but would narrow its applicability)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-225",
"text": "The statistics in Table 4 otherwise reflect the quantity and diversity of FS across the corpora, particularly in terms of the number of POS patterns represented in the lexicon."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-226",
"text": "Looking at the most common POS patterns across languages, only noun-noun and adjective-noun combinations ever account for more than 5% of all word types in any of the lexicons."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-227",
"text": "Though some of the diversity can of course be attributed to noise, it is safe to say that most FS do not fall into the standard two-word syntactic categories used in MWE work, and therefore identifying them requires a much more general approach like the one presented here."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-228",
"text": "Table 5 contains 10 randomly selected examples from each of the lexicons produced by our method."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-229",
"text": "Among the English examples, most of the clear errors are bigrams that reflect particular biases of their respective corpora: The phrase via slashdot comes from boilerplate text identifying the source of an article, whereas Maureen (from Maureen says) is a character in one of the novels included in the BNC."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-230",
"text": "The longer FS mostly seem sensible, in that they are plausible lexicalized constructions, though be open to all * in the from the BNC seems too long and is likely the result of noise due to insufficient examples."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-231",
"text": "Some FS are dialectal variants, for instance license endorsed refers to British traffic violations."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-232",
"text": "More generally, the FS lexicons created by these two corpora are quite distinct, sharing less than 50% of their entries."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-233",
"text": "One striking thing about the non-English FS is how poorly they translate: many good FS in these languages become extremely awkward when translated into English."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-234",
"text": "This is expected, of course, for idioms like biti general poslije bitke \"be the general after the battle\" (i.e., \"hindsight is 20/20\"), but it extends to other relatively compositional constructions like \u3053\u3046 \u8a00\u3046 * \u304c \u7d9a\u304f \"repeat occurrences of * like this\" and \u524d\u671f \u6bd4 \"first half comparison\"."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-235",
"text": "This highlights the potential importance of focusing on FS when learning a language."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-236",
"text": "Though some of the errors seem to be the result of extra material added to a good FS, for instance promet teretnih vozila \"good 465 English (ICWSM) heart ache, so * have some time, part of the blame, via slashdot, any more questions, protein expression, work in * bank, al-qaeda terrorist, continue discussions, speak about * issue English (BNC) go into decline, Maureen says, be open to all * in the, Peggy Sue, square * shoulders, delivery system for, this * also includes, license endorsed, point * finger, highly * asset Croatian negativno utjecati na \"negatively affects on\", jedan od dobrih poznavatelja \"one of the best connoisseurs of\", jasno * je da \"it is clear to * that\", promet teretnih vozila \"good vehicle traffic\", odvratiti pozornost \"divert attention\", biti general poslije bitke \"be the general after the battle\", popularni internetski \"popular internet\", izazvati kaos \"cause chaos\", austrijski investitor \"Austrian investor\", ideja o gradnji \"the idea of building\" Japanese \u9ad8\u901f \u9053\u8def \u6574\u5099 \"highway construction\", \u5e74\u6b21 \u5f8c\u671f \"the second half of the fiscal year\", \u52b4 \u50cd \u8005 \u6d3e\u9063 \u4e8b\u696d \"temporary labor agency\", \u3053\u3046 \u8a00\u3046 * \u304c \u7d9a\u304f \"repeat occurrences of * like this\", \u98a8\u90aa \u3063 \u5339 \"cold sufferer\", \uff24\uff28\uff23\uff30 \u30b5\u30fc\u30d0\u30fc \"DHCP server\", \u524d\u671f \u6bd4 \"first half comparison\", \u7d4c\u55b6 \u4e8b\u9805 \u5be9\u67fb \"examination of administrative affairs\", \u81ea\u5206 \u306e \u6587\u7ae0 \"own writing\", \u6df1\u3044 \u5473\u308f\u3044 \"deep flavor\" vehicle traffic\", most, again, are somewhat inexplicable artifacts of the corpus they were built from, like austrijski investitor \"Austrian investor\"."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-237",
"text": "Since Zipfian frequency curves naturally extend to multiword vocabulary, our lexicons (and typebased evaluation of them) are of course dominated by rarer terms."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-238",
"text": "This is not, we would argue, a serious drawback, since in practical terms there is very little value in focusing on common FS like of course which manually-built lexicons already contain; most of the potential in automatic extraction comes from the long tail."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-239",
"text": "However, we did investigate the other end of the Zipfian curve by extracting the 20 most common MWEs (including both strong and weak) from the Schneider et al. (2014b) corpus."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-240",
"text": "In the ICWSM lexicon, our recall for these common terms was fairly high (0.7), with errors mostly resulting from longer phrases containing these terms \"winning out\" (in the lattice) over shorter phrases, which have relatively low LPR due to extremely common constituent words; for example, we missed on time, but had 19 FS which contain it (e.g. right on time, show up on time, and start on time)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-241",
"text": "In one case which showed this same problem, waste * time, the lexicon did have its ungapped version, highlighting the potential for improved handling of this issue."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-242",
"text": "In Section 2, we noted that FS is generally a much broader category than MWE, which we take as referring to terms which carry significant noncompositional meaning."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-243",
"text": "We decided to investigate the distinction at a practical level by annotating the positive examples in the ICWSM test set for being MWE or non-MWE FS."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-244",
"text": "5 First, we note that only 28% of our FS types were labeled MWE; this is in contrast to, for instance, the annotation of Schneider et al. (2014b) where \"weak\" MWE make up a small fraction of MWE types."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-245",
"text": "Even without any explicit representation of compositionality, our model did much better at identifying MWE FS than non-MWE FS: 0.7 versus 0.32 recall."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-246",
"text": "This may simply reflect, however, the fact that a disproportionate number of MWEs were noun-noun compounds, which are fairly easy for the model to identify."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-247",
"text": "Due to the lack of spaces between words and an agglutinative morphology, the standard approach to tokenization and lemmatization in Japanese involves morphological rather than word segmentation."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-248",
"text": "In terms of the content of the resulting lexicon we believe the effect of this difference on FS extraction is modest, since much of the extra FS in Japanese would simply be single words in other languages (and considered trivially part of the FS lexicon)."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-249",
"text": "However, from a theoretical perspective we might very much prefer to build FS for all languages starting from morphemes rather than words."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-250",
"text": "Such a framework could, for instance, capture inflectional flexibility versus fixedness directly in the model, with fixed inflectional morphemes included as a distinct element of the FS and flexible morphemes becoming gaps."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-251",
"text": "However, for many languages this would result in a huge blow up in complexity with only modest increases in the scope of FS identification."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-252",
"text": "Though it is indisputable that inflectional fixedness is part of the lexical information contained in an FS, in practice this sort of information can be efficiently derived post hoc from the corpus statistics."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-253",
"text": "Though we have demonstrated that competition within a lattice is a powerful method for the production of multiword lexicons, its usefulness derives less from the specific choices we have made in this instantiation of the model, and more from the flexiblity that such a model provides for future research."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-254",
"text": "Not only do alternatives like DP-seg and LocalMaxs fail to scale up to large corpora, there are few obvious ways to improve on their simple underlying algorithms without compromising their elegance and worsening tractability."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-255",
"text": "Fast and functional, the LPR decomp approach is nevertheless algorithmically ungainly, involving multiple layers of heuristic-driven filtering with no possibility of correcting errors."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-256",
"text": "Our lattice method is aimed at something between these extremes: a practical, optimizable model, but with various component heuristics that can be improved upon."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-257",
"text": "For instance, though the current version of clearing is effective and has practical advantages relative to simpler options that we tested, it could be enhanced by more careful investigation of the statistical properties of n-grams which contain FS."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-258",
"text": "We can also consider adding new terms to the exponents of the two parts of our objective function, analagous to the cover, clear, and overlap functions, based on other relationships between nodes in the lattice."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-259",
"text": "One which we have considered is creating new connections between identical or similar syntactic patterns, which could serve to encourage the model to generalize."
},
{
"sent_id": "57ef27eefdf272bead22212863a8a8-C001-260",
"text": "In English, for instance, it might learn that verb-particle combinations are generally likely to be FS, whereas verb-determiner combinations are not."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-35"
],
[
"57ef27eefdf272bead22212863a8a8-C001-49"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-35",
"57ef27eefdf272bead22212863a8a8-C001-49"
]
},
"@USE@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-52",
"57ef27eefdf272bead22212863a8a8-C001-53",
"57ef27eefdf272bead22212863a8a8-C001-54",
"57ef27eefdf272bead22212863a8a8-C001-55"
],
[
"57ef27eefdf272bead22212863a8a8-C001-62"
],
[
"57ef27eefdf272bead22212863a8a8-C001-163"
],
[
"57ef27eefdf272bead22212863a8a8-C001-166"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-55",
"57ef27eefdf272bead22212863a8a8-C001-62",
"57ef27eefdf272bead22212863a8a8-C001-163",
"57ef27eefdf272bead22212863a8a8-C001-166"
]
},
"@DIF@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-52",
"57ef27eefdf272bead22212863a8a8-C001-53",
"57ef27eefdf272bead22212863a8a8-C001-54",
"57ef27eefdf272bead22212863a8a8-C001-55"
],
[
"57ef27eefdf272bead22212863a8a8-C001-75"
],
[
"57ef27eefdf272bead22212863a8a8-C001-179"
],
[
"57ef27eefdf272bead22212863a8a8-C001-198"
],
[
"57ef27eefdf272bead22212863a8a8-C001-264"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-55",
"57ef27eefdf272bead22212863a8a8-C001-75",
"57ef27eefdf272bead22212863a8a8-C001-179",
"57ef27eefdf272bead22212863a8a8-C001-198",
"57ef27eefdf272bead22212863a8a8-C001-264"
]
},
"@SIM@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-61"
],
[
"57ef27eefdf272bead22212863a8a8-C001-62"
],
[
"57ef27eefdf272bead22212863a8a8-C001-166"
],
[
"57ef27eefdf272bead22212863a8a8-C001-200"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-61",
"57ef27eefdf272bead22212863a8a8-C001-62",
"57ef27eefdf272bead22212863a8a8-C001-166",
"57ef27eefdf272bead22212863a8a8-C001-200"
]
},
"@EXT@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-61"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-61"
]
},
"@MOT@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-94"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-94"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"57ef27eefdf272bead22212863a8a8-C001-185"
]
],
"cite_sentences": [
"57ef27eefdf272bead22212863a8a8-C001-185"
]
}
}
},
"ABC_74b8684eaabda30a2d8705adcb19a2_4": {
"x": [
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-94",
"text": "**EXTRACTING ENTITY MENTIONS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-45",
"text": "Their model achieved state-of-the-art results on the GENIA dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-2",
"text": "We propose a novel recurrent neural network-based approach to simultaneously handle nested named entity recognition and nested entity mention detection."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-3",
"text": "The model learns a hypergraph representation for nested entities using features extracted from a recurrent neural network."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-4",
"text": "In evaluations on three standard data sets, we show that our approach significantly outperforms existing state-of-the-art methods, which are feature-based."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-5",
"text": "The approach is also efficient: it operates linearly in the number of tokens and the number of possible output labels at any token."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-6",
"text": "Finally, we present an extension of our model that jointly learns the head of each entity mention."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-9",
"text": "Named entity recognition (or named entity detection) is the task of identifying text spans associated with proper names and classifying them according to their semantic class such as person, organization, etc."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-10",
"text": "It is related to the task of mention detection (or entity mention recognition) in which text spans referring to named, nominal or prominal entities are identified and classified according to their semantic class (Florian et al., 2004) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-11",
"text": "Both named entity recognition and entity mention detection are fundamental components in information extraction systems: several downstream tasks such as relation extraction (Mintz et al., 2009) , coreference resolution (Chang et al., 2013) and fine-grained opinion mining (Choi et al., 2006 ) rely on both."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-12",
"text": "Many approaches have been successfully employed for the tasks of named entity recognition and mention detection, including linear-chain conditional random fields (Lafferty et al., 2001 ) and semi-Markov conditional random fields (Sarawagi and Cohen, 2005) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-13",
"text": "However, most such methods suffer from an inability to handle nested named entities, nested entity mentions, or both."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-14",
"text": "As a result, the downstream tasks necessarily ignore these nested entities along with any semantic relations among them."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-15",
"text": "Consider, for example, the excerpts below:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-17",
"text": "**S1**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-18",
"text": "shows a nested named entity from the GENIA dataset (Ohta et al., 2002) : \"human B cell line\" and \"EBV -transformed human B cell line\" are both considered named entities of type CELL LINE where the former is embedded inside the latter."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-19",
"text": "S2, derived from the ACE corpora 1 , shows a PERSON named entity (\"Sheikh Abbad\") nested in an entity mention of type LOCATION (\"the burial site of Sheikh Abbad\")."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-20",
"text": "Most existing methods for named entity recognition and entity mention detection would miss the nested entity in each sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-46",
"text": "However, the time complexity of their model is O(n 3 ), where n is the number of tokens in the sentence, making inference slow."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-21",
"text": "Unfortunately, nested entities can be fairly common: 17% of the entities in the GENIA corpus are embedded within another entity; in the ACE corpora, 30% of sentences contain nested named entities or entity mentions, thus warranting the development of efficient models to effectively handle these linguistic phenomena."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-22",
"text": "Feature-based methods are the most common among those proposed for handling nested named entity and entity mention recognition."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-23",
"text": "Alex et al. (2007) , for example, proposed a cascaded CRF model but it does not identify nested named entities of the same type."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-24",
"text": "Finkel and Manning (2009) proposed building a constituency parser with constituents for each named entity in a sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-25",
"text": "Their approach is expensive, i.e., time complexity is cubic in the number of words in the sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-26",
"text": "Lu and Roth (2015) later proposed a mention hypergraph model for nested entity detection with linear time complexity."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-27",
"text": "And recently, Muis and Lu (2017) introduced a multigraph representation based on mention separators for this task."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-28",
"text": "All of these models depend on manually crafted features."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-29",
"text": "In addition, they cannot be directly applied to extend current state-of-the-art recurrent neural networkbased models -for flat named entity recognition (Lample et al., 2016) or the joint extraction of entities and relations (Katiyar and Cardie, 2016) to handle nested entities."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-30",
"text": "In this paper, we propose a recurrent neural network-based model for nested named entity and nested entity mention recognition."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-31",
"text": "We present a modification to the standard LSTM-based sequence labeling model that handles both problems and operates linearly in the number of tokens and the number of possible output labels at any token."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-32",
"text": "The proposed neural network approach additionally jointly models entity mention head 2 information, a subtask found to be useful for many information extraction applications."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-33",
"text": "Our model significantly outperforms the previously mentioned hypergraph model of Lu and Roth (2015) and Muis and Lu (2017) on entity mention recognition for the ACE2004 and ACE2005 corpora."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-34",
"text": "It also outperforms their model on joint extraction of nested entity mentions and their heads."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-35",
"text": "Finally, we evaluate our approach on nested named entity recognition using the GENIA dataset and show that our model outperforms the previous state-of-the-art parser-based approach of Finkel and Manning (2009) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-37",
"text": "**RELATED WORK**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-38",
"text": "Several methods have been proposed for named entity recognition in the existing literature as summarized by Nadeau and Sekine (2007) in their survey paper."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-39",
"text": "Early techniques in the supervised domain have been based on hidden markov models (e.g., Zhou and Su (2002) ) or, later, conditional random fields (CRFs) (e.g., McDonald and Pereira (2005) )."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-40",
"text": "Many fewer approaches, however, have addressed the problem of nested entities."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-41",
"text": "Alex et al. (2007) presented several techniques based on CRFs for nested named entity recognition for the GENIA dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-42",
"text": "They obtained their best results from a cascaded approach, where they applied CRFs in a specific order on the entity types, such that each CRF utilizes the output derived from previous CRFs."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-43",
"text": "Their approach could not identify nested entities of the same type."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-44",
"text": "Finkel and Manning (2009) proposed a CRF-based constituency parser for nested named entities such that each named entity is a constituent in the parse tree."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-47",
"text": "As a result, we do not adopt their parse tree-based representation of nested entities and propose instead a linear time directed hypergraph-based model similar to that of Lu and Roth (2015) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-48",
"text": "Directed hypergraphs were also introduced for parsing by Klein and Manning (2001) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-49",
"text": "While most previous efforts for nested entity recognition were limited to named entities, Lu and Roth (2015) addressed the problem of nested entity mention detection where mentions can either be named, nominal or pronominal."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-50",
"text": "Their hypergraph-based approach is able to represent the potentially exponentially many combinations of nested mentions of different types."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-51",
"text": "They adopted a CRF-like log-linear approach to learn these mention hypergraphs and employed several hand-crafted features defined over the input sentence and the output hypergraph structure."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-52",
"text": "Our approach also learns a similar hypergraph representation with differences in the types of nodes and edges in the hypergraph."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-53",
"text": "It does not depend on any manually crafted features."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-54",
"text": "Also, our model learns the hypergraph greedily and significantly outperforms their approach."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-55",
"text": "Recently, Muis and Lu (2017) introduced the notion of mention separators for nested entity mention detection."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-56",
"text": "In contrast to the hypergraph representation that we and Lu and Roth (2015) adopt, they learn a multigraph representation and are able to perform exact inference on their structure."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-57",
"text": "It is an interesting orthogonal possible approach for nested entity mention detection."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-58",
"text": "How-ever, we will show that our model also outperforms their approach on all tasks."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-59",
"text": "Recently, recurrent neural networks (RNNs) have been widely applied to several sequence labeling tasks achieving state-of-the-art results."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-60",
"text": "Lample et al. (2016) proposed neural models based on long short term memory networks (LSTMs) and CRFs for named entity recognition and another transition-based approach inspired by shift-reduce parsers."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-61",
"text": "Both models achieve performance comparable to a state-of-the-art model (Luo et al., 2015) , but neither handles nested named entities."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-62",
"text": "3 Encoding Scheme Figure 1 shows the desired sequence tagging output for each of three overlapping PER entities (\"his\", \"his fellow pilot\" and \"his fellow pilot David Williams\") according to the standard BILOU tag scheme."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-63",
"text": "Our approach relies on the fact that we can (1) represent these three tag sequences in the single hypergraph structure of Figure 2 and then (2) design an LSTM-based neural network that produces the correct nested entity hypergraph for a given input sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-64",
"text": "In the paragraphs just below we provide a general description of hypergraphs and our task-specific use of them."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-65",
"text": "Sections 3.1 and 3.2 describe the hypergraph construction process; Section 4 presents the LSTM-based sequence tagging method for automating hypergraph construction."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-66",
"text": "We express our structured prediction problem such that it corresponds to building a hypergraph that encodes the token-level gold labels for all entities in the input sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-67",
"text": "3 In particular, we represent the problem as a directed hypergraph."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-68",
"text": "For those new to this formalism, directed hypergraphs are very much like standard directed graphs except that nodes are connected by hyperarcs that connect a set of tail nodes to a set of head nodes."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-92",
"text": "We learn the set of head nodes connected to a tail node by expressing it as a multi-label learning problem as described in Section 5."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-69",
"text": "To better explain our desired output structure, we further distinguish between two types of hyperarcs -normal edges (or arcs) that connect a single tail node to a single head node, and hyperarcs that contain more than one node either as the head or as the tail."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-70",
"text": "The former are shown as straight lines in Figure 2 ; the latter as curved edges."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-71",
"text": "In our encoding of nested entities, a hyperarc is introduced when two or more entity mentions requiring different label types are present at the same position."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-72",
"text": "In Figure 2 , for example, the nodes \"O\" (corresponding to the input token \"that\") and the nodes \"U PER\" and \"B PER\" (corresponding to the input token \"his\") are connected by a hyperarc because three entity mentions start at this time step from the tail \"O\" node (two of which share the \"B PER\" tag)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-73",
"text": "4"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-75",
"text": "**HYPERGRAPH CONSTRUCTION**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-76",
"text": "Let us first discuss how the problem of nested entity recognition can be expressed as finding a hypergraph."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-77",
"text": "Our goal is to represent the BILOU tag sequences associated with \"his\", \"his fellow pilot\" and \"his fellow pilot David Williams\" as the single hypergraph structure of Figure 2 ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-78",
"text": "This is accomplished by collapsing the shared states (labels) in the output entity label sequences into a single state as shown in Figure 2 : e.g., the three \"O\" labels for \"that\" become a single \"O\"; the two \"B PER\" labels at \"his\" are collapsed into one \"B PER\" node that joins \"U PER\", the latter of which represents the entity mention \"his\"."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-79",
"text": "Thus at any time step, the representation size is bounded by the number of possible output states instead of the potentially exponential number of output sequences."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-80",
"text": "We then also adjust the directed edges such that they have the same type of head node and the same type of tail node as before in Figure 1 ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-81",
"text": "If we look closely at Figure 2 then we realise that there is an extra \"O\" node in the hypergraph corresponding to the token \"his\" which did not appear in any entity output sequence in Figure 1 : in our task-specific hypergraph construction we make sure that there is an \"O\" node at every timestep to model the possibility of beginning of a new entity."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-82",
"text": "The need for this will become more clear in Section 4."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-83",
"text": "Note that the hypergraph representation of our model is similar to Lu and Roth (2015) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-84",
"text": "Also, the expressiveness of our model is exactly the same as Lu and Roth (2015) ; Muis and Lu (2017) ."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-85",
"text": "The major difference in the two approaches is in learning."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-87",
"text": "**EDGE PROBABILITY**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-88",
"text": "In this section, we discuss our assignment of probabilities to all the possible edges from a tail node which helps in the greedy construction of the hypergraph."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-89",
"text": "Thus at any timestep t, let g t\u22121 be the tail node and x be the current word of the sentence; then we model probability distribution over all the possible types of head nodes (different output tag types) conditioned on the tail node and the current word token."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-90",
"text": "In our work we use hidden representations learned from an LSTM model as features to learn these probability distributions using a crossentropy objective."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-91",
"text": "It is important to note that there are two types of directed edges in this hypergraph -simple edges for which there is only one head node for every tail node which can be learned as in a traditional sequence labeling task, or hyperarcs that connect more than one head node to a tail node."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-95",
"text": "As described in Section 3.2, we can assign probabilities to the different types of edges in the hypergraph and at the time of decoding we choose for each token the (normal) edge(s) with maximum probability and the hyperarcs with probability above a predefined threshold."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-96",
"text": "Thus, we can extract edges at the time of decoding."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-97",
"text": "Ultimately, however, we are interested in extracting nested entities from the hypergraph."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-98",
"text": "For this, we construct an adjacency matrix from the edges discovered and perform depth-first search from the sentenceinitial token to discover the entity mentions."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-99",
"text": "This is described in detail in Section 5.1."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-101",
"text": "**METHOD**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-102",
"text": "We use a standard LSTM-based sequence labeling model to learn the nested entity hypergraph structure for an input sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-103",
"text": "Figure 3 shows part of the network structure."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-104",
"text": "It is a standard bidirectional LSTM network except for a difference in the top hidden layer."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-105",
"text": "When computing the representation of the top hidden layer L at any time step t, in addition to making use of the hidden unit representation from the previous time step t \u2212 1 and hidden unit representation from the preceding layer L \u2212 1, we also input the label embedding of the gold labels from the previous time step."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-106",
"text": "For the token \"fellow\" in Figure 3 , for example, we compute three different top hidden layer representations, conditioned respectively on the three labels \"U PER\", \"B PER\" and \"O\" from the previous time step t \u2212 1."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-107",
"text": "Thus, we can model complex interactions between the input and the output."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-108",
"text": "Before passing the learned hidden representation to the next time step, we average the three different top hidden layer representations."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-109",
"text": "In this manner, we can model the interactions between the different overlapping labels and also it is computation- Figure 3 : Dynamically computed network structure based on bi-LSTMs for nested entity mention extraction."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-110",
"text": "We show part of the structure for the entity mentions in the running example in Figure 1. ally less expensive than storing the hidden layer representations for each label sequence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-112",
"text": "**MULTI-LAYER BI-LSTM**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-113",
"text": "We use a multi-layer bi-directional LSTM encoder, for its strength in capturing long-range dependencies between tokens, a useful property for information extraction tasks."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-114",
"text": "Using LSTMs, we can compute the hidden state \u2212 \u2192 h t in the forward direction and \u2190 \u2212 h t in the backward direction for every token, and use a linear combination of them as the token representation:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-116",
"text": "**TOP HIDDEN LAYER**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-117",
"text": "At the top hidden layer, we have a decoder-style model, with a crucial twist to accommodate the hypergraph structure, which may have multiple gold labels at the previous step."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-118",
"text": "At each token t and for each gold label at the previous step g k t\u22121 , our network takes the hidden representation from the previous layer z from the previous time step, and computes:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-119",
"text": "Unlike the encoder LSTM, this decoder LSTM is single-directional and bifurcates when multiple gold labels are present."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-120",
"text": "We use the decoder hidden states h (L),k t in the output layer for prediction, as explained in Section 4.3."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-121",
"text": "However, before passing the hidden representation to the next time step we average h (L),k t over all the gold labels k:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-122",
"text": "summarizes the information for all the gold labels from the previous time step."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-124",
"text": "**ENTITY EXTRACTION**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-125",
"text": "For each token t and previous gold label g k t\u22121 , we use the decoder state h (L),k t to predict a probability distribution over the possible candidate labels using a linear layer followed by a normalizing transform (illustrated below with softmax)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-126",
"text": "The outputs can be interpreted as conditional probabilities for the next label given the current gold label:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-127",
"text": "Special care is required, however, since the desired output has hyperarcs."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-128",
"text": "As shown in Figure 2 , there is an hyperarc between \"I PER\" corresponding to the token \"fellow\" and the label set \"L PER\" and \"I PER\" corresponding to the token \"pilot\"."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-129",
"text": "Thus, in our network structure conditioned on the previous label \"I PER\" in this case, we would like to predict both \"L PER\" and \"I PER\" as the next labels."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-130",
"text": "To accommodate this, we use a multi-label training objective, as described in Section 5."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-132",
"text": "**TRAINING**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-133",
"text": "We train our model using two different multi-label learning objectives."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-134",
"text": "The idea is to represent the gold labels as a distribution over all possible labels, encoded as a vector e. Hence, for simple edges, the distribution has a probability of 1 for the unique gold label (e g = 1), and 0 everywhere else."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-135",
"text": "For hyperarcs, we distribute the probability mass uniformly over all the gold labels in the gold label set (e k g = 1 |G| for all k)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-136",
"text": "Thus, for the example described earlier in Section 4.3, both the labels \"L PER\" and \"I PER\" receive a probability of 0.5 in the gold label distribution e k t , conditioned on the label \"I PER\" from the previous time step."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-137",
"text": "Softmax."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-138",
"text": "Our first training method uses softmax to estimate the predicted probabilities, and the KL-divergence multi-label loss between the true distribution e k t and the predicted distribution e k t = softmax(o k t ):"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-139",
"text": "Sparsemax."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-140",
"text": "Our second training method makes use of sparsemax, recently introduced by Martins and Astudillo (2016) as a sparse drop-in replacement to softmax, as well as a loss function."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-141",
"text": "Unlike softmax, which always outputs a nonzero probability for any output, sparsemax outputs zero probability for most of the unlikely classes, leading to good empirical results on multi-label tasks."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-142",
"text": "For our problem, there are only a few nested entities at any timestep in the gold labels thus using a training objective that learns a sparse distribution is more appropriate."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-143",
"text": "Sparsemax can filter out part of the output space, as in multi-label problems, leaving non-zero probability only on the desired output labels."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-144",
"text": "Formally, sparsemax returns the Euclidean projection of its input o onto the probability simplex:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-145",
"text": "The corresponding loss, a sparse version of the KL divergence, is (up to a constant):"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-146",
"text": "This function is convex and differentiable, and the quantity \u03c4 is a byproduct of the simplex projection, as described in Martins and Astudillo (2016)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-147",
"text": "For either choice of probability estimation, the total loss of a training sample is the sum of losses for each token and for each previous gold label:"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-149",
"text": "**DECODING**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-150",
"text": "At the time of inference, we greedily decode our hypergraph from left-to-right to find the most likely sub-hypergraph."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-151",
"text": "During training, at each timestep the most likely label set is learned conditioned on a gold label from the previous timestep."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-152",
"text": "However, gold labels are not available at test time."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-153",
"text": "Thus, we use the predicted labels from the previous time step as an input to the current time step to find the most likely label set."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-154",
"text": "We use a hard threshold T to determine the predicted label set: P^k_t = {c : \u00ea^k_t[c] > T}."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-155",
"text": "We can get the most likely label set P^c_t for any predicted label c \u2208 P^k_{t-1} at the previous time step using the above decoding strategy."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-156",
"text": "We now combine these inferences to find the most likely entity mention sequences."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-157",
"text": "We construct an adjacency matrix A for each time step, such that A[\u00ea^c_{t-1}][\u00ea^k_t] += 1 for every c in the predicted label set P^k_t at timestep t conditioned on \u00ea^k_t, and for every k in the predicted labels P_{t-1} at time step t \u2212 1."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-158",
"text": "This can be viewed as a directed hypergraph with several connected components."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-159",
"text": "We then perform a depth-first search on this directed hypergraph to find all the entity mentions in the sentence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-161",
"text": "**MODELING ENTITY HEADS FOR ACE DATASETS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-162",
"text": "The ACE datasets also have annotations for mention heads along with the entity mentions."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-163",
"text": "For example, a sentence with the entity mention \"the U.S. embassy\" also contains an annotation for its head word which is \"embassy\" in this case."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-164",
"text": "Thus, we modify our model to also extract the heads of the entity mentions for the ACE datasets."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-165",
"text": "We jointly model the entity mentions and their heads."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-166",
"text": "To do this, we propose a simple extension to our model by only changing the output label sequence."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-167",
"text": "We introduce new labels starting with \"H\" to indicate that the current token in the entity mention is part of its head."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-168",
"text": "Thus, we only change the output label sequence for the entity mentions to include the head label: we train with the label sequence \"B ORG I ORG H ORG\" instead of \"B ORG I ORG L ORG\"."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-169",
"text": "Also, since we predict the \"O\" tag at the end of every entity sequence, we can still extract the entity mentions."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-170",
"text": "At decoding time, we output the sequence of words with the \"H\" tag as the head words for a mention."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-172",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-173",
"text": "We evaluate our model on two tasks: nested entity mention detection for the ACE corpora and nested named entity recognition for the GENIA dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-175",
"text": "**ACE EXPERIMENTS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-177",
"text": "**DATA**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-178",
"text": "We perform experiments on the English section of the ACE2004 and ACE2005 corpora."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-179",
"text": "There are seven main entity types: Person (PER), Organization (ORG), Geo-Political Entity (GPE), Location (LOC), Facility (FAC), Weapon (WEA), and Vehicle (VEH)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-180",
"text": "For each entity type, there are annotations for the entity mention and mention heads."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-182",
"text": "**EVALUATION METRICS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-183",
"text": "We use a strict evaluation metric similar to Lu and Roth (2015) : an entity mention is considered correct if both the mention span and the mention type are exactly correct."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-184",
"text": "Similarly, for the task of joint extraction of entity mentions and mention heads, the mention span, head span and the entity type should all exactly match the gold label."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-185",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-186",
"text": "**BASELINES AND PREVIOUS MODELS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-187",
"text": "We compare our model with the feature-based mention hypergraph model (MH-F) of Lu and Roth (2015) on both entity mention detection and the joint extraction of mentions and mention heads."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-188",
"text": "We also compare with Muis and Lu (2017) on entity mention detection only, as their model cannot detect the head phrases of entity mentions."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-189",
"text": "Lu and Roth (2015) compare their approach with CRF-based approaches, such as a linear-chain CRF, a semi-Markov CRF, and a cascaded approach (Alex et al., 2007), and show that their model outperforms them."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-190",
"text": "Hence, we do not include those results in our paper."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-191",
"text": "We also implement several LSTM-based baselines for comparison."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-192",
"text": "Our first baseline is a standard sequence labeling LSTM model (LSTM-flat)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-193",
"text": "A sequence labeling model cannot handle nested mentions, so we remove the embedded (inner) entity mentions and keep only the longer mention."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-194",
"text": "Our second baseline is a hypergraph model (LSTM-output layer) in which the dependencies are modeled only at the output layer: there are no connections from the label embeddings of the previous timestep to the top hidden layer; instead, these connections are limited to the output layer."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-196",
"text": "**HYPERPARAMETERS AND TRAINING DETAILS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-197",
"text": "We use Adadelta (Zeiler, 2012) for training our models."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-198",
"text": "We initialize our word vectors with 300-dimensional word2vec (Mikolov et al., 2013) word embeddings."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-199",
"text": "These word embeddings are tuned during training."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-200",
"text": "We regularize our network using dropout (Srivastava et al., 2014) , with the dropout rate tuned on the development set."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-201",
"text": "There are 3 hidden layers in our network, and the dimensionality of the hidden units is 100 in all our experiments. We set the threshold T to 0.3."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-202",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-203",
"text": "**RESULTS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-204",
"text": "We show the performance of our approaches in Table 1 compared to the previous state-of-the-art systems (Lu and Roth, 2015; Muis and Lu, 2017) on both the ACE2004 and ACE2005 datasets."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-205",
"text": "We find that our LSTM-flat baseline that ignores embedded entity mentions during training performs worse than Lu and Roth (2015); however, our other neural network-based approaches all outperform the previous feature-based approach."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-206",
"text": "Among the neural network-based models, we find that our models that construct a hypergraph perform better than the LSTM-flat model."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-207",
"text": "Also, our approach that models dependencies between the input and the output by passing the prediction from the previous timestep, as shown in Figure 3, performs better than the LSTM-output layer model, which only models dependencies at the output layer."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-208",
"text": "Also, as expected, the sparsemax method that produces a sparse probability distribution performs better than the softmax approach for modeling hyperedges."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-209",
"text": "In summary, our sparsemax model is the best performing model."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-210",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-211",
"text": "**JOINT MODELING OF HEADS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-212",
"text": "We report the performance of our best performing models on the joint modeling of entity mentions and their heads in Table 2."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-213",
"text": "We show that our sparsemax model is still the best performing model."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-214",
"text": "We also find that, as the total number of possible labels at any timestep increases because of the way we encode entity heads, the gains from incorporating sparsemax are significantly higher compared to the results shown in Table 1."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-215",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-216",
"text": "**GENIA EXPERIMENTS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-218",
"text": "**DATA**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-219",
"text": "We also evaluate our model on the GENIA dataset (Ohta et al., 2002) for nested named entity recognition."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-220",
"text": "We follow the same dataset split as Finkel and Manning (2009); Lu and Roth (2015); Muis and Lu (2017)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-221",
"text": "Thus, the first 90% of the sentences were used for training and the remaining 10% for evaluation."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-222",
"text": "We also consider five entity types: DNA, RNA, protein, cell line, and cell type."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-223",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-224",
"text": "**BASELINES AND PREVIOUS MODELS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-225",
"text": "We compare our model with the constituency parser-based CRF model of Finkel and Manning (2009), the mention hypergraph model of Lu and Roth (2015), and the recent multigraph model of Muis and Lu (2017)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-226",
"text": "Table 3 : Performance on the GENIA dataset on nested named entity recognition."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-227",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-228",
"text": "**RESULTS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-229",
"text": "Our model outperforms the mention hypergraph model of Lu and Roth (2015)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-230",
"text": "We suspect that this is because we use pretrained word embeddings trained on PubMed data (Pyysalo et al., 2013), whereas Lu and Roth (2015) did not have access to them."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-231",
"text": "We again find that our neural network model outperforms the previous state-of-the-art systems (Finkel and Manning, 2009; Muis and Lu, 2017)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-232",
"text": "However, we see that both softmax and sparsemax models perform comparably on this dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-233",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-234",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-235",
"text": "Consistent with existing results on the joint modeling of related tasks in NLP, we find that jointly modeling heads and their entity mentions increases the F-score on entity mentions by 1 point (i.e., to 71.4 for the sparsemax model on the ACE2005 dataset)."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-236",
"text": "The precision on extracting entity mentions is 72.1 (vs. 70.6 in Table 1) for our sparsemax model for the ACE2005 dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-237",
"text": "Example S1 below compares the output from a softmax vs. a sparsemax model on the joint modeling of an entity mention and its head on the ACE2005 dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-238",
"text": "Gold-standard annotations are shown in red."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-239",
"text": "for the high premiums of a few specialities?"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-240",
"text": "Based on the gold standard, the models are required to extract \"their\" -an entity mention of type PER as well as its head -and \"their patients\", which overlaps with the previous entity mention \"their\" and has the head word \"patients\"."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-241",
"text": "This means that the models are required to predict a hyperedge from \"O\" to \"H PER; B PER\"."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-242",
"text": "We find that the softmax model shown in blue can only predict the entity mention \"their\" omitting completely the entity mention \"their patients\" whereas the sparsemax model shown in green can predict both nested entities."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-243",
"text": "Overall, then, sparsemax appears to model hyperedges more effectively than softmax, and the performance gains come from extracting more nested entities."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-244",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-245",
"text": "**LIMITATIONS AND FUTURE DIRECTIONS**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-246",
"text": "We also manually scanned the test-set predictions of our sparsemax model on the ACE dataset to understand its current limitations."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-247",
"text": "In S2, the sparsemax model predicts both entity mentions of \"they\" as the PER entity type."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-248",
"text": "Only if the previous sentence in the corpus is accessible -\"And if you ride inside that tank, it is like riding in the bowels of a dragon\" -can we understand that \"they\" in S2 refers to the tank and hence is a VEH."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-249",
"text": "Thus, our model can be improved by providing additional context for each sentence rather than making predictions on each sentence in the corpus independently."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-250",
"text": "In the example sentences, \"It\" refers to a facility and an event, respectively."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-251",
"text": "Our model does not distinguish between the two cases and always predicts the token \"It\" as a non-entity."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-252",
"text": "We found this true for all occurrences of the token \"It\" in our test set."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-253",
"text": "The incorporation of coreference information can potentially overcome this limitation."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-254",
"text": "For S5, the gold-standard annotation for \"both of these teams\" is an ORG entity mention with the token \"teams\" as its head word."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-255",
"text": "Our sparsemax model identifies the entity mention correctly but instead predicts the token \"both\" as the head."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-256",
"text": "It also identifies \"these teams\" as another nested entity mention with the head word \"teams\"."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-257",
"text": "In contrast, we also found entity mentions such as \"all of the victims that get a little money\", for which the gold standard annotates \"all\" as the head, along with another nested mention \"the victims that get a little money\" with \"victims\" as the head."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-258",
"text": "We recognize this as an inconsistency in the gold-standard annotation."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-259",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-260",
"text": "**PRONOMINAL ENTITY MENTION (IT**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-261",
"text": "----------------------------------"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-262",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-263",
"text": "In this paper, we present a novel recurrent network-based model for nested named entity recognition and nested entity mention detection."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-264",
"text": "We propose a hypergraph representation for this problem and learn the structure using an LSTM network in a greedy manner."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-265",
"text": "We show that our model significantly outperforms the feature-based mention hypergraph model (Lu and Roth, 2015) and a recent multigraph model (Muis and Lu, 2017) on the ACE datasets."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-266",
"text": "Our model also outperforms the constituency parser-based approach of Finkel and Manning (2009) on the GENIA dataset."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-267",
"text": "In future work, it would be interesting to learn global dependencies between the output labels of such a hypergraph structure and to train the model globally."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-268",
"text": "We can also experiment with different representations such as the one in Finkel and Manning (2009) and use the recent advances in neural network approaches (Vinyals et al., 2015) to learn the constituency parse tree efficiently."
},
{
"sent_id": "74b8684eaabda30a2d8705adcb19a2-C001-269",
"text": "interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, DARPA or the U.S. Government."
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"74b8684eaabda30a2d8705adcb19a2-C001-33"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-205"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-230"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-265"
]
],
"cite_sentences": [
"74b8684eaabda30a2d8705adcb19a2-C001-33",
"74b8684eaabda30a2d8705adcb19a2-C001-205",
"74b8684eaabda30a2d8705adcb19a2-C001-230",
"74b8684eaabda30a2d8705adcb19a2-C001-265"
]
},
"@SIM@": {
"gold_contexts": [
[
"74b8684eaabda30a2d8705adcb19a2-C001-47"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-83"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-84"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-183"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-220"
]
],
"cite_sentences": [
"74b8684eaabda30a2d8705adcb19a2-C001-47",
"74b8684eaabda30a2d8705adcb19a2-C001-83",
"74b8684eaabda30a2d8705adcb19a2-C001-84",
"74b8684eaabda30a2d8705adcb19a2-C001-183",
"74b8684eaabda30a2d8705adcb19a2-C001-220"
]
},
"@BACK@": {
"gold_contexts": [
[
"74b8684eaabda30a2d8705adcb19a2-C001-49"
]
],
"cite_sentences": [
"74b8684eaabda30a2d8705adcb19a2-C001-49"
]
},
"@USE@": {
"gold_contexts": [
[
"74b8684eaabda30a2d8705adcb19a2-C001-56"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-220"
]
],
"cite_sentences": [
"74b8684eaabda30a2d8705adcb19a2-C001-56",
"74b8684eaabda30a2d8705adcb19a2-C001-220"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"74b8684eaabda30a2d8705adcb19a2-C001-187"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-204"
],
[
"74b8684eaabda30a2d8705adcb19a2-C001-225"
]
],
"cite_sentences": [
"74b8684eaabda30a2d8705adcb19a2-C001-187",
"74b8684eaabda30a2d8705adcb19a2-C001-204",
"74b8684eaabda30a2d8705adcb19a2-C001-225"
]
}
}
},
"ABC_e0b72115e1905226d22876e72aa304_4": {
"x": [
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-2",
"text": "This study focuses on the acquisition of commonsense knowledge."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-26",
"text": "They proposed a simple neural network model that can embed arbitrary phrases on-the-fly and achieved reasonable accuracy for ConceptNet."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-3",
"text": "A previous study proposed a commonsense knowledge base completion (CKB completion) method that predicts a confidence score of triplet-style knowledge for improving the coverage of CKBs."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-4",
"text": "To improve the accuracy of CKB completion and expand the size of CKBs, we formulate a new commonsense knowledge base generation task (CKB generation) and propose a joint learning method that incorporates both CKB completion and CKB generation."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-5",
"text": "Experimental results show that the joint learning method improved completion accuracy and the generation model created reasonable knowledge."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-6",
"text": "Our generation model could also be used to augment data and improve the accuracy of completion."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-9",
"text": "Knowledge bases (KBs) are a kind of information network, and they have been applied to many natural language processing tasks such as question answering (Yang and Mitchell, 2017; Long et al., 2017) and dialog tasks (Young et al., 2018) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-10",
"text": "In this paper, we focus on commonsense knowledge bases (CKBs)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-11",
"text": "Commonsense knowledge is also referred to as background knowledge and is used in natural language application tasks that require reasoning based on implicit knowledge."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-12",
"text": "For example, machine comprehension tasks that need commonsense reasoning have been proposed very recently (Lin et al., 2017; Ostermann et al., 2018) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-13",
"text": "In particular, Wang et al. (2018) used commonsense knowledge provided by ConceptNet (Speer et al., 2017) to efficiently resolve ambiguities and infer implicit information."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-14",
"text": "Information in a CKB is represented as RDF-style triples \u27e8t_1, r, t_2\u27e9, where t_1 and t_2 are arbitrary words or phrases, and r \u2208 R is a relation between t_1 and t_2."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-15",
"text": "For example, \u27e8go to restaurant, subevent, order food\u27e9 means \"order food\" happens as a subevent of \"go to restaurant\"."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-16",
"text": "Although researchers have developed techniques for acquiring CKBs from raw text with patterns (Angeli and Manning, 2013), it has been pointed out that some kinds of knowledge are rarely expressed explicitly in textual corpora (Gordon and Van Durme, 2013)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-17",
"text": "Therefore, researchers have developed curated CKB resources by manual annotation (Speer et al., 2017) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-18",
"text": "While manually created knowledge has high precision, these resources suffer from lack of coverage."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-19",
"text": "Knowledge base completion methods are used to improve the coverage of existing general-purpose KBs, such as Freebase (Bollacker et al., 2008; Bordes et al., 2013; Lin et al., 2015)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-20",
"text": "For example, given a node pair \u27e8Athens, Greece\u27e9, a completion method predicts the missing relation \"IsLocatedIn\"."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-21",
"text": "Such KBs consist of well-connected entities; thus, the completion methods are mainly used to find missing links between existing nodes."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-22",
"text": "On the other hand, CKBs are very sparse because their nodes contain arbitrary phrases and it is difficult to define all phrases in advance."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-23",
"text": "Therefore, it is important to consider CKB completion that can robustly take arbitrary phrases as input queries, even if they are not contained in the CKBs, to improve the coverage."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-24",
"text": "Li et al. (2016b) proposed an on-the-fly CKB completion model to improve the coverage of CKBs."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-25",
"text": "They defined the CKB completion task as a binary classification distinguishing true knowledge from false knowledge for arbitrary triples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-27",
"text": "Here, in order to acquire new knowledge by using a CKB completion model, we have to prepare triplet candidates as input for the completion model, because the model can only verify whether the triple is true or not."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-28",
"text": "Li et al. (2016b) extracted such triplet candidates from the raw text of Wikipedia and also randomly sampled them from the phrase and relation sets of ConceptNet."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-29",
"text": "Triples extracted from raw text likely contain unseen phrases, i.e., ones that do not exist in the CKB, and these phrases are useful for expanding the node size of the CKB; however, they reported that the quality of triples acquired from Wikipedia was significantly lower than that of triples composed from ConceptNet, because candidates extracted from Wikipedia with linguistic patterns are noisier than those drawn from ConceptNet."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-30",
"text": "To acquire new knowledge with high quality, there remain problems in expanding to new nodes and in the accuracy of CKB completion."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-31",
"text": "In this study, we focus on increasing both the node size and the connectivity of CKBs."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-32",
"text": "We introduce a new commonsense knowledge base generation (CKB generation) task for generating new nodes."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-33",
"text": "We also devise a model that jointly learns the completion and generation tasks."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-34",
"text": "The generation task generates an arbitrary phrase t_2 from an input query and relation pair \u27e8t_1, r\u27e9. Jointly learning the two tasks improves the completion task, and triples generated by the generation model can be used as additional training data for the completion model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-35",
"text": "Our contributions are summarized as follows:"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-36",
"text": "\u2022 We define a new task, called commonsense knowledge base generation, and propose a method for joint learning of knowledge base completion and knowledge base generation."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-37",
"text": "\u2022 Experimental results demonstrate that our method achieved state-of-the-art CKB completion results on both ConceptNet and Japanese commonsense knowledge datasets."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-38",
"text": "\u2022 Experimental results show that our CKB generation can generate reasonable knowledge and augmented data generated by the model can improve CKB completion."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-39",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-40",
"text": "**TASK DEFINITION**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-41",
"text": "Our study focuses on two tasks, CKB completion and CKB generation."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-42",
"text": "We describe the settings of these tasks below."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-43",
"text": "Problem 1 (CKB completion)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-45",
"text": "**PROPOSED METHOD**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-46",
"text": "The proposed method is illustrated in Figure 1 ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-47",
"text": "Our method consists of two models."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-48",
"text": "It performs both the CKB completion and CKB generation tasks."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-49",
"text": "The two models share the parameters of the phrase encoder, word embeddings, and relation embeddings."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-50",
"text": "We describe these models in detail in Sections 3.1 and 3.2."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-52",
"text": "**CKB COMPLETION MODEL**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-53",
"text": "The basic structure of our CKB completion model is similar to that of Li et al. (2016b) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-54",
"text": "The main difference between ours and theirs is that our method learns the CKB completion and generation tasks jointly."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-55",
"text": "The completion model only considers a binary classification task, and therefore it can easily overfit when there is not enough training data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-56",
"text": "By incorporating the generation model, the shared layers are trained for both binary classification and phrase generation."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-57",
"text": "This is expected to be a good constraint to prevent overfitting."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-58",
"text": "Previous model: Li et al. (2016b) defined a CKB completion model that estimates a confidence score for an arbitrary triple \u27e8t_1, r, t_2\u27e9. They used a simple neural network to compute score(t_1, r, t_2) \u2208 R:"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-59",
"text": "score(t_1, r, t_2) = w\u22a4 g(W [v_12; v_r] + b), where"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-60",
"text": "v_12 is a phrase representation obtained by concatenating the representations of t_1 and t_2."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-61",
"text": "v_r \u2208 R^{d_r} is the relation embedding for r, and g is a nonlinear activation function."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-62",
"text": "Note that we use ReLU for g."
},
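As a rough sketch of this kind of scoring network (the layer shapes, weight names, and final linear projection are our assumptions, not the paper's exact architecture), the confidence score of a triple could be computed as:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def score_triple(v1, v2, v_r, W, b, w):
    """Score a triple <t1, r, t2>: concatenate the two phrase vectors,
    append the relation embedding, apply one hidden layer with the
    nonlinearity g = ReLU, and project to a scalar confidence score."""
    v12 = np.concatenate([v1, v2])      # phrase representation of (t1, t2)
    v_in = np.concatenate([v12, v_r])   # input to the hidden layer
    h = relu(W @ v_in + b)              # hidden layer with g = ReLU
    return float(w @ h)                 # score(t1, r, t2) in R

rng = np.random.default_rng(0)
d_p, d_r, d_h = 200, 200, 1000          # dims follow the model configuration section
v1, v2 = rng.normal(size=d_p), rng.normal(size=d_p)
v_r = rng.normal(size=d_r)
W = rng.normal(size=(d_h, 2 * d_p + d_r))
b = np.zeros(d_h)
w = rng.normal(size=d_h)
print(score_triple(v1, v2, v_r, W, b, w))
```

The dimensions mirror those reported later (200-dimensional embeddings, 1000-dimensional hidden layer); the paper additionally applies batch normalization to the hidden-layer input, which is omitted here.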
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-63",
"text": "Our model: Our CKB completion model is based on that of Li et al. (2016b)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-64",
"text": "However, the shared structure and the formulation of the phrase representation v_12 are different."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-65",
"text": "Li et al. (2016b) formulate the phrase embedding by using attention pooling over LSTM hidden states and a bilinear function,"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-66",
"text": "where J is the number of words in phrase t_i, u is a linear transformation vector for calculating the attention weights, x_j^i and h_j^i are the j-th word embedding and LSTM hidden state for phrase t_i, and v_r is the relation embedding."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-67",
"text": "Note that we calculated v_12 for DNN-AVG and DNN-LSTM by concatenating v_1 and v_2."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-68",
"text": "We used batch normalization (Ioffe and Szegedy, 2015) on v_in before passing it through the next layer."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-70",
"text": "**CKB GENERATION MODEL**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-71",
"text": "We use an attentional encoder-decoder model to generate phrase knowledge."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-72",
"text": "Here, we expected that the quality of the phrase representation would be increased by sharing the BiLSTM and embeddings between the CKB completion and CKB generation models."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-73",
"text": "For constructing the encoder-decoder model, we use relation information in addition to word sequences."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-74",
"text": "Let X = (x_1, x_2, ..., x_J) be the input word sequence and Y = (y_1, y_2, ..., y_T) be the output word sequence."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-75",
"text": "The conditional generation probability of Y is P(Y | X, r; \u03b8) = \u220f_{t=1}^{T} P(y_t | y_{<t}, X, r; \u03b8),"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-76",
"text": "where \u03b8 is the set of model parameters, s_t is the hidden state of the decoder, and c_t is a context vector over the input sequence, weighted by the attention probabilities and calculated as c_t = \u2211_{j=1}^{J} \u03b1_{tj} h_j."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-77",
"text": "Here, the BiLSTM, which is the encoder of the CKB generation model, is shared with that of the CKB completion model described in equation (2)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-78",
"text": "As shown in equation (9), we use relation embedding v r as additional input information."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-79",
"text": "There are several related studies on incorporating additional label information in a decoder (Li et al., 2016a) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-80",
"text": "Although the previous work used additional labels mainly for representing individuality or style information, we use this idea to represent relation information."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-81",
"text": "We also use the technique of tying word vectors and word classifiers (Inan et al., 2016) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-82",
"text": "The encoder BiLSTM is a single-layer bidirectional LSTM, and the decoder is a single-layer LSTM."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-83",
"text": "We use a triple \u27e8t 1 , r, t 2 \u27e9 for training the encoder-decoder model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-84",
"text": "We train our models to be dual-directional."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-85",
"text": "In the forward direction, the model predicts t_2 from the input \u27e8t_1, r\u27e9, and in the backward direction, it predicts t_1 from the input \u27e8t_2, r\u27e9. Since the relation r has a direction, we introduce a new relation r\u2032 for each r so that dual-directional CKB generation can be trained in one model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-86",
"text": "In the reverse direction, we replace the relation label r with r\u2032; namely, the output is t_1 and the input is \u27e8t_2, r\u2032\u27e9. Therefore, in our CKB generation model, the relation vocabulary is twice the size of the original relation set."
},
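A minimal sketch of the relation-doubling scheme described above (the `_rev` suffix is a hypothetical naming convention for the new reverse relation r\u2032):

```python
def build_dual_directional_data(triples):
    """For each triple <t1, r, t2>, emit a forward example (input <t1, r>,
    output t2) and a backward example with a new reverse relation r'
    (input <t2, r'>, output t1), doubling the relation vocabulary."""
    examples = []
    for t1, r, t2 in triples:
        examples.append(((t1, r), t2))           # forward direction
        examples.append(((t2, r + "_rev"), t1))  # backward direction with r'
    return examples

data = build_dual_directional_data([("dog", "HasA", "tail")])
print(data)
# [(('dog', 'HasA'), 'tail'), (('tail', 'HasA_rev'), 'dog')]
```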
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-88",
"text": "**TRAINING**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-89",
"text": "Loss function: We use the following loss function for training: L(\u03b8) = L_c + \u03bb L_g, where \u03b8 is the set of model parameters, L_c is the loss function of our CKB completion model, and L_g is the loss function of our CKB generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-90",
"text": "We use the binary cross entropy for L_c: L_c = \u2212\u2211_\u03c4 [ l log \u03c3(score(\u03c4)) + (1 \u2212 l) log(1 \u2212 \u03c3(score(\u03c4))) ],"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-91",
"text": "where \u03c4 denotes a triple \u27e8t_1, r, t_2\u27e9 and l is a binary variable that is 1 if the triple is a positive example (a true triple) and 0 if it is a negative example (a false triple), as explained in the next subsection."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-92",
"text": "\u03c3 is a sigmoid function."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-93",
"text": "We formulate the loss function of the encoder-decoder (CKB generation) model by using the cross entropy: L_g = \u2212\u2211_{n=1}^{N} \u2211_{t=1}^{T^{(n)}} log P(y_t^{(n)} | y_{<t}^{(n)}, c_t, r),"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-94",
"text": "where N is the sample size, T^{(n)} is the number of words in the n-th output phrase, c_t is the context vector of the input sequence, and r is the relation label."
},
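A toy sketch of the combined objective L(\u03b8) = L_c + \u03bbL_g, using scalar triple scores and per-token probabilities in place of the actual shared-encoder models (the input values are illustrative only):

```python
import math

def completion_loss(scores, labels):
    """Binary cross entropy over triple scores: L_c."""
    total = 0.0
    for s, l in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))  # sigma(score(tau))
        total -= l * math.log(p) + (1 - l) * math.log(1 - p)
    return total

def generation_loss(token_probs):
    """Cross entropy over decoder steps: L_g, summed over samples and words."""
    return -sum(math.log(p) for phrase in token_probs for p in phrase)

def joint_loss(scores, labels, token_probs, lam=1.0):
    """L(theta) = L_c + lambda * L_g (the paper sets lambda = 1.0)."""
    return completion_loss(scores, labels) + lam * generation_loss(token_probs)

loss = joint_loss(scores=[2.0, -1.5], labels=[1, 0],
                  token_probs=[[0.9, 0.8], [0.7]])
print(round(loss, 4))  # 1.0135
```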
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-95",
"text": "Data augmentation: We also let the trained model generate additional training triples, e.g., predicting t_1 from the input \u27e8t_2, r\u2032\u27e9. This idea is inspired by back-translation, a technique for improving NMT models (Sennrich et al., 2016)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-96",
"text": "To filter out unreliable candidates, we use the CKB completion score as a threshold."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-97",
"text": "We refer to the generated augmentation data as \"auggen\" in the experiment section."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-99",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-101",
"text": "**DATA**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-102",
"text": "For the experiments in English, we used the ConceptNet 100K data released by Li et al. (2016b)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-103",
"text": "The original ConceptNet is a large-scale and multilingual CKB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-104",
"text": "However, the evaluation set, which was created from a subset of the whole ConceptNet, consists only of English data and contains many short phrases, including single words."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-105",
"text": "In order to evaluate the robustness of CKB completion models in terms of the language and long phrases, we created a new open-domain Japanese commonsense knowledge dataset, Ja-KB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-106",
"text": "The statistics of these data are listed in Table 1."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-107",
"text": "There are more relation labels in ConceptNet than in Ja-KB, because we limited the relation types, which often contain nouns and verbs, when creating the Ja-KB data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-108",
"text": "The relation set of Ja-KB is Causes, MotivatedBy, Subevent, HasPrerequisite, ObstructedBy, Antonym, and Synonym."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-109",
"text": "The average length of phrases in Ja-KB is longer than in ConceptNet because of the data creation process."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-110",
"text": "The details of our dataset are described below:"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-111",
"text": "To create the Ja-KB data, we used crowdsourcing, as in Open Mind Common Sense (OMCS) (Singh et al., 2002)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-112",
"text": "Since data annotated by crowd workers is usually noisy, we performed a two-step data collection process to eliminate noisy data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-113",
"text": "In the data creation step, a crowd worker created triples \u27e8t_1, r, t_2\u27e9 from the provided keywords."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-114",
"text": "The keywords consisted of combinations of nouns and verbs that frequently appeared in Web texts."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-115",
"text": "Each crowd worker created an arbitrary phrase t_1 (or t_2) by using the provided keywords and then selected a relation r \u2208 R and created a corresponding phrase t_2 (or t_1)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-116",
"text": "In the evaluation step, three workers chose a suitable r \u2208 R when given a pair \u27e8t_1, t_2\u27e9 created by another worker."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-117",
"text": "Since a worker does not know which relation r the creator selected in the creation step, we can measure the reliability of the created knowledge from the overlap of the selected relations."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-118",
"text": "We used triples for which three or more workers selected the same relation label r. In a preliminary study, we found that the accuracy of CKB completion is lower when low-reliability data are used."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-119",
"text": "We randomly selected the test and validation data among the data for which all workers chose the same label."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-120",
"text": "The remaining data were used as training data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-121",
"text": "For the training data, we duplicated each triple according to the number of evaluators who selected the same label, in order to account for data reliability."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-122",
"text": "For example, if three evaluators selected the same label for a triple, we added that triple three times."
},
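The duplication scheme above can be sketched as follows (a toy illustration; the data representation is our assumption):

```python
def weight_by_agreement(triples_with_votes):
    """Duplicate each training triple as many times as the number of
    evaluators who selected the same relation label, so that more
    reliable knowledge is seen more often during training."""
    weighted = []
    for triple, n_agree in triples_with_votes:
        weighted.extend([triple] * n_agree)
    return weighted

train = weight_by_agreement([(("dog", "HasA", "tail"), 3),
                             (("run", "Causes", "tired"), 1)])
print(len(train))  # 4
```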
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-123",
"text": "For the test and validation data, we randomly sampled negative examples, as described in Section 4, whose number was the same as that of the positive examples, following Li et al. (2016b)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-124",
"text": "The details are described in the Supplementary Material."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-126",
"text": "**MODEL CONFIGURATIONS**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-127",
"text": "We set the dimensions of the hidden layer of the shared BiLSTM to 200, the word and relation embeddings to 200, and the intermediate hidden layer of the completion model to 1000."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-128",
"text": "We set the batch size to 100, dropout rate to 0.2, and weight decay to 0.00001."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-129",
"text": "For optimization, we used SGD and set the initial learning rate to 1.0."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-130",
"text": "We decayed the learning rate by a factor of 0.5 when adjusting it during training."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-131",
"text": "We set \u03bb of the loss function to 1.0."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-132",
"text": "The initial word embeddings were trained with fastText (Bojanowski et al., 2016) on Wikipedia text."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-133",
"text": "When generating the augmentation data, we set the threshold score of CKB completion to 0.95 for the ConceptNet data and 0.8 for the Ja-KB data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-134",
"text": "The additional data amounted to about 200,000 triples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-136",
"text": "**BASELINE METHOD**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-137",
"text": "CKB completion: As baselines, we used the DNN-AVG and DNN-LSTM models (Li et al., 2016b) described in Section 3.1."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-138",
"text": "To assess the effectiveness of joint learning, we compared our CKB completion model alone (proposed w/o CKB generation) with the joint model (proposed w/ CKB generation)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-139",
"text": "Moreover, we evaluated the effectiveness of simply adding the augmentation data described in Section 4 to the training data (+auggen)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-140",
"text": "We used the accuracy of binary classification as the evaluation measure."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-141",
"text": "The threshold was determined by using the validation1 data to maximize the accuracy of binary classification for each method, as in (Li et al., 2016b) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-142",
"text": "CKB generation: We used a simple attentional encoder-decoder model that does not use relation information as a baseline (base)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-143",
"text": "We compared the proposed model with and without joint learning (proposed and proposed w/o CKBC)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-144",
"text": "We also evaluated the effectiveness of simply adding the augmentation data described in Section 4 to the training data (+auggen)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-146",
"text": "**RESULTS**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-148",
"text": "**CKB COMPLETION**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-149",
"text": "Does the joint learning method improve the accuracy of CKB completion? Table 2 shows the accuracy of the CKB completion model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-150",
"text": "The bottom two lines show the best performances reported in (Li et al., 2016b) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-151",
"text": "The results indicate that our method improved the accuracy of CKB completion compared with the previous method."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-152",
"text": "Our method achieved 0.945 accuracy on the validation2 data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-153",
"text": "This result is close to human accuracy (about 0.95)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-154",
"text": "By comparing the results of the single model (proposed w/o CKB generation) and joint model (proposed w/ CKB generation), we can see that the joint model improved the accuracy for both ConceptNet and Ja-KB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-155",
"text": "This indicates that the loss function of CKB generation works as a good constraint for the CKB completion model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-156",
"text": "Does data augmentation from CKB generation improve the accuracy of CKB completion? Table 2 shows that the augmentation data slightly improved the accuracy on both the ConceptNet and Ja-KB test data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-157",
"text": "Human evaluation for assessing the quality of CKB completion: Since negative examples were randomly selected from the whole test set in the experiments described above (Table 2), it was easy to distinguish some of them as positive or negative examples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-158",
"text": "To evaluate the ability of CKB completion in a more difficult setting, we eliminated obviously false triples and performed manual annotation on the remaining triples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-159",
"text": "Then we conducted a binary classification experiment with these annotated triples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-160",
"text": "The details are described below. First, we prepared triple candidates by using the ConceptNet and Ja-KB datasets."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-161",
"text": "We replaced one of the phrases of the existing triple with a similar phrase, where the similarity was calculated by using the average of the word embeddings."
},
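Candidate phrases for the replacement step above could be ranked like this (the toy vocabulary and 2-dimensional embeddings are ours, for illustration only):

```python
import numpy as np

def phrase_vec(phrase, emb):
    # Average the word embeddings of a phrase (our simplification).
    return np.mean([emb[w] for w in phrase.split()], axis=0)

def most_similar(query, candidates, emb, k=2):
    # Rank candidate phrases by cosine similarity to the query phrase.
    q = phrase_vec(query, emb)
    def cos(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(candidates, key=lambda c: cos(phrase_vec(c, emb)),
                  reverse=True)[:k]

# Toy embeddings: "cat" points nearly the same way as "dog", "car" does not.
emb = {
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
print(most_similar("dog", ["car", "cat"], emb))  # ['cat', 'car']
```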
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-162",
"text": "We made 100 replacement triples per triple."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-163",
"text": "Next, we scored the prepared triples by using our CKB completion model and randomly sampled 500 triples whose CKB completion scores were larger than a threshold."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-164",
"text": "Then, ten annotators gave subjective evaluation scores to all 500 triples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-165",
"text": "In this evaluation, the annotators rated their degree of agreement with each statement (triple) on a 0-4 rating scale (0 = strongly disagree, 4 = strongly agree), where each annotator interpreted a triple as a statement by using the relation explanation."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-166",
"text": "For example, \u27e8dog, HasA, tail\u27e9 means \"a dog has a tail\"."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-167",
"text": "Finally, we sampled the top 100 triples with the smallest variance from the 500 annotated triples and labeled those with average scores of 3 or higher as 1 (positive examples; 57% and 55% of the top 100 triples for ConceptNet and Ja-KB, respectively) and those with average scores below 3 as 0 (negative examples; 43% and 45%)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-168",
"text": "Table 3 shows the binary classification accuracy for the 100 sampled triples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-169",
"text": "While the proposed method improved accuracy, the accuracy of +auggen was slightly lower than that of the proposed method."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-170",
"text": "This indicates that we have to select the augmentation data and the thresholds more carefully to improve the accuracy of difficult examples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-171",
"text": "Moreover, the overall scores are lower than the results in Table 2."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-172",
"text": "This indicates that there is room for improving CKB completion accuracy on difficult examples."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-173",
"text": "To distinguish more difficult examples and improve the accuracy of knowledge acquisition, we have to develop a better negative sampling strategy for training."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-175",
"text": "**CKB GENERATION**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-176",
"text": "It is difficult to evaluate the quality of the CKB generation model directly, since there are many correct phrase candidates in addition to phrases that appear in the test data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-177",
"text": "For that reason, we evaluated our CKB generation model from different viewpoints."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-178",
"text": "Can our CKB generation model generate reasonable phrases?"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-179",
"text": "To see whether the top-n phrases generated from each query in the test set included the reference phrase corresponding to the query, we calculated the recall of the reference phrases as follows:"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-180",
"text": "recall = N_match / N, where N_match is the number of generated phrases that exactly match the reference phrases and N is the number of test queries."
},
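A sketch of this recall computation, assuming exact string match between the n-best outputs and the reference phrase (the toy phrases are ours):

```python
def recall_at_n(nbest_outputs, references, n):
    """Fraction of test queries whose reference phrase appears among
    the model's top-n generated phrases (exact match)."""
    n_match = sum(1 for outputs, ref in zip(nbest_outputs, references)
                  if ref in outputs[:n])
    return n_match / len(references)

# Two test queries: the first reference is covered in the top-2, the second is not.
nbest = [["wag its tail", "bark", "run"], ["eat food", "sleep"]]
refs = ["bark", "drink water"]
print(recall_at_n(nbest, refs, 2))  # 0.5
```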
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-181",
"text": "Figure 2 shows the recall of the reference phrases for each CKB generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-182",
"text": "The results shown in the figure are averages over the test queries."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-183",
"text": "Compared with the baseline system, our CKB generation model achieved higher recalls on both ConceptNet and Ja-KB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-184",
"text": "This indicates that considering relation information worked well."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-185",
"text": "The effectiveness of using augmentation data is also illustrated in Figure 2 ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-186",
"text": "For the Ja-KB data, recall improved as a result of adding augmentation data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-187",
"text": "Since the phrases of nodes in ConceptNet are shorter than those in Ja-KB, it is easier to cover the reference phrases for ConceptNet."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-188",
"text": "Can our CKB generation model generate new phrases?"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-189",
"text": "To evaluate the effectiveness of our generation model at increasing the node size of a CKB, we determined whether our model could generate new phrases that are not included in the existing CKB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-190",
"text": "Figure 3 shows the average number of such new phrases in the n-best outputs of our model, generated from a query pair of a phrase and a relation in the test sets of ConceptNet and Ja-KB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-191",
"text": "We can see from the figure that our model could create triples containing new phrases by generating multiple phrases from a query pair."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-192",
"text": "The figure also plots the average CKB completion score of each generated triple that contains new phrases; the results confirm that the generated triples had a high CKB completion score."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-193",
"text": "Table 4 lists examples of phrases created by the generation model; score-g indicates the logarithmic probability of the generation model, and score-c indicates the score of the completion model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-194",
"text": "Table 4 : Examples of phrases created using CKB generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-195",
"text": "The relation label \"HP\" represents HasPrerequisite."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-196",
"text": "t_2 is the generated phrase, and the input is \u27e8t_1, r\u27e9. * indicates that the generated triple is new, and ** indicates that the generated t_2 is new."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-197",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-198",
"text": "**GENERATED EXAMPLES**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-199",
"text": "We conducted subjective evaluations of the quality of the triples generated with our model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-200",
"text": "First, we generated two types of query pairs: ones generated from ConceptNet (CN gen) and ones generated from Wikipedia (Wiki gen)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-201",
"text": "In CN gen, we used all phrase and relation pairs \u27e8t, r\u27e9 appearing in the test data."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-202",
"text": "In Wiki gen, we used triples extracted by using the POS tag sequence pattern for each relation according to Li et al. (2016b) and scored each triple with CKB completion scores."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-203",
"text": "Then, we used the \u27e8t, r\u27e9 pairs of 10,000 triples whose scores exceeded a threshold as the input query pairs."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-204",
"text": "Next, we generated a phrase t gen from \u27e8t, r\u27e9 and made new triples \u27e8t, r, t gen \u27e9 with our CKB generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-205",
"text": "We sorted the generated triples by their CKB completion scores and selected the top 100 new triples for CN gen and Wiki gen. The annotators then assigned a (semantic) quality score and a grammatical score to each triple."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-206",
"text": "We used the 0-4 degree-of-agreement score (described in 6.1) for evaluating triple quality and a 0-2 score for evaluating grammatical quality (0: does not make sense;"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-207",
"text": "1: there are some grammatical errors;"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-208",
"text": "2: there are no grammatical errors)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-209",
"text": "We recruited ten annotators who were native speakers of each language."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-210",
"text": "We show the results in Table 5 ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-211",
"text": "The quality score of each triple of CN gen was quite high."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-212",
"text": "The quality score of Wiki gen was lower than that of CN gen. Since Wikipedia contains a great deal of specific information, it is difficult to extract input queries that are useful for making commonsense knowledge."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-213",
"text": "This tendency is similar to the results reported by Li et al. (2016b)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-214",
"text": "The grammatical score was high for both CN gen and Wiki gen."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-215",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-216",
"text": "**RELATED WORK**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-217",
"text": "Knowledge base completion for entity-relation triples: There are many studies that embed graph structures, such as TransE, TransR, HolE, and STransE (Bordes et al., 2013; Lin et al., 2015; Nickel et al., 2016; Nguyen et al., 2016)."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-218",
"text": "Their methods aim to learn low-dimensional representations for entities and relationships by using topological features."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-219",
"text": "Although these methods are widely used, they rely on the connectivity of the existing KB and are only suitable for predicting relationships between existing, well-connected entities (Shi and Weninger, 2018) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-220",
"text": "Therefore, it is difficult to get good representations for new nodes that have no connections with existing nodes."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-221",
"text": "Several studies have added text information to the graph embeddings (Zhong et al., 2015; Wang and Li, 2016; Xiao et al., 2017) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-222",
"text": "These studies aim to incorporate richer information in the graph embedding."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-223",
"text": "They combine a graph embedding model and a text embedding model into one."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-224",
"text": "The text information they use is the description or definition statement of each node."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-225",
"text": "For example, they would use the description \"Barack Obama is the 44th and current President of the United States\" for the node \"Barack Obama\" to make better-quality embeddings."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-226",
"text": "Although these methods effectively incorporate text information, they assume that the descriptions of entities can be easily acquired."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-227",
"text": "For example, they use the originally aligned descriptions (e.g., DBpedia, Freebase) or descriptions acquired by using a simple entity linking method."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-228",
"text": "Moreover, the methods use topological information, and they are not designed for on-the-fly knowledge base completion."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-229",
"text": "Knowledge base completion for commonsense triples In commonsense knowledge base completion, the nodes of the KB consist of arbitrary phrases (word sequences), and there are a huge number of unique nodes."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-230",
"text": "In such case, the KB graph becomes very sparse, and consequently, there is almost no merit to considering the topological features of the KBs."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-231",
"text": "Moreover, on-the-fly KBC is needed because we have to handle new nodes as input."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-232",
"text": "It is thus more important to formulate phrase and relation embeddings that can robustly represent arbitrary phrases."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-233",
"text": "There are a few studies on CKB completion models."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-234",
"text": "In particular, Li et al. (2016b) and Socher et al. (2013) proposed a simple KBC model for CKB."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-235",
"text": "The formulations of CKB completion in the two studies are the same, and we evaluated Li et al. (2016b) 's method as a baseline."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-236",
"text": "Open Information Extraction Open Information Extraction (OpenIE) aims to extract triple knowledge from raw text."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-237",
"text": "It finds triples that have specific predefined relations by using lexical and syntactic patterns (Mintz et al., 2009; Fader et al., 2011) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-238",
"text": "Several neural-network-based relation extraction methods have been proposed (Lin et al., 2016; Zhang et al., 2017) ."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-239",
"text": "These models construct classifiers to estimate the relation between two arbitrary entities."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-240",
"text": "OpenIE models are trained with sentence-level annotation data or distant supervision, while our model is trained with triples in a knowledge base."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-241",
"text": "Since openIE can extract new triples from raw text, it can be used to make augmentation data for the CKB completion model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-242",
"text": "Knowledge generation There are several studies on the knowledge generation task that use neural network models."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-243",
"text": "For example, Hu et al. (2017) proposed an event prediction model that uses a sequence-to-sequence model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-244",
"text": "Prakash et al. (2016) and Li et al. (2017) proposed a paraphrase generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-245",
"text": "These studies targeted only specific relationships and did not explicitly incorporate relations into the generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-246",
"text": "Our CKB generation model explicitly incorporates relation information into the decoder and can model multiple relationships in one model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-247",
"text": "----------------------------------"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-248",
"text": "**CONCLUSION**"
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-249",
"text": "We proposed a new CKB generation task and joint learning method of CKB completion and generation."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-250",
"text": "Experimental results with two commonsense datasets demonstrated that our model has two strengths: it improves the coverage of the knowledge bases."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-251",
"text": "While conventional completion tasks are limited to verifying given triples, our generative model can create new knowledge including new phrases that are not in the knowledge bases."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-252",
"text": "Second, our completion model can improve the verification accuracy."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-253",
"text": "Two characteristics of our completion model contribute to this improvement: (i) the model shares the hidden layers, word embedding, and relation embedding with the generation model to acquire good phrase and relation representations, and (ii) it can be trained with the augmentation data created by the generation model."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-254",
"text": "In this study, we did not utilize raw text information such as from Wikipedia during training except for pre-trained word embeddings."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-255",
"text": "We would like to extend our method so that it can incorporate raw text information."
},
{
"sent_id": "e0b72115e1905226d22876e72aa304-C001-256",
"text": "Moreover, we would like to develop a method that effectively utilizes this commonsense knowledge for other NLP tasks that need commonsense reasoning."
}
],
"y": {
"@SIM@": {
"gold_contexts": [
[
"e0b72115e1905226d22876e72aa304-C001-53"
],
[
"e0b72115e1905226d22876e72aa304-C001-141"
],
[
"e0b72115e1905226d22876e72aa304-C001-213"
]
],
"cite_sentences": [
"e0b72115e1905226d22876e72aa304-C001-53",
"e0b72115e1905226d22876e72aa304-C001-141",
"e0b72115e1905226d22876e72aa304-C001-213"
]
},
"@DIF@": {
"gold_contexts": [
[
"e0b72115e1905226d22876e72aa304-C001-53",
"e0b72115e1905226d22876e72aa304-C001-54"
],
[
"e0b72115e1905226d22876e72aa304-C001-150",
"e0b72115e1905226d22876e72aa304-C001-151"
]
],
"cite_sentences": [
"e0b72115e1905226d22876e72aa304-C001-53",
"e0b72115e1905226d22876e72aa304-C001-150"
]
},
"@BACK@": {
"gold_contexts": [
[
"e0b72115e1905226d22876e72aa304-C001-58"
],
[
"e0b72115e1905226d22876e72aa304-C001-65"
],
[
"e0b72115e1905226d22876e72aa304-C001-234"
]
],
"cite_sentences": [
"e0b72115e1905226d22876e72aa304-C001-58",
"e0b72115e1905226d22876e72aa304-C001-234"
]
},
"@EXT@": {
"gold_contexts": [
[
"e0b72115e1905226d22876e72aa304-C001-63"
]
],
"cite_sentences": []
},
"@USE@": {
"gold_contexts": [
[
"e0b72115e1905226d22876e72aa304-C001-102"
],
[
"e0b72115e1905226d22876e72aa304-C001-137"
],
[
"e0b72115e1905226d22876e72aa304-C001-141"
],
[
"e0b72115e1905226d22876e72aa304-C001-202"
],
[
"e0b72115e1905226d22876e72aa304-C001-235"
]
],
"cite_sentences": [
"e0b72115e1905226d22876e72aa304-C001-102",
"e0b72115e1905226d22876e72aa304-C001-137",
"e0b72115e1905226d22876e72aa304-C001-141",
"e0b72115e1905226d22876e72aa304-C001-202",
"e0b72115e1905226d22876e72aa304-C001-235"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"e0b72115e1905226d22876e72aa304-C001-123"
]
],
"cite_sentences": [
"e0b72115e1905226d22876e72aa304-C001-123"
]
}
}
},
"ABC_abc19723df6670960705eadbaa6c13_4": {
"x": [
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-2",
"text": "Abstract."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-3",
"text": "This paper describes a novel method for a word sense disambiguation that utilizes relatives (i.e. synonyms, hypernyms, meronyms, etc in WordNet) of a target word and raw corpora."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-4",
"text": "The method disambiguates senses of a target word by selecting a relative that most probably occurs in a new sentence including the target word."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-5",
"text": "Only one cooccurrence frequency matrix is utilized to efficiently disambiguate senses of many target words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-6",
"text": "Experiments on several English datum present that our proposed method achieves a good performance."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-9",
"text": "With its importance, a word sense disambiguation (WSD) has been known as a very important field of a natural language processing (NLP) and has been studied steadily since the advent of NLP in the 1950s."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-10",
"text": "In spite of the long study, few WSD systems are used for practical NLP applications unlike part-of-speech (POS) taggers and syntactic parsers."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-11",
"text": "The reason is because most of WSD studies have focused on only a small number of ambiguous words based on sense tagged corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-12",
"text": "In other words, the previous WSD systems disambiguate senses of just a few words, and hence are not helpful for other NLP applications because of its low coverage."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-13",
"text": "Why have the studies about WSD stayed on the small number of ambiguous words?"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-14",
"text": "The answer is on sense tagged corpus where a few words are assigned to correct senses."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-15",
"text": "Since the construction of the sense tagged corpus needs a great amount of times and cost, most of current sense tagged corpora contain a small number of words less than 100 and the corresponding senses to the words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-16",
"text": "The corpora, which have sense information of all words, have been built recently, but are not large enough to provide sufficient disambiguation information of the all words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-17",
"text": "Therefore, the methods based on the sense tagged corpora have difficulties in disambiguating senses of all words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-18",
"text": "In this paper, we proposed a novel WSD method that requires no sense tagged corpus 1 and that identifies senses of all words in sentences or documents, not a small number of words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-19",
"text": "Our proposed method depends on raw corpus, which is relatively very large, and on WordNet [1] , which is a lexical database in a hierarchical structure."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-21",
"text": "**RELATED WORKS**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-22",
"text": "There are several works for WSD that do not depend on a sense tagged corpus, and they can be classified into three approaches according to main resources used: raw corpus based approach [2] , dictionary based approach [3, 4] and hierarchical lexical database approach."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-23",
"text": "The hierarchical lexical database approach can be reclassified into three groups according to usages of the database: gloss based method [5] , conceptual density based method [6, 7] and relative based method [8, 9, 10] ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-24",
"text": "Since our method is a kind of the relative based method, this section describes the related works of the relative based method."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-25",
"text": "[8] introduced the relative based method using International Roget's Thesaurus as a hierarchical lexical database."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-26",
"text": "His method is conducted as follows: 1) Get relatives of each sense of a target word from the Roget's Thesaurus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-27",
"text": "2) Collect example sentences of the relatives, which are representative of each sense."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-28",
"text": "3) Identify salient words in the collective context and determine weights for each word."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-29",
"text": "4) Use the resulting weights to predict the appropriate sense for the target word occurring in a novel text."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-30",
"text": "He evaluated the method on 12 English nouns, and showed over than 90% precision."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-31",
"text": "However, the evaluation was conducted on just a small part of senses of the words, not on all senses of them."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-32",
"text": "He indicated that a drawback of his method is on the ambiguous relative: just one sense of the ambiguous relative is usually related to a target word but the other senses of the ambiguous relatives are not."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-33",
"text": "Hence, a collection of example sentences of the ambiguous relative includes the example sentences irrelevant to the target word, which prevent WSD systems from collecting correct WSD information."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-34",
"text": "For example, an ambiguous word rail is a relative of a meaning bird of a target word crane at WordNet, but the word rail means railway for the most part, not the meaning related to bird."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-35",
"text": "Therefore, most of the example sentences of rail are not helpful for WSD of crane."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-36",
"text": "His method has another problem in disambiguating senses of a large number of target words because it requires a great amount of time and storage space to collect example sentences of relatives of the target words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-37",
"text": "[9] followed the method of [8] , but tried to resolve the ambiguous relative problem by using just unambiguous relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-38",
"text": "That is, the ambiguous relative rail is not utilized to build a training data of the word crane because the word rail is ambiguous."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-39",
"text": "Another difference from [8] is on a lexical database: they utilized WordNet as a lexical database for acquiring relatives of target words instead of International Roget's Thesaurus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-40",
"text": "Since WordNet is freely available for research, various kinds of WSD studies based on WordNet can be compared with the method of [9] ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-41",
"text": "They evaluated their method on 14 ambiguous nouns and achieved a good performance comparable to the methods based on the sense tagged corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-42",
"text": "However, the evaluation was conducted on a small part of senses of the target words like [8] ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-43",
"text": "However, many senses in WordNet do not have unambiguous relatives through relationships such as synonyms, direct hypernyms, and direct hyponyms."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-44",
"text": "2 A possible alternative is to use the unambiguous relatives in the long distance from a target word, but the way is still problematic because the longer the distance of two senses is, the weaker the relationship between them is."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-45",
"text": "In other words, the unambiguous relatives in the long distance may provide irrelevant examples for WSD like ambiguous relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-46",
"text": "Hence, the method has difficulties in disambiguating senses of words that do not have unambiguous relatives near the target words in the WordNet."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-47",
"text": "The problem becomes more serious when verbs, which most of the relatives are ambiguous, are disambiguated."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-48",
"text": "Like [8] , the method also has a difficulty in disambiguating senses of many words because the method collects the example sentences of relatives of many words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-49",
"text": "[10] reimplemented the method of [9] using a web, which may be a very large corpus, in order to collect example sentences."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-50",
"text": "They built training datum of all noun words in WordNet whose size is larger than 7GB, but evaluated their method on a small number of nouns of lexical sample task of SENSEVAL-2 as [8] and [9] ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-52",
"text": "**WORD SENSE DISAMBIGUATION BY RELATIVE SELECTION**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-53",
"text": "Our method disambiguates senses of a target word in a sentence by selecting only a relative among the relatives of the target word that most probably occurs in the sentence."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-54",
"text": "A flowchart of our method is presented in Figure 1 with an example 3 : 1) Given a new sentence including a target word, a set of relatives of the target word is created by looking up in WordNet."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-55",
"text": "2) Next, the relative that most probably occurs in the sentence is chosen from the set."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-56",
"text": "In this step, cooccurrence frequencies between relatives and words in the sentence are used in order to calculate the probabilities of relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-57",
"text": "Our method does not depend on the training data, but on co-occurrence frequency matrix."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-58",
"text": "Hence in our method, it is not necessary to build the training data, which requires too much time and space."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-59",
"text": "3) Finally, a sense of the target word is determined as the sense that is related to the selected relative."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-60",
"text": "In this example, the relative stork is selected with the highest probability and the proper sense is determined as crane#1, which is related to the selected relative stork."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-61",
"text": "Our method makes use of ambiguous relatives as well as unambiguous relatives unlike [9] and hence overcomes the shortage problem of relatives and also reduces the problem of ambiguous relatives in [8] by handling relatives separately instead of putting example sentences of the relatives together into a pool."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-63",
"text": "**RELATIVE SELECTION**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-64",
"text": "The selected relative of the i-th target word tw i in a sentence C is defined to be the relative of tw i that has the largest co-occurrence probability with the words in the sentence:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-65",
"text": "where SR is the selected relative, r ij is the j-th relative of tw i , S rij is a sense of tw i that is related to the relative r ij , and W is a weight of r ij ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-66",
"text": "The right hand side of Eq. 1 is logarithmically calculated by Bayesian rule:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-67",
"text": "The first probability in Eq. 2 is computed under the assumption that words in C occur independently as follows:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-68",
"text": "where w k is the k-th word in C and n is the number of words in C. The probability of w k given r ij is calculated:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-69",
"text": "where P (r ij , w k ) is a joint probability of r ij and w k , and P (r ij ) is a probability of r ij ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-70",
"text": "Other probabilities in Eq. 2 and 4 are computed as follows:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-71",
"text": "where f req(r ij , w k ) is the frequency that r ij and w k co-occur in a raw corpus, f req(r ij ) is the frequency of r ij in the corpus, and CS is a corpus size, which is the sum of frequencies of all words in the raw corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-72",
"text": "W Nf(S rij ) and W Nf(tw i ) is the frequency of a sense related to r ij and tw i in WordNet."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-73",
"text": "4 In Eq. 7, 0.5 is a smoothing factor and n is the number of senses of tw i ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-74",
"text": "Finally, in Eq. 2, the weights of relatives, W (r ij , tw i ), are described in following Section 3.1."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-75",
"text": "Relative Weight."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-76",
"text": "WordNet provides relatives of words, but all of them are not useful for WSD."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-77",
"text": "That is to say, it is clear that most of ambiguous relatives may bring about a problem by providing example sentences irrelevant to the target word to WSD system as described in the previous section."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-78",
"text": "However, WordNet as a lexical database is classified as a fine-grained dictionary, and consequently some words are classified into ambiguous words though the words represent just one sense in the most occurrences."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-79",
"text": "Such ambiguous relatives may be useful for WSD of target words that are related to the most frequent senses of the ambiguous relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-80",
"text": "For example, a relative bird of a word crane is an ambiguous word, but it usually represents one meaning, \"warm-blooded egglaying vertebrates characterized by feathers and forelimbs modified as wings\", which is closely related to crane."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-81",
"text": "Hence, the word bird can be a useful relative of the word crane though the word bird is ambiguous. But the ambiguous relative is not useful for other target words that are related to the least frequent senses of the relatives: that is, a relative bird is never helpful to disambiguate the senses of a word birdie, which is related to the least frequent sense of the relative bird."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-82",
"text": "We employ a weighting scheme for relatives in order to identify useful relatives for WSD."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-83",
"text": "In terms of weights of relatives, our intent is to provide the useful relative with high weights, but the useless relatives with low weights."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-84",
"text": "For instance, a relative bird of a word crane has a high weight whereas a relative bird of a word birdie get a low weight."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-85",
"text": "For the sake of the weights, we calculate similarities between a target word and its relatives and determine the weight of each relative based on the degree of the similarity."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-86",
"text": "Among similarity measures between words, the total divergence to the mean (TDM) is adopted, which is known as one of the best similarity measures for word similarity [11] ."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-87",
"text": "Since TDM estimates a divergence between vectors, not between words, words have to be represented by vectors in order to calculate the similarity between the words based on the TDM."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-88",
"text": "We define vector elements as words that occur more than 10 in a raw corpus, and build vectors of words by counting co-occurrence frequencies of the words and vector elements."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-89",
"text": "TDM does measure the divergence between words, and hence a reciprocal of the TDM measure is utilized as the similarity measure:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-90",
"text": "where Sim("
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-92",
"text": "**CO-OCCURRENCE FREQUENCY MATRIX**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-93",
"text": "In order to select a relative for a target word in a given sentence, we must calculate probabilities of relatives given the sentence, as described in previous section."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-94",
"text": "These probabilities as Eq. 5 and 6 can be estimated based on frequencies of relatives and co-occurrence frequencies between each relative and each word in the sentence."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-95",
"text": "In order to acquire the frequency information for calculating the probabilities, the previous relative based methods constructed a training data by collecting example sentences of relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-96",
"text": "However, to construct the training data requires a great amount of time and storage space."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-97",
"text": "What is worse, it is an awful work to construct training datum of all ambiguous words, whose number is over than 20,000 in WordNet."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-98",
"text": "Instead, we build a co-occurrence frequency matrix (CFM) from a raw corpus that contains frequencies of words and word pairs."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-99",
"text": "A value in the i-th row and j-th column in the CFM represents the co-occurrence frequency of the i-th word and j-th word in a vocabulary, and a value in the i-th row and the i-th column represents the frequency of the i-th word."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-100",
"text": "The CFM is easily built by counting words and word pairs in a raw corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-101",
"text": "Furthermore, it is not necessary to make a CFM per each ambiguous word since a CFM contains frequencies of all words including relatives and word pairs."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-102",
"text": "Therefore, our proposed method disambiguates senses of all ambiguous words efficiently by referring to only one CFM."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-103",
"text": "The frequencies in Eq. 5 and 6 can be obtained through a CFM as follows:"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-104",
"text": "where w i is a word, and cf m(i, j) represents the value in the i-th row and j-th column of the CFM, in other word, the frequency that the i-th word and j-th word co-occur in a raw corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-106",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-108",
"text": "**EXPERIMENTAL ENVIRONMENT**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-109",
"text": "Experiments were carried out on several English sense tagged corpora: SemCor and corpora for both lexical sample task and all words task of both SENSEVAL-2 & -3."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-110",
"text": "5 SemCor [12] 6 is a semantic concordance, where all content words (i.e. noun, verb, adjective, and adverb) are assigned to WordNet senses."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-111",
"text": "SemCor consists of three parts: brown1, brown2 and brownv."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-112",
"text": "We used all of the three parts of the SemCor for evaluation."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-113",
"text": "In our method, raw corpora are utilized in order to build a CFM and to calculate similarities between words for the sake of the weights of relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-114",
"text": "We adopted Wall Street Journal corpus in Penn Treebank II [13] and LATIMES corpus in TREC as raw corpora, which contain about 37 million word occurrences."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-115",
"text": "Our CFM contains frequencies of content words and content word pairs."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-116",
"text": "In order to identify the content words from the raw corpus, Tree-Tagger [14] , which is a kind of automatic POS taggers, is employed."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-117",
"text": "WordNet provides various kinds of relationships between words or synsets."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-118",
"text": "In our experiments, the relatives in Table 1 are utilized according to POSs of target words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-119",
"text": "In the table, hyper3 means 1 to 3 hypernyms (i.e. parents, grandparents and great-grandparent) and hypo3 is 1 to 3 hyponyms (i.e. children, grandchildren and great-grandchildren)."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-121",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-122",
"text": "Comparison with Other Relative Based Methods."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-123",
"text": "We tried to compare our proposed method with the previous relative based methods."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-124",
"text": "However, both of [8] and [9] did not evaluate their methods on a publicly available data."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-125",
"text": "We implemented their methods and compared our method with them on the same evaluation data."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-126",
"text": "When both of the methods are implemented, it is practically difficult to collect example sentences of all target words in the evaluation data."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-127",
"text": "Instead, we implemented the previous methods to work with our CFM."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-128",
"text": "WordNet was utilized as a lexical database to acquire relatives of target words and the sense disambiguation modules were implemented by using on Na\u00efve Bayesian classifier, which [9] adopted though [8] utilized International Roget's Thesaurus and other classifier similar to decision lists."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-129",
"text": "The bias of word senses provided by WordNet is also reflected in the implementation so that the previous methods are evaluated under the same conditions as our method."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-130",
"text": "Hence, the methods reimplemented in this paper are not exactly the same as the original methods, but their main ideas are preserved."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-131",
"text": "A correct sense of a target word tw_i in a sentence C is determined as follows:"
},
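{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-131a",
"text": "(A sketch of Eq. 10, assuming the Bayesian formulation implied by the surrounding definitions: Sense(tw_i, C) = argmax_{s_ij} P_wn(s_ij) P(C | s_ij).)"
},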
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-132",
"text": "where Sense(tw_i, C) is the sense of tw_i in C, and s_ij is the j-th sense of tw_i."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-133",
"text": "P_wn(s_ij) is the WordNet probability of s_ij."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-134",
"text": "The right-hand side of Eq. 10 is calculated logarithmically under the assumption that the words in C occur independently:"
},
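{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-134a",
"text": "(A sketch of Eq. 11 under this independence assumption: log P_wn(s_ij) + sum_{k=1}^{n} log P(w_k | s_ij), maximized over the senses s_ij.)"
},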
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-135",
"text": "where w_k is the k-th word in C and n is the number of words in C. The probabilities in Eq. 11 are calculated as follows:"
},
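{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-135a",
"text": "(Plausible forms of Eqs. 12 and 14, consistent with the definitions below: P(w_k | s_ij) = freq(s_ij, w_k) / freq(s_ij) and P_wn(s_ij) = WNf(s_ij) / WNf(tw_i); Eq. 13 presumably normalizes by the corpus size, e.g. P(w_k) = freq(w_k) / CS.)"
},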
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-136",
"text": "where freq(s_ij, w_k) is the frequency with which s_ij and w_k co-occur in a corpus, and freq(s_ij) is the frequency of s_ij in a corpus, which is the sum of the frequencies of all relatives related to s_ij."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-137",
"text": "CS denotes the corpus size, i.e. the sum of the frequencies of all words in the corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-138",
"text": "WNf(s_ij) and WNf(tw_i) are the frequencies of s_ij and tw_i in WordNet, respectively, which represent the bias of word senses."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-139",
"text": "Eq. 14 is the same as Eq. 7 in Section 3."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-140",
"text": "Since the training data in the previous works are built by collecting example sentences of relatives, the frequencies in Eqs. 12 and 13 are calculated with our matrix as follows:"
},
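{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-140a",
"text": "(A sketch of these matrix-based counts, assuming summation over the relatives r_l of s_ij: freq(s_ij, w_k) = sum_l freq(r_l, w_k) and freq(s_ij) = sum_l freq(r_l).)"
},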
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-141",
"text": "where r_l is a relative related to the sense s_ij."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-142",
"text": "freq(r_l, w_k) and freq(r_l) are the co-occurrence frequency between r_l and w_k and the frequency of r_l, respectively; both can be obtained by looking up the matrix, since it contains the frequencies of words and word pairs."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-143",
"text": "The main difference between [8] and [9] is whether ambiguous relatives are utilized or not."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-144",
"text": "Reflecting this difference, we implemented the method of [8] to include ambiguous relatives and the method of [9] to exclude them."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-145",
"text": "Table 3. Comparison results with the top 3 systems at SENSEVAL (columns: S2 LS, S2 ALL, S3 ALL)."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-146",
"text": "[15]: 40.2%, 56.9%, -."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-147",
"text": "[16]: 29.3%, 45.1%, -."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-148",
"text": "[5]: 24.4%, 32.8%, -."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-149",
"text": "[17]: -, -, 58.3%. [18]: -, -, 54.8%. [19]: -, -, 48.1%. Our method: 40.94%, 45.12%, 51.35%. Table 2 shows the comparison results."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-150",
"text": "In the table, All Relatives and Unambiguous Relatives represent the results of the reimplemented methods of [8] and [9], respectively."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-151",
"text": "The table shows that our proposed method achieves better performance than the previous methods on all evaluation data, though the improvement is not large."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-152",
"text": "Hence, our method appears to handle relatives, in particular ambiguous relatives, more effectively than [8] and [9]."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-153",
"text": "Compared with [9], [8] obtains better performance, and the difference between them is more than 15% on all of the evaluation data."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-154",
"text": "These comparison results suggest that it is desirable to utilize ambiguous relatives as well as unambiguous relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-155",
"text": "[10] evaluated their method on nouns of lexical sample task of SENSEVAL-2."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-156",
"text": "Their method achieved 49.8% recall."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-157",
"text": "When evaluated on the same nouns of the lexical sample task, our proposed method achieved 47.26%, the method of [8] 45.61%, and the method of [9] 38.03%."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-158",
"text": "Compared with our implementations, [10] utilized the web as a raw corpus, which is much larger than ours, and employed various kinds of features such as bigrams, trigrams, parts of speech, etc."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-159",
"text": "Therefore, it can be conjectured that the size of the raw corpus and the features play an important role in performance."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-160",
"text": "We observe that in our implementation of the method of [9], the data sparseness problem is severe, since unambiguous relatives are usually infrequent in the raw corpus."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-161",
"text": "On the web, the problem seems to be alleviated."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-162",
"text": "Further studies are required on the effects of various features."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-163",
"text": "Comparison with Systems that Participated in SENSEVAL."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-164",
"text": "We also compared our method with the top systems at SENSEVAL that did not use sense tagged corpora."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-165",
"text": "Table 3 shows the official results of the top 3 participating systems at SENSEVAL-2 & 3 and the experimental performance of our method."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-166",
"text": "The table shows that our method ranks among the top 3 systems."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-168",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-169",
"text": "We have proposed a simple and novel method that determines the senses of all content words in a sentence by selecting a relative of each target word in WordNet."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-170",
"text": "The relative is selected using the co-occurrence frequency between the relative and the words surrounding the target word in a given sentence."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-171",
"text": "The co-occurrence frequencies are obtained from a raw corpus, not from a sense-tagged corpus, which is often required by other approaches."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-172",
"text": "We tested the proposed method on SemCor data and SENSEVAL data, which are publicly available."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-173",
"text": "The experimental results show that the proposed method effectively disambiguates many ambiguous words in SemCor and in the test data for the SENSEVAL all-words task, as well as a small number of ambiguous words in the test data for the SENSEVAL lexical sample task."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-174",
"text": "Our method also disambiguates senses more correctly than [8] and [9]."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-175",
"text": "Furthermore, the proposed method achieved comparable performance with the top 3 ranked systems at SENSEVAL-2 & 3."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-176",
"text": "In consequence, our method has two advantages over the previous methods ([8] and [9]): it 1) handles ambiguous and unambiguous relatives more effectively, and 2) utilizes only one co-occurrence matrix for disambiguating all content words instead of collecting training data for each content word."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-177",
"text": "However, the absolute performance of our method is still not high."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-178",
"text": "One reason for the low performance is relatives that are irrelevant to the target words."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-179",
"text": "That is, an investigation of several instances assigned incorrect senses shows that relatives irrelevant to the target words are often selected as the most probable relatives."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-180",
"text": "Hence, we will try to devise a filtering method that removes useless relatives before the relative selection phase."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-181",
"text": "We also plan to investigate a large number of tagged instances in order to find out why our method did not perform much better than the previous works and to determine how it can select the correct relatives more precisely."
},
{
"sent_id": "abc19723df6670960705eadbaa6c13-C001-182",
"text": "Finally, we will conduct experiments with various features such as bigrams, trigrams, POS tags, etc., which [10] considered, and examine the relationship between the size of the raw corpus and system performance."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"abc19723df6670960705eadbaa6c13-C001-23"
],
[
"abc19723df6670960705eadbaa6c13-C001-37"
],
[
"abc19723df6670960705eadbaa6c13-C001-40"
],
[
"abc19723df6670960705eadbaa6c13-C001-49"
],
[
"abc19723df6670960705eadbaa6c13-C001-50"
],
[
"abc19723df6670960705eadbaa6c13-C001-128"
],
[
"abc19723df6670960705eadbaa6c13-C001-143"
],
[
"abc19723df6670960705eadbaa6c13-C001-153"
]
],
"cite_sentences": [
"abc19723df6670960705eadbaa6c13-C001-23",
"abc19723df6670960705eadbaa6c13-C001-37",
"abc19723df6670960705eadbaa6c13-C001-40",
"abc19723df6670960705eadbaa6c13-C001-49",
"abc19723df6670960705eadbaa6c13-C001-50",
"abc19723df6670960705eadbaa6c13-C001-128",
"abc19723df6670960705eadbaa6c13-C001-143",
"abc19723df6670960705eadbaa6c13-C001-153"
]
},
"@DIF@": {
"gold_contexts": [
[
"abc19723df6670960705eadbaa6c13-C001-61"
],
[
"abc19723df6670960705eadbaa6c13-C001-123",
"abc19723df6670960705eadbaa6c13-C001-124"
],
[
"abc19723df6670960705eadbaa6c13-C001-152"
],
[
"abc19723df6670960705eadbaa6c13-C001-157"
],
[
"abc19723df6670960705eadbaa6c13-C001-174"
],
[
"abc19723df6670960705eadbaa6c13-C001-176"
]
],
"cite_sentences": [
"abc19723df6670960705eadbaa6c13-C001-61",
"abc19723df6670960705eadbaa6c13-C001-124",
"abc19723df6670960705eadbaa6c13-C001-152",
"abc19723df6670960705eadbaa6c13-C001-157",
"abc19723df6670960705eadbaa6c13-C001-174",
"abc19723df6670960705eadbaa6c13-C001-176"
]
},
"@USE@": {
"gold_contexts": [
[
"abc19723df6670960705eadbaa6c13-C001-125"
],
[
"abc19723df6670960705eadbaa6c13-C001-144"
],
[
"abc19723df6670960705eadbaa6c13-C001-150"
],
[
"abc19723df6670960705eadbaa6c13-C001-160"
]
],
"cite_sentences": [
"abc19723df6670960705eadbaa6c13-C001-144",
"abc19723df6670960705eadbaa6c13-C001-150",
"abc19723df6670960705eadbaa6c13-C001-160"
]
}
}
},
"ABC_34346688a7e5166ee7b559ccbfe8e3_4": {
"x": [
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-2",
"text": "Polysynthetic languages pose a challenge for morphological analysis due to the root-morpheme complexity and to the word class \"squish\"."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-3",
"text": "In addition, many of these polysynthetic languages are low-resource."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-4",
"text": "We propose unsupervised approaches for morphological segmentation of low-resource polysynthetic languages based on Adaptor Grammars (AG) (Eskander et al., 2016) ."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-5",
"text": "We experiment with four languages from the Uto-Aztecan family."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-6",
"text": "Our AG-based approaches outperform other unsupervised approaches and show promise when compared to supervised methods, outperforming them on two of the four languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-9",
"text": "Computational morphology of polysynthetic languages is an emerging field of research."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-10",
"text": "Polysynthetic languages pose unique challenges for computational approaches, including machine translation and morphological analysis, due to the root-morpheme complexity and to word class gradations (Homola, 2011; Mager et al., 2018d; Klavans, 2018a)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-11",
"text": "Previous approaches include rule-based methods based on finite state transducers (Farley, 2009; Littell, 2018; Kazeminejad et al., 2017) , hybrid models (Mager et al., 2018b; Moeller et al., 2018) , and supervised machine learning, particularly deep learning approaches (Micher, 2017; Kann et al., 2018) ."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-12",
"text": "While each rule-based method is developed for a specific language (Inuktitut (Farley, 2009) or Arapaho (Littell, 2018; Moeller et al., 2018)), machine learning, including deep learning approaches, might be more rapidly scalable to many additional languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-13",
"text": "We propose an unsupervised approach for morphological segmentation of polysynthetic languages based on Adaptor Grammars (Johnson et al., 2007) ."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-14",
"text": "We experiment with four Uto-Aztecan languages: Mexicanero (MX), Nahuatl (NH), Wixarika (WX) and Yorem Nokki (YN) (Kann et al., 2018)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-15",
"text": "Adaptor Grammars (AGs) are nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFG), and have proven to be successful for unsupervised morphological segmentation, where a PCFG is a morphological grammar that specifies word structure (Johnson, 2008; Sirts and Goldwater, 2013; Eskander et al., 2016, 2018)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-16",
"text": "Our main goal is to examine the success of Adaptor Grammars for unsupervised morphological segmentation when applied to polysynthetic languages, where the morphology is synthetically complex (not simply agglutinative), and where resources are minimal."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-17",
"text": "We use the datasets introduced by Kann et al. (2018) in an unsupervised fashion (unsegmented words)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-18",
"text": "We design several AG learning setups: 1) use the best-on-average AG setup from Eskander et al. (2016) ; 2) optimize for language using just the small training vocabulary (unsegmented) and dev vocabulary (segmented) from Kann et al. (2018) ; 3) approximate the effect of having some linguistic knowledge; 4) learn from all languages at once and 5) add additional unsupervised data for NH and WX (Section 3)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-19",
"text": "We show that the AG-based approaches outperform other unsupervised methods, Morfessor (Creutz and Lagus, 2007) and MorphoChain (Narasimhan et al., 2015), and that for two of the languages (NH and YN), the best AG-based approaches outperform the best supervised methods (Section 4)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-21",
"text": "**LANGUAGES AND DATASETS**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-22",
"text": "Typically, polysynthetic languages demonstrate holophrasis, i.e. the ability of an entire sentence to be expressed as what is considered by native speakers to be just one word."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-23",
"text": "To illustrate, consider the following example from Inuktitut (Klavans, 2018b), where the morpheme -tusaa- is the root and all the other morphemes are synthetically combined with it in one unit:"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-24",
"text": "tusaa-tsia-runna-nngit-tu-alu-u-jung | hear-well-be.able-NEG-DOE-very-BE-PT.1S | 'I can't hear very well.'"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-25",
"text": "Another example from WX, one of the languages in the dataset for this paper (from (Mager et al., 2018c) ) shows this complexity:"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-26",
"text": "yu-huta-me ne-p+-we-iwa | an-two-ns 1sg:s-asi-2pl:o-brother | 'I have two brothers.'"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-27",
"text": "In linguistic typology, the broader gradient is: isolating/analytic to synthetic to polysynthetic."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-28",
"text": "Agglutinating refers to the clarity of boundaries between morphemes."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-29",
"text": "This more specific gradation is: agglutinating to mildly fusional to fusional."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-30",
"text": "Thus a language might be characterized overall as polysynthetic and agglutinating, i.e. generally a high number of morphemes per word, with clear boundaries between morphemes and thus easily segmentable."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-31",
"text": "Another language might be characterized as polysynthetic and fusional, so again, many morphemes per word, but many phonological and other processes so it is difficult to segment morphemes."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-32",
"text": "Thus, morphological analysis of polysynthetic languages is challenging due to the rootmorpheme complexity and to word class gradations."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-33",
"text": "Linguists recognize a gradience in word classes, known as \"squishiness\", a term first discussed in Ross (1972) who argued that, instead of a fixed, distinct inventory of syntactic categories, a quasi-continuum from verb, adjective and noun best reflects most lexical distinctions."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-34",
"text": "The root-morpheme complexity and the word class \"squish\" make it difficult to develop segmented training data with reliability across annotators."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-35",
"text": "Kann et al. (2018) have made a first step by releasing a small set of morphologically segmented datasets, although even in these carefully curated datasets the distinction between affix and clitic is not always indicated."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-36",
"text": "We use these datasets in an unsupervised fashion (i.e., we use the unsegmented words)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-37",
"text": "These datasets were taken from detailed descriptions in the Archive of Indigenous Languages collection for MX (Canger, 2001), NH (de Su\u00e1rez, 1980), WX (G\u00f3mez and L\u00f3pez, 1999), and YN (Freeze, 1989)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-38",
"text": "They were constructed so that they include both segmentable and non-segmentable words. Unlike Kann et al. (2018), for training we do not use the segmented version of the data (our approach is unsupervised)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-39",
"text": "In addition to the datasets, for NH and WX we also have available the Bible (Christodouloupoulos and Steedman, 2015; Mager et al., 2018a ), which we consider for one of our experimental setups as additional training data."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-40",
"text": "In the dataset from Kann et al. (2018), the maximum number of morphemes per word is seven for MX with an average of 2.13; six for NH with an average of 2.2; ten for WX with an average of 3.3; and ten for YN with an average of 2.13."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-42",
"text": "**USING ADAPTOR GRAMMARS FOR POLYSYNTHETIC LANGUAGES**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-43",
"text": "An Adaptor Grammar is typically composed of a PCFG and an adaptor that adapts the probabilities of individual subtrees."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-44",
"text": "For morphological segmentation, a PCFG is a morphological grammar that specifies word structure, where AGs learn latent tree structures given a list of words."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-45",
"text": "In this paper, we experiment with the grammars and the learning setups proposed by Eskander et al. (2016) , which we outline briefly below."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-46",
"text": "Grammars."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-47",
"text": "We use the nine grammars from Eskander et al. (2016, 2018) that were designed based on three dimensions: 1) how the grammar models word structure (e.g., prefix-stem-suffix vs. morphemes), 2) the level of abstraction in nonterminals (e.g., compounds, morphemes and sub-morphemes) and 3) how the output boundaries are specified (see Table 2 for sample grammars)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-48",
"text": "Table 2: Sample grammar setups used by Eskander et al. (2016, 2018)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-49",
"text": "Compound = upper-level representation of the word as a sequence of compounds; Morph = affix/morpheme representation as a sequence of morphemes."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-50",
"text": "SubMorph (SM) = lower-level representation of characters as a sequence of sub-morphemes. \"+\" denotes one or more."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-51",
"text": "For example, the PrStSu+SM grammar models the word as a complex prefix, a stem and a complex suffix, where the complex prefix and suffix are composed of zero or more morphemes, and a morpheme is a sequence of sub-morphemes."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-52",
"text": "The boundaries in the output are based on the prefix, stem and suffix levels."
},
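{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-52a",
"text": "(A schematic of the PrStSu+SM grammar as described, with illustrative notation rather than the exact grammar file: Word -> Prefix Stem Suffix; Prefix -> Morph*; Suffix -> Morph*; Morph -> SubMorph+; SubMorph -> Char+.)"
},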
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-53",
"text": "Learning Settings."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-54",
"text": "The input to the learner is a grammar and a vocabulary of unsegmented words."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-55",
"text": "We consider the three learning settings in Eskander et al. (2016): Standard, Scholar-seeded Knowledge and Cascaded."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-56",
"text": "The Standard setting is language-independent and fully unsupervised, while in the Scholar-seeded-Knowledge setting, some linguistic knowledge (in the form of affixes taken from grammar books) is seeded into the grammar trees before learning takes place."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-57",
"text": "The Cascaded setting simulates the effect of seeding scholar knowledge in a language-independent manner by first running an AG of high precision to derive a set of affixes, and then seeding those affixes into the grammars."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-59",
"text": "**AG SETUPS FOR POLYSYNTHETIC LANGUAGES**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-60",
"text": "We experimented with several setups using AGs for unsupervised segmentation."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-61",
"text": "Language-Independent Morphological Segmenter."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-62",
"text": "LIMS is the best-on-average AG setup obtained by Eskander et al. (2016) when trained on six languages (English, German, Finnish, Estonian, Turkish and Zulu), which is the Cascaded PrStSu+SM configuration."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-63",
"text": "We use this AG setup for each of the four languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-64",
"text": "We refer to this system as AG_LIMS."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-65",
"text": "Best AG Configuration per Language."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-66",
"text": "In this experimental setup, we consider all nine grammars from Eskander et al. (2016) using both the Standard and the Cascaded approaches and choosing the one that is best for each polysynthetic language by training on the training set and evaluating on the development set."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-67",
"text": "We denote this system as AG_BestL."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-68",
"text": "Using Seeded Knowledge."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-69",
"text": "To approximate the effect of Scholar-seeded Knowledge in Eskander et al. (2016), we used the training set to derive affixes and used them as scholar-seeded knowledge added to the grammars (before learning takes place)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-70",
"text": "However, since affixes and stems are not distinguished in the training annotations from Kann et al. (2018) , we only consider the first and last morphemes that appear at least five times."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-71",
"text": "We call this setup AG_Scholar-BestL."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-72",
"text": "Multilingual Training."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-73",
"text": "Since the vocabulary in Kann et al. (2018) for each language is small, and the languages are from the same language family, one data augmentation approach is to train on all languages and then test on each language individually."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-74",
"text": "We call this setup AG_Multi."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-75",
"text": "Data Augmentation."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-76",
"text": "In this setup, we examine the performance of the best AG configuration per language (AG_BestL) when more data is available."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-77",
"text": "We merge the training corpus with the unique words in the New Testament of the Bible (train_Bible)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-78",
"text": "We run this only on NH and WX since the Bible text is only available for these two languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-79",
"text": "We denote this setup as AG_Aug."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-81",
"text": "**EVALUATION AND DISCUSSION**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-82",
"text": "We evaluate the different AG setups on the blind test set from Kann et al. (2018) and compare our AG approaches to state-of-the-art unsupervised systems as well as supervised models including the best supervised deep learning models from Kann et al. (2018) ."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-83",
"text": "As the metric, we use the segmentation-boundary F1-score, which is standard for this task (Virpioja et al., 2011) ."
},
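{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-83a",
"text": "(As standardly defined, boundary precision P is the fraction of predicted boundaries that are correct, recall R is the fraction of gold boundaries that are predicted, and F1 = 2PR / (P + R).)"
},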
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-84",
"text": "Evaluating different AG setups."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-85",
"text": "Table 3 shows the performance of our AG setups on the four languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-86",
"text": "The best AG setup learned for each of the four polysynthetic languages (AG_BestL) is the PrStSu+SM grammar using the Cascaded learning setup."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-87",
"text": "This is an interesting finding, as the Cascaded PrStSu+SM setup is in fact AG_LIMS, the best-on-average AG setup obtained by Eskander et al. (2016). Table 4: Best AG results compared to supervised approaches from Kann et al. (2018)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-88",
"text": "Bold indicates best scores."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-89",
"text": "WX and YN, respectively."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-90",
"text": "Seeding affixes into the grammar trees (AG_Scholar-BestL) improves the performance of the Cascaded PrStSu+SM setup only for MX and WX (additional absolute F1-scores of 0.023 and 0.019, respectively)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-91",
"text": "However, it does not help for NH, while it even decreases the performance on YN."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-92",
"text": "This occurs because AGs are able to recognize the main affixes in the Cascaded setup, while the seeded affixes were either redundant or in conflict with the automatically discovered ones."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-93",
"text": "The multilingual setup (AG_Multi) does not improve the performance on any of the languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-94",
"text": "This could be because the datasets are too small to generalize common patterns across languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-95",
"text": "Finally, augmenting with Bible text in the cases of NH and WX leads to an absolute F1-score increase of 0.015 for both languages when compared to AG_BestL."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-96",
"text": "There are two possible explanations for why we only see a slight increase when adding more data: 1) AGs are able to generalize from small data and 2) the added Bible data represents a domain that is different from those of the datasets we are experimenting with as only 4.8% and 9% of the words in the training sets from Kann et al. (2018) appear in the augmented data of NH and WX, respectively."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-97",
"text": "Overall, AG_BestL is the best setup for YN, AG_Scholar-BestL is the best setup for MX and WX, while AG_Aug is the best for NH."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-98",
"text": "Comparison with unsupervised baselines."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-99",
"text": "We consider Morfessor (Creutz and Lagus, 2007), a commonly-used toolkit for unsupervised morphological segmentation, and MorphoChain (Narasimhan et al., 2015), another unsupervised morphological system based on constructing morphological chains."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-100",
"text": "Our AG approaches significantly outperform both Morfessor and MorphoChain on all four languages, as shown in Table 3."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-101",
"text": "Comparison with supervised baselines."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-102",
"text": "To obtain an upper bound, we compare the best AG setup to the best supervised neural methods presented in Kann et al. (2018) for each language."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-103",
"text": "We consider their best multi-task approach (BestMTT) and the best data-augmentation approach (BestDA), using F1 scores from their Table 4 for each language."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-104",
"text": "In addition, we report the results on their other supervised baselines: a supervised seq-to-seq model (S2S) and a supervised CRF approach."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-105",
"text": "As can be seen in Table 4 , our unsupervised AG-based approaches outperform the best supervised approaches for NH and YN with absolute F1-scores of 0.010 and 0.012, respectively."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-106",
"text": "An interesting observation is that for YN we only used the words in the training set of Kann et al. (2018) (unsegmented) , without any data augmentation."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-107",
"text": "For MX and WX, the neural models from Kann et al. (2018) (BestMTT and BestDA), outperform our unsupervised AG-based approaches."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-108",
"text": "Error Analysis."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-109",
"text": "For the purpose of error analysis, we train our unsupervised segmentation on the training sets and perform the analysis of results on the output of the development sets based on our best unsupervised models AG BestL ."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-110",
"text": "Since there is no distinction between stems and affixes in the labeled data, we only consider the morphemes that appear at least three times in order to eliminate open-class morphemes in our statistics."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-111",
"text": "We first define the degree of ambiguity of a morpheme to be the percentage of times its sequence of characters does not form a segmentable morpheme when they appear in the training set."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-112",
"text": "We also define the degree of ambiguity of a language as the average degree of ambiguity of the morphemes in that language."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-113",
"text": "Table 5 shows the number of morphemes, average length of a morpheme (in characters) and the degree of morpheme Table 6 : Examples of correct and incorrect segmentation ambiguity in each language."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-114",
"text": "Looking at the two languages where our models perform worse than the supervised models, we notice that MX has the least number of morphemes, and our unsupervised methods tend to oversegment; WX has the highest degree of ambiguity with a large number of one-letter morphemes, which makes the task more challenging for unsupervised segmentation as opposed to the case of a supervised setup."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-115",
"text": "Analyzing all the errors that our AG-based models made across all languages, we noticed one, or a combination, of the following factors: a high degree of morpheme ambiguity, short morpheme length and/or low frequency of a morpheme."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-116",
"text": "Examples."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-117",
"text": "Table 6 shows some examples of correctly and incorrectly segmented words by our models (blue indicates correct morphemes while red are wrong ones)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-118",
"text": "For MX, our models fail to recognize ka as a correct affix 100% of the time due to its high degree of ambiguity (71.79%), while we often wrongly detect ro as an affix, most likely since ro tends to appear at the end of a word; our approaches tend to oversegment in such cases."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-119",
"text": "On the other hand, our method correctly identify ki as a correct affix 100% of the time since it appears frequently in the training data."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-120",
"text": "For NH, the morpheme tla has a high degree of ambiguity at 79.12%, which lead the model to fail in recognizing it as an affix (see an example in Table 6 )."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-121",
"text": "On the other hand, NH has a higher percentage of correctly recognized morphemes, due to their less ambiguous nature and higher frequency (such as ke, tl or mo)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-122",
"text": "For WX, a large portion of errors stem from one-letter morphemes that are highly ambiguous (e.g., u, a, e, m, n, p and r), in addition to having morphemes in the training set which are not frequent enough to learn from, such as ki,nua and wawi (see Table 6 )."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-123",
"text": "Examples of correct segmentation involve morphemes that are more frequent and less ambiguous (pe, p@ and ne)."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-124",
"text": "For YN, ambiguity is the main source of segmentation errors (e.g., wa, wi and \u00dfa).slight"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-126",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-127",
"text": "Unsupervised approaches based on Adaptor Grammars show promise for morphological segmentation of low-resource polysynthetic languages."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-128",
"text": "We worked with the AG grammars developed by Eskander et al. (2016 Eskander et al. ( , 2018 for languages that are not polysynthetic."
},
{
"sent_id": "34346688a7e5166ee7b559ccbfe8e3-C001-129",
"text": "We showed that even when using these approaches and very little data, we can obtain encouraging results, and that using additional unsupervised data is a promising path."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"34346688a7e5166ee7b559ccbfe8e3-C001-11"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-14"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-40"
]
],
"cite_sentences": [
"34346688a7e5166ee7b559ccbfe8e3-C001-11",
"34346688a7e5166ee7b559ccbfe8e3-C001-14",
"34346688a7e5166ee7b559ccbfe8e3-C001-40"
]
},
"@USE@": {
"gold_contexts": [
[
"34346688a7e5166ee7b559ccbfe8e3-C001-17"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-18"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-35",
"34346688a7e5166ee7b559ccbfe8e3-C001-36"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-38"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-82"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-106"
]
],
"cite_sentences": [
"34346688a7e5166ee7b559ccbfe8e3-C001-17",
"34346688a7e5166ee7b559ccbfe8e3-C001-18",
"34346688a7e5166ee7b559ccbfe8e3-C001-38",
"34346688a7e5166ee7b559ccbfe8e3-C001-82",
"34346688a7e5166ee7b559ccbfe8e3-C001-106"
]
},
"@MOT@": {
"gold_contexts": [
[
"34346688a7e5166ee7b559ccbfe8e3-C001-70"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-73",
"34346688a7e5166ee7b559ccbfe8e3-C001-74"
]
],
"cite_sentences": [
"34346688a7e5166ee7b559ccbfe8e3-C001-70",
"34346688a7e5166ee7b559ccbfe8e3-C001-73"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"34346688a7e5166ee7b559ccbfe8e3-C001-87"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-96"
],
[
"34346688a7e5166ee7b559ccbfe8e3-C001-102"
]
],
"cite_sentences": [
"34346688a7e5166ee7b559ccbfe8e3-C001-87",
"34346688a7e5166ee7b559ccbfe8e3-C001-96",
"34346688a7e5166ee7b559ccbfe8e3-C001-102"
]
},
"@DIF@": {
"gold_contexts": [
[
"34346688a7e5166ee7b559ccbfe8e3-C001-107"
]
],
"cite_sentences": [
"34346688a7e5166ee7b559ccbfe8e3-C001-107"
]
}
}
},
"ABC_fde7f77d4685e1c9ce32a82aed4683_4": {
"x": [
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-32",
"text": "In Section 3 qualitatively evaluate our model."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-2",
"text": "Many NLP applications require disambiguating polysemous words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-3",
"text": "Existing methods that learn polysemous word vector representations involve first detecting various senses and optimizing the sensespecific embeddings separately, which are invariably more involved than single sense learning methods such as word2vec."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-4",
"text": "Evaluating these methods is also problematic, as rigorous quantitative evaluations in this space is limited, especially when compared with single-sense embeddings."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-5",
"text": "In this paper, we propose a simple method to learn a word representation, given any context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-6",
"text": "Our method only requires learning the usual single sense representation, and coefficients that can be learnt via a single pass over the data."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-7",
"text": "We propose several new test sets for evaluating word sense induction, relevance detection, and contextual word similarity, significantly supplementing the currently available tests."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-8",
"text": "Results on these and other tests show that while our method is embarrassingly simple, it achieves excellent results when compared to the state of the art models for unsupervised polysemous word representation learning."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-9",
"text": "Our code and data are at https://github.com/dingwc/ multisense/"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-12",
"text": "Recent advances in word representation learning such as word2vec (Mikolov et al., 2013b) have significantly boosted the performance of numerous Natural Language Processing tasks (Mikolov et al., 2013b; Pennington et al., 2014; Levy et al., 2015) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-13",
"text": "Despite their empirical performance, the inherent one-vector-per-word setting limits its application on tasks that require contextual understanding due to the existence of polysemous words such as part-of-speech tagging and semantic relatedness (Li and Jurafsky, 2015) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-14",
"text": "To this end, various sense-specific word embeddings have been proposed to account for the contextual subtlety of language (Reisinger and Mooney, 2010b,a; Huang et al., 2012; Neelakantan et al., 2015; Tian et al., 2014; Li and Jurafsky, 2015; Arora et al., 2016) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-15",
"text": "A majority of these methods propose to learn multiple vectors for each word via clustering."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-16",
"text": "(Reisinger and Mooney, 2010b; Huang et al., 2012; Neelakantan et al., 2015) uses neural networks to learn cluster embeddings in order to matcha polysemous word with its correct sense embeddings."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-17",
"text": "Side information such as topical understanding (Liu et al., 2015b,a) or paralleled foreign language data (Guo et al., 2014; \u0160uster et al., 2016; Shyam et al., 2017) have also been exploited for clustering different meanings of multi-sense words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-18",
"text": "Another trend is to forgo word embeddings in favor of sentence or paragraph embeddings for specific tasks (? Kiros et al., 2015; Le and Mikolov, 2014) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-19",
"text": "While being more flexible and adaptive to context, all these approaches require sophisticated neural network structures and are problem specific, taking away the advantage offered by the unsupervised embedding approaches of single-sense embeddings."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-20",
"text": "This paper bridges this gap."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-21",
"text": "In this paper we propose a novel and extremely simple approach to learn sense-specific word embeddings."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-22",
"text": "The essence of our approach is to assign each word a global base vector and model the contextual embedding as a linear combination of its context base vectors."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-23",
"text": "Instead of a joint optimization to learn both base vector and combination weights, we propose to use the standard unisense word representation for the base vectors, and the (suitably normalized) word co-occurrence statistics as the linear combination weights; no further training computations are required in our approach."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-24",
"text": "We evaluate our approach on various tasks that require contextual understanding of words, combining existing and new test datasets and evaluation metrics: word-sense induction ( (Koeling et al., 2005; Bartunov et al., 2015) ), contextual word similarity ((Huang et al., 2012 ) and a new test set), and relevance detection ( (Arora et al., 2016) and a new test set)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-25",
"text": "To the best of our knowledge, no prior literature has provided a comprehensive evaluation of all these multisense-specific tasks."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-26",
"text": "Our simple, intuitive model retains almost all advantages offered by more complicated multisense embedding models, and often surpasses the performance of nonlinear \"deep\" models."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-27",
"text": "Our code and data are at https://github.com/ dingwc/multisense/ To summarize, the contributions of our paper are as follows:"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-28",
"text": "1. We propose an extremely simple model for learning polysemous word representations 2."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-29",
"text": "We propose several new larger test sets to evaluate polysemous word embeddings, supplementing those that already exist 3."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-30",
"text": "We perform extensive comparisons of our model to other widely used multisense models in the literature, and show that the simplicity of our model does not tradeoff performance"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-31",
"text": "The rest of the paper is organized as follows: in the next section, we introduce our model and provide a detailed explanation for obtaining the multisense word embeddings."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-58",
"text": "The w2v embeddings are trained using skipgram model with negative sampling."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-59",
"text": "We set the number of negative samples to be 10 and number of training epochs to be 15."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-60",
"text": "The C and W matrices are attached alongside our submission."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-61",
"text": "Before we present qualitative examples, our first observation of the resulting embedding of word-context pairs is that the embedding vectors of words in irrelevant contexts have very low norms."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-62",
"text": "In Fig. 1 we selected around 500 wordcontext pairs from a word-context-relevance evaluation dataset (see Section 4 for details) and plot the histogram of the contextual embedding vector norms."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-63",
"text": "In this evaluation data word-context pairs are labeled either relevant or irrelevant."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-64",
"text": "As depicted in Fig. 1 , the norm filtering effect is essential, given that unlike previous embeddings, our embeddings allow any word to act as context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-65",
"text": "In contrast, the multi-sense embeddings from prior literature ( This relevance filtering effect is advantageous in sentences where many neighboring words may not be describing at the query word."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-66",
"text": "However, the extreme decaying distribution of our method ) in the above figures can make it difficult to measure contextual word similarity using simply cosine distance, as it magnifies words with very small norm that had already been identified as irrelevant."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-67",
"text": "In the other extreme, using dot product overemphasizes common words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-68",
"text": "To mitigate this, we present a generalized similarity measure, with a tunable parameter \u03b1 that describes exactly how much the norm should be taken into account."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-70",
"text": "**SIMILARITY MEASURE OF OUR EMBEDDING**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-71",
"text": "We propose to measure the contextual similarity using the geometric mean of the cosine similarity and dot product, as a tradeoff of these two extremes:"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-72",
"text": "for 0 \u2264 \u03b1 \u2264 1."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-73",
"text": "Specifically, d(x, y; 1) = x T y and d(x, y; 0) is the cosine distance between x and y. Table 1 looks at the closest words to bank in its two contexts for different choice of \u03b1-distance measure."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-74",
"text": "For dot-product we see an overemphasis of popular words that are only marginally related (gently, steeply)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-75",
"text": "For cosine similarity, rare words are overpromoted (saxony, sacramento)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-76",
"text": "In general, \u03b1 = 0.75 and 0.9 can reasonably measure contextual similarity."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-77",
"text": "Table 1 : Closest words to bank in the context of finance and geology, for various choices of \u03b1 in (3)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-78",
"text": "Recall that \u03b1 = 0 is the dot product and \u03b1 = 1 is cosine similarity."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-79",
"text": "To show the effect of \u03b1 when only relevant (w, S) pairs are used, Figure 3 .2 plots the SCWS score (see Section 4; also (Huang et al., 2012) ) for varying \u03b1, for the top-performing embedding from each method."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-80",
"text": "Cos-distance works best for all embeddings."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-81",
"text": "However, for embeddings from (Huang et al., 2012) and (Neelakantan et al., 2015) , which all have norm \u2248 1, the choice of \u03b1 makes little difference."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-82",
"text": "On the other hand, using the embeddings from and our method, which both have highly varying norms, the choice of \u03b1 greatly affects performance."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-84",
"text": "**QUALITATIVE EXAMPLE**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-85",
"text": "Having explained the norm-filtering property of our approach and the \u03b1-distance measure in Eq. (3), we are now able to show a few qualitative examples of our model."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-86",
"text": "First, Table 2 shows closest words to bank in three different context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-87",
"text": "Here we used the GloVe embeddings as C and set \u03b1 = 0.9."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-88",
"text": "Contexts are selected from news articles about finance, weather, and sports (an irrelevant context)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-89",
"text": "The third case illustrates the filtering effect, with a norm that is an order of magnitude smaller than the first two."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-90",
"text": "Banco Santander of Spain said on Wednesday that its profit declined by nearly half in the second quarter on restructuring charges and a contribution to a fund to help finance bank bail ins in Europe."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-91",
"text": "norm 2.83 \u00b7 10 3 neighbors banco, santander, hsbc, brasil, barclays The Seine has continued to swell since the river burst its banks on Wed., raising alarms throughout the city."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-92",
"text": "As of 10pm on Friday its waters had reached 20 feet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-93",
"text": "The river was expected to crest on Sat."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-94",
"text": "morning at up to 21 feet and to remain at high levels throughout the weekend."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-95",
"text": "Next, we investigate the use of our embedding model on phrase embeddings, constructed for example by averaging the embeddings of all words in the phrase, with the phrase itself as the context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-96",
"text": "In Table 3 , we pick three well-known bi-grams (the is a stopword and ignored)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-97",
"text": "The bi-gram embedding is the average of either the unisense GloVe embeddings (U) or the multisense embedding (M) using our model."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-98",
"text": "The closest words to these embeddings are listed in Table 3 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-99",
"text": "We observe that the contextual phrase embeddings are able to pull out meanings having to do with the phrase as a whole, and not just the sum of its parts."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-100",
"text": "Finally, in Table 4 , we list the words with highest norms, when projected in a single word context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-101",
"text": "In all the cases we observe, high-norm words are highly relevant to the single context words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-102",
"text": "In the case of multisense words, a mixture of the different senses appear."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-103",
"text": "(e.g., chips have potato and pentium as relevant, keyboard has layout and harpsichord.)"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-105",
"text": "**EMPIRICAL STUDY**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-106",
"text": "In this section we validate the performance of our embedding approach on a wide range of tasks that explicitly require contextual understanding of words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-107",
"text": "In Table 5 we collect (to our knowledge) the extent of multisense quantitative evaluations (columns 3-6 and 11-13), and supplement them with new, larger test sets (columns 7-9)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-108",
"text": "All tests are provided in the attached dataset folder."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-109",
"text": "In order to keep the evaluation fair across all embeddings, any word that is not in the intersection of the vocabularies of all embeddings is removed from the tests; for the preexisting test sets, this results in slightly smaller test sets than those first proposed."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-110",
"text": "Similarity measure When irrelevant words are present, using \u03b1 = 0.9 is essential to leverage the norm distribution filtering effect."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-111",
"text": "However, in all standard word similarity tasks, only relevant words are used as comparison."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-112",
"text": "Therefore, in order to have our evaluations comparable with standard metrics, we keep \u03b1 = 1 (measuring cosine similarity for all tasks)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-113",
"text": "We compare our method against multisense approaches in (Huang et al., 2012; Neelakantan et al., 2015; ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-114",
"text": "In each case, we use their pre-trained model and choose the embedding of a target word that is closest to the context representation (as they suggest)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-115",
"text": "Since the code in (Huang et al., 2012) allows choosing various distance functions, we pick all and report the best scores."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-116",
"text": "For (Neelakantan et al., 2015; we use the cosine distance as recommended."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-117",
"text": "Overall Performance Table 5 shows that our method consistently outperforms (Huang et al., 2012; Neelakantan et al., 2015) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-118",
"text": "We note that ) is learned using additional supervision from the WordNet knowledge-base in clustering; therefore, it achieves comparably much higher scores in WSR and CWS tasks in which the evaluation is also based on WordNet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-119",
"text": "We now describe each task in detail."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-120",
"text": "Word-Context Relevance (WCR) This task is proposed in (Arora et al., 2016) and aims to detect when word-context pairs are relevant."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-121",
"text": "In (Huang et al., 2012; Neelakantan et al., 2015; , the relevance metric can be seen as the distance (cosine or Euclidean) between the query word and the context cluster center."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-122",
"text": "In our method, we rely on the filtering effect of W ij values to diminish the norm of words in irrelevant contexts; thus we propose the 2 -norms of the contextual embedding as the metric of relevance, where the target word is excluded from the context if present."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-123",
"text": "5 In all cases, the ability of this metric to capture relevancy is essential for the success of that embedding to be applied to real world corpora, where not all neighboring words are relevant."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-124",
"text": "The task is as follows."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-125",
"text": "We have available to us some databases of words and their related words, which we view as that word's relevant context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-126",
"text": "We create the ground-truth by setting the labels of related word-context pairs to be 1, and for randomly picked word pairs to be 0."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-127",
"text": "Specifically, R1 and R2 are constructed from the dataset in (Arora et al., 2016) , and R3 is a newly provided much larger test set, separately constructed from WordNet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-128",
"text": "In R1, the negative samples are created by keeping the word unchanged and sampling m = 10 random contexts."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-129",
"text": "In R2, the context is unchanged, and m = 10 random words are provided."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-130",
"text": "Note that for each word in each of the tests above, there is a single example with label 1."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-131",
"text": "In total, there are 137 words and 534 senses, with on average 6.98 context words per query word."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-132",
"text": "Some examples are provided in Table 4."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-133",
"text": "R3 is a new test set that significantly augmnents R1 and R3."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-134",
"text": "Here, we manually collect a set of 100 Table 5 : Summary of all quantitative tests and performance metric of our embedding approach against compared baselines."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-135",
"text": "For (Huang et al., 2012) different rows correspond to various types of distance used to get the contextual embedding."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-136",
"text": "For (Neelakantan et al., 2015) the row labels follow the terminology in the paper."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-137",
"text": "We highlight the top 3 results for each test and metric in the table."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-138",
"text": "Sp = Spearman correlation."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-139",
"text": "P@1 = Precision@1."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-140",
"text": "AP = Average Precision."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-141",
"text": "AUC = Area Under Curve."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-142",
"text": "Acc = Classification Accuracy."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-143",
"text": "polysemous words 6 and retrieve all their senses from WordNet (Leacock and Chodorow, 1998) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-144",
"text": "We combine the definitions, synonyms, and examples sentences in WordNet as the context for each sense."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-145",
"text": "We have 566 tests and for each test have m = 5 negative samples, with random words in unchanged context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-146",
"text": "In total, there are 1938 words, 3234 senses, and on average 7.88 contexts per word, with some examples provided in Table t-WCRR3."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-147",
"text": "For each valid pair, we measure the Spearman correlation (Sp.) between relevance metrics and ground-truth labels, as well as the Precision@1, i.e. the fraction of tests where the top item based on predicted scores is the valid pair."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-148",
"text": "The reported performance metrics are averaged over all valid word-context pairs."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-149",
"text": "To further visualize the performance of each embedding on this task, we plot the distribution of the relevance metric for relevant and Figure 4 for the compared methods."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-150",
"text": "It is clear that has the best separation, corresponding to the highest score in the WCR columns of Table 5."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-151",
"text": "This can partially be explained by the fact that the construction of the embedding of et al., 2012) , which contains 2003 tests, each consisting of two word-context pairs (w 1 , S 1 ) and (w 2 , S 2 )."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-152",
"text": "At all times, S 1 is relevant to w 1 , and S 2 to w 2 , but w 1 in the context of S 1 may not be synonymous with w 2 in the context of S 2 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-153",
"text": "An example is given in Table 4 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-154",
"text": "In our evaluation, we first prune the test (Neelakantan et al., 2015) texttt300d 10s 1.32c\u03bb 0mv embedding (top), (Huang et al., 2012) cossim embedding (middle), and (Chen et al., 2014) (bottom) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-155",
"text": "The amount of separation between the two distributions seems correlated with the success of the embedding on the WCR tasks."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-156",
"text": "In comparison, our methoduses vector norms to distinguish relevance, the separation of which is shown in Figure 1 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-157",
"text": "set to only include words present in vocabularies available to all embeddngs."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-158",
"text": "Following (Huang et al., 2012) , we sort all the n = 2003 test pairs based on predicted similarity score and compare such ranking against the ground-truth ranking indicated by the average human evaluation score."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-159",
"text": "The distance between two rank-lists is measured using the Spearman correlation score."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-160",
"text": "Example of SCWS test for admission and confession ... the reason the Buddha himself gave was that the admission of women would weaken the Sangha and shorten its lifetime ... ... They included a confession said to have been inadvertently included on a computer disk that was given to the press... avg. human-given sim."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-161",
"text": "score: 2.3 Table 8 : An example of a pair of word-contexts for a single SCWS task."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-162",
"text": "We note that in (Huang et al., 2012 ) the similarity between two word-context pairs is the measured using avgSimC, a weighted average of cosine similarities between all possible representation vectors of w 1 and w 2 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-163",
"text": "This metric, however, can not be applied to our approach since we have an infinite number of possible contextual representation for each word."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-164",
"text": "Therefore, we use the cosine similarity without averaging, which is reasonable for all the embedding approaches."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-165",
"text": "We note that the cosine similarity is also used in (Neelakantan et al., 2015; Reisinger and Mooney, 2010a) Of course, this is disadvantageous for the embeddings of (Huang et al., 2012) , and our scores of their embedding are closer to that reported in (Neelakantan et al., 2015) , which also does not use averaging."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-166",
"text": "Our Contextual Word Similarity (CWS) We expand upon the SCWS test by providing our own larger CWS test, constructed mostly in an unsupervised manner based on WordNet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-167",
"text": "We retrieve a set of multisense words and their senses from WordNet as in WCR R3, with contexts as the concatenation of the definition and all example sentences."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-168",
"text": "The full list of tests are attached in the dataset folder, with an example in Table 4 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-169",
"text": "Note that compared to the SCWS test set, the contexts are much shorter and less noisy."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-170",
"text": "For each query word-context (w, S) pair, we attach a positive label to another (w , S ) if S = S and w, w are similar words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-171",
"text": "We collect negative samples as pairs (w , S ) if S = S, where either w = w or w and w are marked similar words by WordNet in the context S ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-172",
"text": "Given a query (w, S) pair, the goal is to rank the similar (positive) pairs above the dissimilar (negative) ones in the context of S. In all, we create a set of 3, 955 tests based on 154 polysemous words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-173",
"text": "We calculated the cosine similarity between the contextual embeddings of negative/positive samples and the query, and report scores in Table 5 ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-174",
"text": "Word-Sense Classification (WSC) Both WCR and CWS tasks are heavily based on WordNet, and offer an unfair advantage to multisense embeddings whose construction is also based on WordNet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-175",
"text": "In this sense, SCWS offers a more generalizable evaluation."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-176",
"text": "Two additional word-sense tests are and (Koeling et al., 2005; Bartunov et al., 2015) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-177",
"text": "Similar to the Word-Sense Induction (WSI) task provided by the same works, we devise a Word Sense Classification (WSC) task, to predict the correct sense of a polysemous word in a given sentence or paragraph."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-178",
"text": "We construct the tests from sense-labeled (w, S) pairs in (Koeling et al., 2005 ) (C1) and (Raganato et al., 2017) (C2) by merge all the training and test data and further remove rare senses with < 10 examples sentences."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-179",
"text": "Some examples of the C1 dataset are provided in Table 4 (C2 is very similar)."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-180",
"text": "In total, C1 contains 39 words, 116 senses, and 11,064 lines."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-181",
"text": "C2 is much bigger, containing 783 words, 5,188 senses, and 961,670 examples."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-182",
"text": "Given that such large datasets were already available, we did not need to create our own."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-183",
"text": "For each word we create 80 \u2212 20% train-test splits, train a K-NN multiclass classifier with Euclidean distance between contextual word embeddings, and report the mean classification accuracy (Acc) averaged across all words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-184",
"text": "Discussion We have provided a wide array of evaluations for measuring different aspects of multisense word embeddings, both collected from existing test sources and formed ourselves through WordNet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-185",
"text": "Overall, we find that our simple model performs surprisingly well on all evaluations, with the only consistent competitor a WordNet based model."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-186",
"text": "One thing to note is that the SCWS Spearman scores of the (Huang et al., 2012) listed here are much smaller than that first reported."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-187",
"text": "This is entirely attributed to the fact that we use direct cosine similarity between word embeddings, whereas they use an averaged similarity across their provided context words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-188",
"text": "Both are perfectly valid metrics; our choice is solely so that the identical metric can be applied to all embeddings, where this averaged similarity metric cannot be used."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-190",
"text": "**CONCLUSION**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-191",
"text": "In this paper, we developed a method that can yield contextual word embeddings for any word under any context."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-192",
"text": "When the context is irrelevant to the word, the embedding norms will be almost 0."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-193",
"text": "A key highlight of our method is the simplicity, both from the modeling and the learning point of view."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-194",
"text": "Experiments on several datasets and on several tasks show that the method we propose is competitive with the state of the art when it comes to unsupervised methods to learn polysemous word representations."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-33",
"text": "In Section 4 we introduce our new evaluation tasks and datasets in details."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-34",
"text": "We also perform extensive experiments on four quantitative tests that are multisense specific."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-35",
"text": "We finally conclude out paper in Section 5."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-37",
"text": "**OUR CONTEXTUAL EMBEDDING MODEL**"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-38",
"text": "Like unisense vectors, sense-specific vectors should be closely aligned to words in that sense."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-39",
"text": "This idea of local similarity has been widely used to obtain context sense representation Huang et al., 2012; Le and Mikolov, 2014; Neelakantan et al., 2015) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-40",
"text": "It was also used to decompose unisense vector into sense specific vectors (Arora et al., 2016) ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-41",
"text": "In this paper, we exploit this intuition and model the contextual embedding of a word as a linear combination of its contexts."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-42",
"text": "Specifically, we consider a corpus drawn from a vocabulary V = (word 1 , . . . , word V )."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-43",
"text": "We define the normalized cooccurence matrix as the V \u00d7 V (sparse 1 ) symmetric matrix W where"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-44",
"text": "# cooccurences of word i , word j freq. of word i \u00d7 freq. of word j ."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-45",
"text": "We define a context S as a collection of words provided alongside the target word."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-46",
"text": "The context is flexible."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-47",
"text": "It can be a sentence or a paragraph in which the word appeared, or a set of synonyms from WordNet."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-48",
"text": "A standard unisense embedding (such as word2vec (Mikolov et al., 2013b) ) can be represented as a d \u00d7 V matrix C, where d is the embedding dimension and the ith column of C is the embedding vector for the ith word in V. Then the multisense embedding of word i given context"
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-49",
"text": "and C j is the jth column of C. Take, for example, the word bank with context I must stop by the bank for a quick withdrawal."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-50",
"text": "The multisense embedding u is a weighted sum of the base embeddings of each context word."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-51",
"text": "Note that some words (withdrawal) are more relevant than others (need, stop, quick); the weight for each context word is the normalized co-occurance, which filters for relevant context words."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-52",
"text": "We can view the sets of columns of C as spanning sense subspaces."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-53",
"text": "For example, the (likely low dimensional) subspace spanned by the submatrix of C corresponding to vectors for financial terms should also be highly correlated with savings and much less correlated with river; in other words, the mutisense word bank provided in either context will be well separated."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-54",
"text": "Implementation In the remaining of the paper, we use W via (1), constructed from the 2016 English Wikipedia Dump 2 with a local window of size 5."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-55",
"text": "The final vocabulary results after filtering away non-English words, stop words, and rare words occurring under 2,000 times."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-56",
"text": "This results in a vocabulary of size V = 26974."
},
{
"sent_id": "fde7f77d4685e1c9ce32a82aed4683-C001-57",
"text": "For base vectors C we use either the pre-trained GLoVe embedding with d = 100 3 , or the word2vec (w2v) embedding trained over the wikipedia corpus with d = 50 and 100."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"fde7f77d4685e1c9ce32a82aed4683-C001-14"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-16"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-39"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-121"
]
],
"cite_sentences": [
"fde7f77d4685e1c9ce32a82aed4683-C001-14",
"fde7f77d4685e1c9ce32a82aed4683-C001-16",
"fde7f77d4685e1c9ce32a82aed4683-C001-39",
"fde7f77d4685e1c9ce32a82aed4683-C001-121"
]
},
"@USE@": {
"gold_contexts": [
[
"fde7f77d4685e1c9ce32a82aed4683-C001-24"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-158"
]
],
"cite_sentences": [
"fde7f77d4685e1c9ce32a82aed4683-C001-24",
"fde7f77d4685e1c9ce32a82aed4683-C001-158"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"fde7f77d4685e1c9ce32a82aed4683-C001-79"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-113"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-135"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-165"
]
],
"cite_sentences": [
"fde7f77d4685e1c9ce32a82aed4683-C001-79",
"fde7f77d4685e1c9ce32a82aed4683-C001-113",
"fde7f77d4685e1c9ce32a82aed4683-C001-135",
"fde7f77d4685e1c9ce32a82aed4683-C001-165"
]
},
"@DIF@": {
"gold_contexts": [
[
"fde7f77d4685e1c9ce32a82aed4683-C001-81",
"fde7f77d4685e1c9ce32a82aed4683-C001-82"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-117"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-162",
"fde7f77d4685e1c9ce32a82aed4683-C001-163"
],
[
"fde7f77d4685e1c9ce32a82aed4683-C001-186",
"fde7f77d4685e1c9ce32a82aed4683-C001-187"
]
],
"cite_sentences": [
"fde7f77d4685e1c9ce32a82aed4683-C001-81",
"fde7f77d4685e1c9ce32a82aed4683-C001-117",
"fde7f77d4685e1c9ce32a82aed4683-C001-162",
"fde7f77d4685e1c9ce32a82aed4683-C001-186"
]
},
"@MOT@": {
"gold_contexts": [
[
"fde7f77d4685e1c9ce32a82aed4683-C001-115"
]
],
"cite_sentences": [
"fde7f77d4685e1c9ce32a82aed4683-C001-115"
]
}
}
},
"ABC_32e860cdf03df7f6cb58b7f9e85ac0_4": {
"x": [
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-2",
"text": "To be able to interact better with humans, it is crucial for machines to understand sound -a primary modality of human perception."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-3",
"text": "Previous works have used sound to learn embeddings for improved generic semantic similarity assessment."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-4",
"text": "In this work, we treat sound as a first-class citizen, studying downstream 6textual tasks which require aural grounding."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-5",
"text": "To this end, we propose sound-word2vec -a new embedding scheme that learns specialized word embeddings grounded in sounds."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-6",
"text": "For example, we learn that two seemingly (semantically) unrelated concepts, like leaves and paper are similar due to the similar rustling sounds they make."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-7",
"text": "Our embeddings prove useful in textual tasks requiring aural reasoning like text-based sound retrieval and discovering Foley sound effects (used in movies)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-8",
"text": "Moreover, our embedding space captures interesting dependencies between words and onomatopoeia and outperforms prior work on aurallyrelevant word relatedness datasets such as AMEN and ASLex."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-11",
"text": "Sound and vision are the dominant perceptual signals, while language helps us communicate complex experiences via rich abstractions."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-12",
"text": "For example, a novel can stimulate us to mentally construct the image of the scene despite having never physically perceived it."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-13",
"text": "Indeed, language has evolved to contain numerous constructs that help depict visual concepts."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-14",
"text": "For example, we can easily form the picture of a white, furry cat with blue eyes via."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-15",
"text": "a description of the cat in terms of its visual attributes (Lampert et al., 2009; Parikh and Grauman, 2011) ."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-16",
"text": "Need for Onomatopoeia."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-17",
"text": "However, how would one describe the auditory instantiation of cats?"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-18",
"text": "While a first thought might be to use audio descriptors like loud, shrill, husky etc."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-19",
"text": "as mid-level constructs or \"attributes\", arguably, it is difficult to precisely convey and comprehend sound through such language."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-20",
"text": "Indeed, Wake and Asahi (1998) find that humans first communicate sounds using \"onomatopoeia\" -words that are suggestive of the phonetics of sounds while having no explicit meaning e.g. meow, tic-toc."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-21",
"text": "When asked for further explanation of sounds, humans provide descriptions of potential sound sources or impressions created by the sound (pleasant, annoying, etc.) Need for Grounding in Sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-22",
"text": "While onomatopoeic words exist for commonly found concepts, a vast majority of concepts are not as perceptually striking or sufficiently frequent for us to come up with dedicated words describing their sounds."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-23",
"text": "Even worse, some sounds, say, musical instruments, might be difficult to mimic using speech."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-24",
"text": "Thus, for a large number of concepts there seems to be a gap between sound and its counterpart in language (Sundaram and Narayanan, 2006) ."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-25",
"text": "This becomes problematic in specific situations where we want to talk about the heavy tail of concepts and their sounds, or while describing a particular sound we want to create as an effect (say in movies)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-26",
"text": "To alleviate this, a common literary strategy is to provide metaphors to more relatable exemplars."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-27",
"text": "For example, when we say, \"He thundered angrily\", we compare the person's angry speech to the sound of thunder to convey the seriousness of the situation."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-28",
"text": "However, without this grounding in sound, thunder and anger both appear to be seemingly unrelated concepts in terms of semantics."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-29",
"text": "Contributions."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-30",
"text": "In this work, we learn embeddings to bridge the gap between sound and its counterpart in language."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-31",
"text": "We follow a retrofitting strategy, capturing similarity in sounds associated with words, while using distributional semantics (from word2vec) to provide smoothness to the embeddings."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-32",
"text": "Note that we are not interested in capturing phonetic similarity, but the grounding in sound of the concept associated with the word (say \"rustling\" of leaves and paper.) We demonstrate the effectiveness of our embeddings on three downstream tasks that require reasoning about related aural cues: 1. Text-based sound retrieval -Given a textual query describing the sound and a database containing sounds and associated textual tags, we retrieve sound samples by matching text (Sec. 5.1) 2."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-33",
"text": "Foley Sound Discovery -Given a short phrase that outlines the technique of producing Foley sounds 1 , we discover other relevant words (objects or actions) which can produce similar sound effects (Sec. 5.2) 3."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-34",
"text": "Aurally-relevant word relatedness assessment on AMEN and ASLex (Kiela and Clark, 2015) (Sec. 5.3)"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-35",
"text": "We also qualitatively compare with word2vec to highlight the unique notions of word relatedness captured by imposing auditory grounding."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-37",
"text": "**RELATED WORK**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-38",
"text": "Audio and Word Embeddings."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-39",
"text": "Multiple works in the recent past (Bruni et al., 2014; Lazaridou et al., 2015; Lopopolo and van Miltenburg, 2015; Kiela and Clark, 2015; Kottur et al., 2016) have explored using perceptual modalities like vision and sound to learn language embeddings."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-40",
"text": "While Lopopolo and van Miltenburg (2015) show preliminary results on using sound to learn distributional representations, Kiela and Clark (2015) build on ideas from Bruni et al. (2014) to learn word embeddings that respect both linguistic and auditory relationships by optimizing a joint objective."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-41",
"text": "Further, they propose various fusion strategies to combine knowledge from both the modalities."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-42",
"text": "Instead, we \"specialize\" embeddings to exclusively respect relationships defined by sounds, while initializing with word2vec embeddings for smoothness."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-43",
"text": "Similar to previous findings (Melamud et al., 2016) , we observe that our specialized embeddings outperform language-only as well as other multi-modal embeddings in the downstream tasks of interest."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-44",
"text": "In an orthogonal and interesting direction, other recent works (Chung et al., 2016; He et al., 2016; Settle and Livescu, 2016) learn word representations based on similarity in their pronunciation and not the sounds associated with them."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-45",
"text": "In other words, phonetically similar words that have near identical pronunciations are brought closer in the embedding space (e.g., flower and flour)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-46",
"text": "Sundaram and Narayanan (2006) study the applicability of onomatopoeia to obtain semantically meaningful representations of audio."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-47",
"text": "Using a novel word-similarity metric and principal component analysis, they find representations for sounds and cluster them in this derived space to reason about similarities."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-48",
"text": "In contrast, we are interested in learning word representations that respect aural-similarity."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-49",
"text": "More importantly, our approach learns word representations for in a data-driven manner without having to first map the sound or its tags to corresponding onomatopoeic words."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-50",
"text": "Multimodal Learning with Surrogate Supervision."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-51",
"text": "Kottur et al. (2016) and Owens et al. (2016) use a surrogate modality to induce supervision to learn representations for a desired modality."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-52",
"text": "While the former learns word embeddings grounded in cartoon images, the latter learns visual features grounded in sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-53",
"text": "In contrast, we use sound as the surrogate modality to supervise representation learning for words."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-55",
"text": "**DATASETS**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-56",
"text": "Freesound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-57",
"text": "We use the freesound database (Font et al., 2013) , also used in prior work (Kiela and Clark, 2015; Lopopolo and van Miltenburg, 2015) to learn the proposed sound-word2vec embeddings."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-58",
"text": "Freesound is a freely available, collaborative dataset consisting of user uploaded sounds permitting reuse."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-59",
"text": "All uploaded sounds have human descriptions in the form of tags and captions in natural language."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-60",
"text": "The tags contain a broad set of relevant topics for a sound (e.g., ambience, electronic, birds, city, reverb) and captions describing the content of the sound, in addition to details pertaining to audio quality."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-61",
"text": "For the text-based sound retrieval task, we use a subset of 234,120 sounds from this database and divide it into training (80%), validation (10%) and testing splits (10%)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-62",
"text": "Further, for foley sound discovery, we aggregate descriptions of foley sound production provided by sound engineers (epicsound, accessed 23-Jan-2017; Singer, accessed 23-Jan-2017) to create a list of 30 foley sound pairs, which forms our ground truth for the task."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-63",
"text": "For example, the description to produce a foley \"driving on gravel\" sound is to record the \"crunching sound of plastic or polyethene bags\"."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-64",
"text": "AMEN and ASLex."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-65",
"text": "AMEN and ASLex (Kiela and Clark, 2015) are subsets of the standard MEN (Bruni et al., 2014) and SimLex (Hill et al., 2015) word similarity datasets consisting of word-pairs that \"can be associated with a distinctive associated sound\"."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-66",
"text": "We evaluate on this dataset for completeness to benchmark our approach against previous work."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-67",
"text": "However, we are primarily interested in the slightly different problem of relating words with similar auditory instantions that may or may not be semantically related as opposed to relating semantically similar words that can be associated with some common auditory signal."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-69",
"text": "**APPROACH**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-70",
"text": "We use the Freesound database to construct a dataset of tuples {s, T }, where s is a sound and T is the set of associated user-provided tags."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-71",
"text": "We then aim to learn an embedding space for the tags that respects auditory grounding using sound information as cross-modal context -similar to word2vec (Mikolov et al., 2013 ) that uses neighboring words as context / supervision."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-72",
"text": "We now explain our approach in detail."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-74",
"text": "**AUDIO FEATURES AND CLUSTERING.**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-75",
"text": "We represent each sound s by a feature vector consisting of the mean and variance of the following audio descriptors that are readily available as part of Freesound database:"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-76",
"text": "\u2022 Mel-Frequency Cepstral Co-efficients: This feature represents the short-term power spectrum of an audio and closely approximates the response of the human auditory system -computed as given in (Ganchev et al., 2005 )."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-77",
"text": "\u2022 Spectral Contrast: It is the magnitude difference Figure 1 : The model used to learn the proposed soundword2vec embeddings."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-78",
"text": "The projection matrix WP containing that is used as the sound-word2vec embedding is learned by training the model to accurately predict the cluster assignment of the sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-79",
"text": "in the peaks and valleys of the spectrum -computed according to (Akkermans et al., 2009 )."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-80",
"text": "\u2022 Dissonance: It measures the perceptual roughness of the sound (Plomp and Levelt, 1965 (Ricard, 2004) ."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-81",
"text": "We then use K-Means algorithm to cluster the sounds in this feature space to assign each sound to a cluster C(s) \u2208 {1, . . ."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-82",
"text": "K}. We set K to 30 by evaluating the performance of the embeddings on text-based audio-retrieval on the held out validation set."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-83",
"text": "Note that the clustering is only performed once, prior to representation learning described below."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-84",
"text": "Representation Learning."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-85",
"text": "We represent each tag t \u2208 T using a |V| dimensional one-hot encoding denoted by v t , where V is the set of all unique tags in the training set (the size of our dictionary)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-86",
"text": "This one-hot vector v t is projected into a Ddimensional vector space via W P \u2208 R |V|\u00d7D , the projection matrix."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-87",
"text": "This projection matrix computes the representation for each word in V. The idea of our approach is to use W_P to accurately predict cluster assignments (for sounds associated with words), which enforces grounding in sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-88",
"text": "For each data-point, we obtain the summary of the tags T by averaging the projections of all tags in the set as (1/|T|) \u2211_{t \u2208 T} W_P v_t."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-89",
"text": "We then transform the resulting summary representation via a linear layer (with parameters W_O) and pass the output through the softmax function to obtain a distribution p(c|T) over the K sound clusters."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-90",
"text": "We perform maximum-likelihood training for the correct cluster assignment C(s), optimizing over the parameters W_P and W_O: max_{W_P, W_O} log p(c = C(s) | T)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-91",
"text": "We use SGD with momentum to optimize this objective, which is essentially the cross-entropy between the cluster assignments and p(c|T)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-92",
"text": "We set D to 300 to be consistent with the publicly available word2vec embeddings."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-93",
"text": "Initialization."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-94",
"text": "We initialize W_P with word2vec embeddings (Mikolov et al., 2013) trained on the Google News corpus of \u223c3M words."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-95",
"text": "We fine-tune on a subset of 9,578 tags that are present in both the Freesound and Google News corpora, i.e., 55.68% of the original tags in the Freesound dataset."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-96",
"text": "This helps us remove noisy tags unrelated to the content of the sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-97",
"text": "In addition to enlarging the vocabulary, the pre-training helps induce smoothness in the sound-word2vec embeddings, allowing us to transfer semantics learnt from sounds to words that were not present as tags in the Freesound database."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-98",
"text": "Indeed, we find that word2vec pre-training helps improve performance (Sec. 5.3)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-99",
"text": "Our use of language embeddings as an initialization to fine-tune (specialize) from, as opposed to formulating a joint objective with language and audio context (Kiela and Clark, 2015), is driven by the fact that we are interested in embeddings for words grounded in sounds, not in better generic word similarity."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-101",
"text": "**RESULTS**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-102",
"text": "Ablations."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-103",
"text": "In addition to the language-only baseline word2vec (Mikolov et al., 2013), we compare against tag-word2vec, which predicts a tag using the other tags of the sound as context, inspired by (Font et al., 2014)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-104",
"text": "We also report results with a randomly initialized projection matrix (sound-word2vec(r)) to evaluate the effectiveness of pre-training with word2vec."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-105",
"text": "Prior work."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-106",
"text": "We compare against previous works Lopopolo and van Miltenburg (2015) and Kiela and Clark (2015) ."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-107",
"text": "While the former uses a standard bag-of-words and SVD pipeline to arrive at distributional representations for words, the latter trains under a joint objective that respects both linguistic and auditory similarity."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-108",
"text": "We use the openly available implementation of Lopopolo and van Miltenburg (2015), re-implement Kiela and Clark (2015), and train both on our dataset for a fair comparison of the methods."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-109",
"text": "In addition, we show a comparison to word-vectors released by (Kiela and Clark, 2015) in the supplementary material."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-110",
"text": "All approaches use an embedding size of 300 for consistency."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-112",
"text": "**TEXT-BASED SOUND RETRIEVAL**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-113",
"text": "Given a textual description of a sound as query, we compare it with tags associated with sounds in the database to retrieve the sound with the closest matching tags."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-114",
"text": "Note that this is a purely textual task, albeit one that needs awareness of sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-115",
"text": "In a sense, this task captures exactly what we want our model to do: bridge the semantic gap between language and sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-116",
"text": "We use the training split (Sec. 3) to learn the sound-word2vec vectors, validation to pick the number of clusters (K), and report results on the test split."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-117",
"text": "For retrieval, we represent sounds by averaging the learnt embeddings for the associated tags."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-118",
"text": "We embed the caption provided for the sound (in the Freesound database) in a similar manner, and use it as the query."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-119",
"text": "We then rank sounds based on the cosine similarity between the tag and query representations for retrieval."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-120",
"text": "We evaluate using standard retrieval metrics: Recall@{1, 10, 50, 100}. Note that the entire test set (\u224810k sounds) is present in the retrieval pool."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-121",
"text": "So, recall@100 corresponds to obtaining the correct result in the top 1% of the search results, which is a relatively stringent evaluation criterion."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-122",
"text": "Results."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-123",
"text": "Table 1 shows that our sound-word2vec embeddings outperform the baselines."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-124",
"text": "We see that specializing the embeddings for sound using our two-stage training outperforms prior work (Kiela and Clark (2015) and Lopopolo and van Miltenburg (2015)), neither of which performs such specialization."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-125",
"text": "Among our approaches, tag-word2vec performs second best; this is intuitive, since the tag distributions implicitly capture auditory relatedness (a sound may have the tags cat and meow), while word2vec and sound-word2vec(r) have the lowest performance."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-126",
"text": "[Table caption fragment: comparison with (Kiela and Clark, 2015); higher is better.]"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-127",
"text": "Our approach performs better than Kiela and Clark (2015) ."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-129",
"text": "**FOLEY SOUND DISCOVERY**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-130",
"text": "In this task, we evaluate how well embeddings identify matching pairs of target sounds (flapping bird wings) and descriptions of Foley sound production techniques (rubbing a pair of gloves)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-131",
"text": "Intuitively, one expects sound-aware word embeddings to do better at this task than sound-agnostic ones."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-132",
"text": "We set up a ranking task by constructing a set of original Foley sound pairs and decoy pairs formed by pairing the target description with every word in the vocabulary."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-133",
"text": "We rank using cosine similarity between the average word-vectors in each member of the pair."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-134",
"text": "A good embedding is one in which the original Foley sound pair has the lowest rank."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-135",
"text": "We use the mean rank of the Foley sound in the dataset for evaluation."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-136",
"text": "We transfer the embeddings from Sec. 5.1 to this task, without additional training."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-138",
"text": "**RESULTS.**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-139",
"text": "We find that sound-word2vec performs best, with a mean rank of 34.6, compared to the other baselines: tag-word2vec (38.9), sound-word2vec(r) (114.3) and word2vec (189.45)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-140",
"text": "As observed previously, the second best performing approach is tag-word2vec."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-141",
"text": "Lopopolo and van Miltenburg (2015) and Kiela and Clark (2015) perform worse than tag-word2vec, with mean ranks of 48.4 and 42.1 respectively."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-142",
"text": "Note that random chance gets a rank of (|V| + 1)/2 = 4789.5."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-144",
"text": "**EVALUATION ON AMEN AND ASLEX**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-145",
"text": "AMEN and ASLex (Kiela and Clark, 2015) are subsets of the MEN and SimLex-999 datasets for word relatedness grounded in sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-146",
"text": "From Table 2, we can see that our embeddings outperform (Kiela and Clark, 2015) on both AMEN and ASLex."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-147",
"text": "These datasets were curated by annotating concepts related by sound; however, we observe that aural relatedness is often confounded with semantic relatedness."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-148",
"text": "For example, (river, water) and (automobile, car) are marked as aurally related; however, they do not stand out as aurally related examples, since they are already semantically related."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-149",
"text": "In contrast, we are interested in how onomatopoeic words relate to regular words (Table 3) , which we study by explicit grounding in sound."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-150",
"text": "Thus while we show competitive performance on this dataset, it might not be best suited for studying the benefits of our approach."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-151",
"text": "----------------------------------"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-152",
"text": "**DISCUSSION AND CONCLUSION**"
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-153",
"text": "We show nearest neighbors in both the sound-word2vec and word2vec spaces (Table 3) to qualitatively demonstrate the unique dependencies captured through auditory grounding."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-154",
"text": "While word2vec maps a word (say, apple) to other semantically similar words (other fruits), similar 'sounding' words (chips) or onomatopoeia (munch) are closer in our embedding space."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-155",
"text": "Moreover, onomatopoeic words (say, boom and slam) are mapped to relevant objects (explosion and door)."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-156",
"text": "Interestingly, parts (e.g., lock, latch) and actions (closing) are also closer to the onomatopoeic query, exhibiting an understanding of the auditory scene."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-157",
"text": "Conclusion."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-158",
"text": "In this work we introduce a novel word embedding scheme that respects auditory grounding."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-159",
"text": "We show that our embeddings provide strong performance on text-based sound retrieval and Foley sound discovery, along with intuitive nearest neighbors for onomatopoeia -- textual tasks that require auditory reasoning."
},
{
"sent_id": "32e860cdf03df7f6cb58b7f9e85ac0-C001-160",
"text": "We hope our work motivates further efforts on understanding and relating onomatopoeic words to \"regular\" words."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-39"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-40"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-65"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-145"
]
],
"cite_sentences": [
"32e860cdf03df7f6cb58b7f9e85ac0-C001-39",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-40",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-65",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-145"
]
},
"@USE@": {
"gold_contexts": [
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-57"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-65",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-66"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-108"
]
],
"cite_sentences": [
"32e860cdf03df7f6cb58b7f9e85ac0-C001-57",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-65",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-108"
]
},
"@DIF@": {
"gold_contexts": [
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-99"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-124"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-127"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-139",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-140",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-141"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-146"
]
],
"cite_sentences": [
"32e860cdf03df7f6cb58b7f9e85ac0-C001-99",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-124",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-127",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-141",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-146"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-106"
],
[
"32e860cdf03df7f6cb58b7f9e85ac0-C001-109"
]
],
"cite_sentences": [
"32e860cdf03df7f6cb58b7f9e85ac0-C001-106",
"32e860cdf03df7f6cb58b7f9e85ac0-C001-109"
]
}
}
},
"ABC_ef6f1050651a4c3ac9a53438ac1f87_4": {
"x": [
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-3",
"text": "Predicting mental health from smartphone and social media data on a longitudinal basis has recently attracted great interest, with very promising results being reported across many studies [3, 9, 13, 26] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-4",
"text": "Such approaches have the potential to revolutionise mental health assessment, if their development and evaluation follow a real-world deployment setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-41",
"text": "**PROBLEM STATEMENT**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-5",
"text": "In this work we take a closer look at state-of-the-art approaches, using different mental health datasets and indicators, different feature sources and multiple simulations, in order to assess their ability to generalise."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-6",
"text": "We demonstrate that under a pragmatic evaluation framework, none of the approaches deliver or even approach the reported performances."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-7",
"text": "In fact, we show that current state-of-the-art approaches can barely outperform the most na\u00efve baselines in the real-world setting, posing serious questions not only about their deployment ability, but also about the contribution of the derived features for the mental health assessment task and how to make better use of such data in the future."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-10",
"text": "Establishing the right indicators of mental well-being is a grand challenge posed by the World Health Organisation [7] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-11",
"text": "Poor mental health is highly correlated with low motivation, lack of satisfaction, low productivity and a negative economic impact [20] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-12",
"text": "The current approach is to combine census data at the population level [19] , thus failing to capture well-being on an individual basis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-13",
"text": "The latter is only possible via self-reporting on the basis of established psychological scales, which are hard to acquire consistently on a longitudinal basis, and they capture long-term aggregates instead of the current state of the individual."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-14",
"text": "The widespread use of smartphones and social media offers new ways of assessing mental well-being, and recent research [1, 2, 3, 5, 9, 10, 13, 14, 22, 23, 26] has started exploring the effectiveness of these modalities for automatically assessing the mental health of a subject, reporting very high accuracy."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-15",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-16",
"text": "**ARXIV:1807.07351V1 [CS.CY] 19 JUL 2018**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-17",
"text": "What is typically done in these studies is to use features based on the subjects' smartphone logs and social media to predict some self-reported mental health index (e.g., \"wellbeing\", \"depression\" and others), which is provided either on a Likert scale or on the basis of a psychological questionnaire (e.g., PHQ-8 [12], PANAS [29], WEMWBS [25] and others)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-18",
"text": "Most of these studies are longitudinal, where data about individuals is collected over a period of time and predictions of mental health are made over a sliding time window."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-19",
"text": "Having such longitudinal studies is highly desirable, as it can allow fine-grained monitoring of mental health."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-20",
"text": "However, a crucial question is what constitutes an appropriate evaluation framework, in order for such approaches to be employable in a real world setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-21",
"text": "Generalisation to previously unobserved users can only be assessed via leave-N-users-out cross-validation setups, where typically, N is equal to one (LOUOCV, see Table 1 )."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-22",
"text": "However, due to the small number of subjects that are available, such generalisation is hard to achieve by any approach [13] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-23",
"text": "Alternatively, personalised models [3, 13] for every individual can be evaluated via a within-subject, leave-N-instances-out cross-validation (for N=1, LOIOCV), where an instance for a user u at time i is defined as a tuple {X_ui, y_ui} = {features(u, i), mental-health-score(u, i)}. In a real-world setting, a LOIOCV model is trained on some user-specific instances, aiming to predict the user's mental health state at some future time points."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-24",
"text": "Again, however, the limited number of instances per user makes such models unable to generalise well."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-25",
"text": "In order to overcome these issues, previous work [2, 5, 9, 10, 22, 26] has combined the instances {X_{u_j,i}, y_{u_j,i}} from different individuals u_j and performed evaluation using randomised cross-validation (MIXED)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-26",
"text": "While such approaches can attain optimistic performance, the corresponding models fail to generalise to the general population and also fail to ensure effective personalised assessment of the mental health state of a single individual."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-27",
"text": "In this paper we demonstrate the challenges that current state-of-the-art models face, when tested in a real-world setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-28",
"text": "We work on two longitudinal datasets with four mental health targets, using different features derived from a wide range of heterogeneous sources."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-29",
"text": "Following the state-of-the-art experimental methods and evaluation settings, we achieve very promising results, regardless of the features we employ and the mental health target we aim to predict."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-30",
"text": "However, when tested under a pragmatic setting, the performance of these models drops heavily, failing to outperform the most na\u00efve baselines (from a modelling perspective): majority voting, random classifiers, models trained on the identity of the user, etc."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-31",
"text": "This poses serious questions about the contribution of the features derived from social media, smartphones and sensors for the task of automatically assessing well-being on a longitudinal basis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-32",
"text": "Our goal is to flesh out, study and discuss such limitations through extensive experimentation across multiple settings, and to propose a pragmatic evaluation and model-building framework for future research in this domain."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-34",
"text": "**RELATED WORK**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-35",
"text": "Research in assessing mental health on a longitudinal basis aims to make use of relevant features extracted from various modalities, in order to train models for automatically predicting a user's mental state (target), either in a classification or a regression manner [1, 2, 3, 9, 10, 13, 26] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-36",
"text": "Examples of state-of-the-art work in this domain are listed in Table 2 , along with the number of subjects that was used and the method upon which evaluation took place."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-37",
"text": "Most approaches have used the \"MIXED\" approach to evaluate models [1, 2, 5, 9, 10, 22, 26] , which, as we will show, is vulnerable to bias, due to the danger of recognising the user in the test set and thus simply inferring her average mood score."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-38",
"text": "LOIOCV approaches that have not ensured that their train/test sets are independent are also vulnerable to bias in a realistic setting [3, 13] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-39",
"text": "From the works listed in Table 2 , only Suhara et al. [23] achieves unbiased results with respect to model generalisability; however, the features employed for their prediction task are derived from self-reported questionnaires of the subjects and not by automatic means."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-42",
"text": "We first describe three major problems stemming from unrealistic construction and evaluation of mental health assessment models and then we briefly present the state-of-the-art in each case, which we followed in our experiments."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-44",
"text": "**P1**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-45",
"text": "Training on past values of the target variable: This issue arises when the past N mood scores of a user are required to predict his/her next mood score in an autoregressive manner."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-46",
"text": "Since such an approach requires the previous N scores from past mood forms, its ability to generalise without continuous manual user input is limited."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-47",
"text": "This makes it impractical for a real-world scenario."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-48",
"text": "Most importantly, it is difficult to measure the contribution of the features towards the prediction task, unless the model is evaluated using target feature ablation."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-49",
"text": "For demonstration purposes, we have followed the experimental setup by LiKamWa et al. [13] , which is one of the leading works in this field."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-50",
"text": "P2 Inferring test set labels: When training personalised models (LOIOCV) in a longitudinal study, it is important to make sure that there are no overlapping instances across consecutive time windows."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-51",
"text": "Some past works have extracted features {f(t \u2212 N), ..., f(t)} over N days, in order to predict the score on day N + 1 [3, 13]."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-52",
"text": "Such approaches are biased if there are overlapping days of train/test data."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-53",
"text": "To illustrate this problem we have followed the approach by Canzian and Musolesi [3] , as one of the pioneering works on predicting depression with GPS traces, on a longitudinal basis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-54",
"text": "P3 Predicting users instead of mood scores: Most approaches merge all the instances from different subjects, in an attempt to build user-agnostic models in a randomised cross-validation framework [2, 9, 10, 26] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-55",
"text": "This is problematic, especially when dealing with a small number of subjects, whose behaviour (as captured through their data) and mental health scores differ on an individual basis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-56",
"text": "Such approaches are in danger of \"predicting\" the user in the test set, since her (test set) features might be highly correlated with her features in the training set, and thus infer her average well-being score, based on the corresponding observations of the training set."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-57",
"text": "Such approaches cannot guarantee that they will generalise on either a population-wide (LOUOCV) or a personalised (LOIOCV) level."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-58",
"text": "In order to examine this effect in both a regression and a classification setting, we have followed the experimental framework by Tsakalidis et al. [26] and Jaques et al. [9] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-60",
"text": "**P1: TRAINING ON PAST VALUES OF THE TARGET (LOIOCV, LOUOCV)**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-61",
"text": "LiKamWa et al. [13] collected smartphone data from 32 subjects over a period of two months."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-62",
"text": "The subjects were asked to self-report their \"pleasure\" and \"activeness\" scores at least four times a day, following a Likert scale (1 to 5), and the average daily scores served as the two targets."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-63",
"text": "The authors aggregated various features on social interactions (e.g., number of emails sent to frequently interacting contacts) and routine activities (e.g., browsing and location history) derived from the smartphones of the participants."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-64",
"text": "These features were extracted over a period of three days, along with the two most recent scores on activeness and pleasure."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-65",
"text": "The issue that naturally arises is that such a method cannot generalise to new subjects in the LOUOCV setup, as it requires their last two days of self-assessed scores."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-66",
"text": "Moreover, in the LOIOCV setup, the approach is limited in a real world setting, since it requires the previous mental health scores by the subject to provide an estimate of her current state."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-67",
"text": "Even in this case, though, feature extraction should be based on past information only: under LOIOCV in [13], the current mood score we aim to predict is also used as a feature in the (time-wise) subsequent two instances of the training data."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-68",
"text": "Experiments in [13] are conducted under LOIOCV and LOUOCV, using Multiple Linear Regression (LR) with Sequential Feature Selection (in LOUOCV, the past two pairs of target labels of the test user are still used as features)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-69",
"text": "In order to better examine the effectiveness of the features for the task, the same model can be tested without any ground-truth data as input."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-70",
"text": "Nevertheless, a simplistic model predicting the per-subject average outperforms their LR in the LOUOCV approach, which poses the question of whether the smartphone-derived features can be used effectively to create a generalisable model that can assess the mental health of unobserved users."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-71",
"text": "Finally, the same model tested in the LOIOCV setup achieves the lowest error; however, this is trained not only on target scores overlapping with the test set, but also on features derived over a period of three days, introducing further potential bias, as discussed in the following."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-73",
"text": "**P2: INFERRING TEST LABELS (LOIOCV)**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-74",
"text": "Canzian and Musolesi [3] extracted mobility metrics from 28 subjects to predict their depressive state, as derived from their daily self-reported PHQ-8 questionnaires."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-75",
"text": "A 14-day moving average filter is first applied to the PHQ-8 scores and the mean value of the same day (e.g. Monday) is subtracted from the normalised scores, to avoid cyclic trends."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-76",
"text": "This normalisation results into making the target score s t on day t dependent on the past {s t\u221214 , ..., s t\u22121 } scores."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-77",
"text": "The normalised PHQ-8 scores are then converted into two classes, with the instances deviating more than one standard deviation above the mean score of a subject being assigned to the class \"1\" (\"0\", otherwise)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-78",
"text": "The features are extracted over various time windows (looking at T HIST = {0, ..., 14} days before the completion of a mood form) and personalised model learning and evaluation are performed for every T HIST separately, using a LOIOCV framework."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-79",
"text": "What is notable is that the results improve significantly when features are extracted from a wider T HIST window."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-80",
"text": "This could imply that the depressive state of an individual can be detected with a high accuracy if we look back at her history."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-81",
"text": "However, by training and testing a model on instances whose features are derived from the same days, there is a high risk of over-fitting the model to the timestamp of the day in which the mood form was completed."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-82",
"text": "In the worstcase scenario, there will be an instance in the train set whose features (e.g. total covered distance) are derived from the 14 days, 13 of which will also be used for the instance in the test set."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-83",
"text": "Additionally, the target values of these two instances will also be highly correlated due to the moving average filter, making the task artificially easy for large T HIST and not applicable in a real-world setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-84",
"text": "While we focus on the approach in [3] , a similar approach with respect to feature extraction was also followed in LiKamWa et al. [13] and Bogomolov et al. [2] , extracting features from the past 2 and 2 to 5 days, respectively."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-86",
"text": "**P3: PREDICTING USERS (LOUOCV)**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-87",
"text": "Tsakalidis et al. [26] monitored the behaviour of 19 individuals over four months."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-88",
"text": "The subjects were asked to complete two psychological scales [25, 29] on a daily basis, leading to three target scores (positive, negative, mental well-being); various features from smartphones (e.g., time spent on the preferred locations) and textual features (e.g., ngrams) were extracted passively over the 24 hours preceding a mood form timestamp."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-89",
"text": "Model training and evaluation was performed in a randomised (MIXED) cross-validation setup, leading to high accuracy (R 2 = 0.76)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-90",
"text": "However, a case demonstrating the potential user bias is when the models are trained on the textual sources: initially the highest R 2 (0.22) is achieved when a model is applied to the mental-wellbeing target; by normalising the textual features on a per-user basis, the R 2 increases to 0.65."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-91",
"text": "While this is likely to happen because the vocabulary used by different users is normalised, there is also the danger of over-fitting the trained model to the identity of the user."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-92",
"text": "To examine this potential, the LOIOCV/LOUOCV setups need to be studied alongside the MIXED validation approach, with and without the per-user feature normalisation step."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-93",
"text": "A similar issue is encountered in Jaques et al. [9] who monitored 68 subjects over a period of a month."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-94",
"text": "Four types of features were extracted from survey and smart devices carried by subjects."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-95",
"text": "Self-reported scores on a daily basis served as the ground truth."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-96",
"text": "The authors labelled the instances with the top 30% of all the scores as \"happy\" and the lowest 30% as \"sad\" and randomly separated them into training, validation and test sets, leading to the same user bias issue."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-97",
"text": "Since different users exhibit different mood scores on average [26] , by selecting instances from the top and bottom scores, one might end up separating users and convert the mood prediction task into a user identification one."
},
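{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-97a",
"text": "As an illustrative (hypothetical) example: if all of user A's mood scores lie in the top 30% of the pooled scores and all of user B's lie in the bottom 30%, then every instance of A is labelled \"happy\" and every instance of B \"sad\", so a classifier that merely recognises which user generated an instance achieves perfect accuracy without using any mood-related signal."
},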
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-98",
"text": "A more suitable task could have been to try to predict the highest and lowest scores of every individual separately, either in a LOIOCV or in a LOUOCV setup."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-99",
"text": "While we focus on the works of Tsakalidis et al. [26] and Jaques et al. [9] , similar experimental setups were also followed in [10] , using the median of scores to separate the instances and performing five-fold cross-validation, and by Bogomolov et al. in [2] , working on a user-agnostic validation setting on 117 subjects to predict their happiness levels, and in [1] , for the stress level classification task."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-101",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-103",
"text": "**DATASETS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-104",
"text": "By definition, the aforementioned issues are feature-, dataset-and target-independent (albeit the magnitude of the effects may vary)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-105",
"text": "To illustrate this, we run a series of experiments employing two datasets, with different feature sources and four different mental health targets."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-107",
"text": "**DATASET 1:**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-108",
"text": "We employed the dataset obtained by Tsakalidis et al. [26] , a pioneering dataset which contains a mix of longitudinal textual and mobile phone usage data for 30 subjects."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-109",
"text": "From a textual perspective, this dataset consists of social media posts (1,854/5,167 facebook/twitter posts) and private messages (64,221/132/47,043 facebook/twitter/ SMS messages) sent by the subjects."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-110",
"text": "For our ground truth, we use the {positive, negative, mental well-being} mood scores (in the ranges of , , , respectively) derived from self-assessed psychological scales during the study period."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-112",
"text": "**DATASET 2:**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-113",
"text": "We employed the StudentLife dataset [28] , which contains a wealth of information derived from the smartphones of 48 students during a 10-week period."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-114",
"text": "Such information includes samples of the detected activity of the subject, timestamps of detected conversations, audio mode of the smartphone, status of the smartphone (e.g., charging, locked), etc."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-115",
"text": "For our target, we used the selfreported stress levels of the students (range [0-4]), which were provided several times a day."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-116",
"text": "For the approach in LiKamWa et al. [13] , we considered the average daily stress level of a student as our ground-truth, as in the original paper; for the rest, we used all of the stress scores and extracted features based on some time interval preceding their completion, as described next, in 4.3 4 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-118",
"text": "**TASK DESCRIPTION**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-119",
"text": "We studied the major issues in the following experimental settings (see Table 3 ):"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-120",
"text": "P1: Using Past Labels: We followed the experimental setting in [13] (see section 3.1): we treated our task as a regression problem and used Mean Squared Error (MSE) and classification accuracy 5 for evaluation."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-121",
"text": "We trained a Linear Regression (LR) model and performed feature selection using Sequential Feature Selection under the LOIOCV and LOUOCV setups; feature extraction is performed over the previous 3 days preceding the completion of a mood form."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-122",
"text": "For comparison, we use the same baselines as in [13] : Model A always predicts the average mood score for a certain user (AVG); Model B predicts the last entered scores (LAST); Model C makes a prediction using the LR model trained on the ground-truth features only (-feat)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-123",
"text": "We also include Model D, trained on non-target features only (-mood) in an unbiased LOUOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-125",
"text": "**P2: INFERRING TEST LABELS:**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-126",
"text": "We followed the experimental setting presented in [3] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-127",
"text": "We process our ground-truth in the same way as the original paper (see 4 For P3, this creates the P2 cross-correlation issue in the MIXED/LOIOCV settings."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-128",
"text": "For this reason, we ran the experiments by considering only the last entered score in a given day as our target."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-129",
"text": "We did not witness any major differences that would alter our conclusions."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-130",
"text": "5 Accuracy is defined in [13] as follows: 5 classes are assumed (e.g., [0, ..., 4]) and the squared error e between the centre of a class halfway towards the next class is calculated (e.g., 0.25)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-131",
"text": "If the squared error of a test instance is smaller than e, then it is considered as having been classified correctly."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-132",
"text": "section 3.2) and thus treat our task as a binary classification problem."
},
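{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-131a",
"text": "As an illustrative (hypothetical) example of this accuracy measure: for a true class 2, a prediction of 2.4 yields a squared error of 0.16 < 0.25 and is counted as correct, whereas a prediction of 2.6 yields 0.36 > 0.25 and is counted as incorrect."
},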
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-133",
"text": "We use an SVM RBF classifier, using grid search for parameter optimisation, and perform evaluation using specificity and sensitivity."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-134",
"text": "We run experiments in the LOIOCV and LOUOCV settings, performing feature extraction at different time windows (T HIST = {1, ..., 14})."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-135",
"text": "In order to better demonstrate the problem that arises here, we use the previous label classifier (LAST) and the SVM classifier to which we feed only the mood timestamp as a feature (DATE) for comparison."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-136",
"text": "Finally, we replace our features with completely random data and train the same SVM with T HIST = 14 by keeping the same ground truth, performing 100 experiments and reporting averages of sensitivity and specificity (RAND)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-138",
"text": "**P3: PREDICTING USERS:**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-139",
"text": "We followed the evaluation settings of two past works (see section 3.3), with the only difference being the use of 5-fold CV instead of a train/dev/test split that was used in [9] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-140",
"text": "The features of every instance are extracted from the past day before the completion of a mood form."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-141",
"text": "In Experiment 1 we follow the setup in [26] : we perform 5-fold CV (MIXED) using SVM (SVR RBF ) and evaluate performance based on R 2 and RM SE."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-142",
"text": "We compare the performance when tested under the LOIOCV /LOUOCV setups, with and without the per-user feature normalisation step."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-143",
"text": "We also compare the performance of the MIXED setting, when our model is trained on the one-hot-encoded user id only."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-144",
"text": "In Experiment 2 we follow the setup in [9] : we label the instances as \"high\" (\"low\"), if they belong to the top-30% (bottom-30%) of mood score values (\"UNIQ\" -for \"unique\" -setup)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-145",
"text": "We train an SVM classifier in 5-fold CV using accuracy for evaluation and compare performance in the LOIOCV and LOUOCV settings."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-146",
"text": "In order to further examine user bias, we perform the same experiments, this time by labelling the instances on a per-user basis (\"PERS\" -for \"personalised\" -setup), aiming to predict the per-user high/low mood days 6 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-147",
"text": "Table 3 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-148",
"text": "Summary of experiments."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-149",
"text": "The highlighted settings indicate the settings used in the original papers; \"Period\" indicates the period before each mood form completion during which the features were extracted."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-151",
"text": "**FEATURES**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-152",
"text": "For Dataset 1, we first defined a \"user snippet\" as the concatenation of all texts generated by a user within a set time interval, such that the maximum time difference between two consecutive document timestamps is less than 20 minutes."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-153",
"text": "We performed some standard noise reduction steps (converted text to lowercase, replaced URLs/user mentions and performed language identification 7 and tokenisation [6] )."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-154",
"text": "Given a mood form and a set of snippets produced by a user before the completion of a mood form, we extracted some commonly used feature sets for every snippet written in English [26] , which were used in all experiments."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-155",
"text": "To ensure sufficient data density, we excluded users for whom we had overall fewer than 25 snippets on the days before the completion of the mood form or fewer than 40 mood forms overall, leading to 27 users and 2, 368 mood forms."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-156",
"text": "For Dataset 2, we extracted the features presented in Table 4 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-157",
"text": "We only kept the users that had at least 10 self-reported stress questionnaires, leading to 44 users and 2, 146 instances."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-158",
"text": "For our random experiments used in P2, in Dataset 1 we replaced the text representation of every snippet with random noise (\u00b5 = 0, \u03c3 = 1) of the same feature dimensionality; in Dataset 2, we replaced the actual inferred value of every activity/audio sample with a random inference class; we also replaced each of the detected conversation samples and samples detected in a dark environment/locked/charging, with a random number (<100, uniformly distributed) indicating the number of pseudo-detected samples."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-159",
"text": "Table 5 presents the results on the basis of the methodology by LiKamWa et al. [13] , along with the average scores reported in [13] -note that the range of the mood scores varies on a per-target basis; hence, the reported results of different models should be compared among each other when tested on the same target."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-160",
"text": "As in [13] , always predicting the average score (AVG) for an unseen user performs better than applying a LR model trained on other users in a LOUOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-161",
"text": "If the same LR model used in LOUOCV is trained without using the previously self-reported ground-truth scores (Model D, -mood), its performance drops further."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-162",
"text": "This showcases that personalised models are needed for more Table 5 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-163",
"text": "P1: Results following the approach in [13] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-165",
"text": "**RESULTS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-166",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-167",
"text": "**P1: USING PAST LABELS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-168",
"text": "accurate mental health assessment (note that the AVG baseline is, in fact, a personalised baseline) and that there is no evidence that we can employ effective models in real-world applications to predict the mental health of previously unseen individuals, based on this setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-169",
"text": "The accuracy of LR under LOIOCV is higher, except for the \"stress\" target, where the performance is comparable to LOUOCV and lower than the AVG baseline."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-170",
"text": "However, the problem in LOIOCV is the fact that the features are extracted based on the past three days, thus creating a temporal cross-correlation in our input space."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-171",
"text": "If a similar correlation exists in the output space (target), then we end up in danger of overfitting our model to the training examples that are temporally close to the test instance."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-172",
"text": "This type of bias is essentially present if we force a temporal correlation in the output space, as studied next."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-174",
"text": "**P2: INFERRING TEST LABELS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-175",
"text": "The charts in Fig. 1 (top) show the results by following the LOIOCV approach from Canzian and Musolesi [3] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-176",
"text": "The pattern that these metrics take is consistent and quite similar to the original paper: specificity remains at high values, while sensitivity increases as we increase the time window from which we extract our features."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-177",
"text": "The charts on the bottom in Fig. 1 show the corresponding results in the LOUOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-178",
"text": "Here, such a generalisation is not feasible, since the increases in sensitivity are accompanied by sharp drops in the specificity scores."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-179",
"text": "The arising issue though lies in the LOIOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-180",
"text": "By training and testing on the same days (for T HIST > 1), the kernel matrix takes high values for cells which are highly correlated with respect to time, making the evaluation of the contribution of the features difficult."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-181",
"text": "To support this statement, we train the same model under LOIOCV, using only on the mood form completion date (Unix epoch) as a feature."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-182",
"text": "The results are very similar to those achieved by training on T HIST = 14 (see Table 6 )."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-183",
"text": "We also include the results of another na\u00efve classifier (LAST), predicting always the last observed score in the training set, which again achieves similar results."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-184",
"text": "The clearest demonstration of the problem though is by comparing the results of the RAND against the FEAT classifier, which shows that under the proposed evaluation setup we can achieve similar performance if we replace our inputs with random data, clearly demonstrating the temporal bias that can lead to over-optimistic results, even in the LOIOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-185",
"text": "Fig. 1 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-186",
"text": "P2: Sensitivity/specificity (blue/red) scores over the {positive, negative, wellbeing, stress} targets by training on different time windows on the LOIOCV (top) and LOUOCV (bottom) setups, similar to [3] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-187",
"text": "Table 6 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-188",
"text": "P2: Performance (sensitivity/specificity) of the SVM classifier trained over 14 days of smartphone/social media features (FEAT) compared against 3 na\u00efve baselines."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-190",
"text": "**P3: PREDICTING USERS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-191",
"text": "Experiment 1: Table 7 shows the results based on the evaluation setup of Tsakalidis et al. [26] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-192",
"text": "In the MIXED cases, the pattern is consistent with [26] , indicating that normalising the features on a per-user basis yields better results, when dealing with sparse textual features (positive, negative, wellbeing targets)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-193",
"text": "The explanation of this effect lies within the danger of predicting the user's identity instead of her mood scores."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-194",
"text": "This is why the per-user normalisation does not have any effect for the stress target, since for that we are using dense features derived from smartphones: the vocabulary used by the subjects for the other targets is more indicative of their identity."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-195",
"text": "In order to further support this statement, we trained the SVR model using only the one-hot encoded user id as a feature, without any textual features."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-196",
"text": "Our results yielded R 2 ={0.64, 0.50, 0.66} and RM SE={5.50, 5.32, 6.50} for the {positive, negative, wellbeing} targets, clearly demonstrating the user bias in the MIXED setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-197",
"text": "The RMSEs in LOIOCV are the lowest, since different individuals exhibit different ranges of mental health scores."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-198",
"text": "Nevertheless, R 2 is slightly negative, implying again that the average predictor for a single user provides a better estimate for her mental health score."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-199",
"text": "Note that while the predictions across all individuals seem to be very accurate (see Fig. 2 ), by separating them on a per-user basis, we end up with a negative R 2 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-200",
"text": "In the unbiased LOUOCV setting the results are, again, very poor."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-201",
"text": "The reason for the high differences observed between the three settings is provided by the R 2 formula itself (1"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-202",
"text": ")."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-203",
"text": "In the MIXED case, we train and test on the same users, while\u0233 is calculated as the mean of the mood scores across all users, whereas in the LOIOCV /LOUOCV cases,\u0233 is calculated for every user separately."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-204",
"text": "In MIXED, by identifying who the user is, we have a rough estimate of her mood score, which is by itself a good predictor, if it is compared with the average predictor across all mood scores of all users."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-205",
"text": "Thus, the effect of the features in this setting cannot be assessed with certainty."
},
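{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-205a",
"text": "As an illustrative (hypothetical) example: if a user's scores are {10, 12, 14} and a model always predicts the pooled mean of 20, the per-user mean is \u0233 = 12, and R^2 = 1 - ((10-20)^2 + (12-20)^2 + (14-20)^2) / ((10-12)^2 + (12-12)^2 + (14-12)^2) = 1 - 200/8 = -24; predictions that track the pooled mean can thus look reasonable in MIXED while being strongly negative per user."
},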
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-206",
"text": "Table 7 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-207",
"text": "P3: Results following the evaluation setup in [26] (MIXED), along with the results obtained in the LOIOCV and LOUOCV settings with (+) and without (-) per-user input normalisation."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-208",
"text": "Table 8 displays our results based on Jaques et al. [9] (see section 3.3)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-209",
"text": "The average accuracy on the \"UNIQ\" setup is higher by 14% compared to the majority classifier in MIXED."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-210",
"text": "The LOIOCV setting also yields very promising results (mean accuracy: 81.17%)."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-211",
"text": "As in all previous cases, in LOUOCV our models fail to outperform the majority classifier."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-212",
"text": "A closer look at the LOIOCV and MIXED results though reveals the user bias issue that is responsible for the high accuracy."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-213",
"text": "For example, 33% of the users had all of their \"positive\" scores binned into one class, as these subjects were exhibiting higher (or lower) mental health scores throughout the experiment, whereas another 33% of the subjects had 85% of their instances classified into one class."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-214",
"text": "By recognising the user, we can achieve high accuracy in the MIXED setting; in the LOIOCV, the majority classifier can also achieve at least 85% accuracy for 18/27 users."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-215",
"text": "Table 8 ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-216",
"text": "P3: Accuracy by following the evaluation setup in [9] (MIXED), along with the results obtained in LOIOCV & LOUOCV."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-218",
"text": "**EXPERIMENT 2:**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-219",
"text": "In the \"PERS\" setup, we removed the user bias, by separating the two classes on a per-user basis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-220",
"text": "The results now drop heavily even in the two previously well-performing settings and can barely outperform the majority classifier."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-221",
"text": "Note that the task in Experiment 2 is relatively easier, since we are trying to classify instances into two classes which are well-distinguished from each other from a psychological point of view."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-222",
"text": "However, by removing the user bias, the contribution of the user-generated features to this task becomes once again unclear."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-223",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-224",
"text": "**PROPOSAL FOR FUTURE DIRECTIONS**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-225",
"text": "Our results emphasize the difficulty of automatically predicting individuals' mental health scores in a real-world setting and demonstrate the dangers due to flaws in the experimental setup."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-226",
"text": "Our findings do not imply that the presented issues will manifest themselves to the same degree in different datasets -e.g., the danger of predicting the user in the MIXED setting is higher when using the texts of 27 users rather than sensor-based features of more users [1, 2, 9, 22] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-227",
"text": "Nevertheless, it is crucial to establish appropriate evaluation settings to avoid providing false alarms to users, if our aim is to build systems that can be deployed in practice."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-228",
"text": "To this end, we propose model building and evaluation under the following:"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-229",
"text": "-LOUOCV: By definition, training should be performed strictly on features and target data derived from a sample of users and tested on a completely new user, since using target data from the unseen user as features violates the independence hypothesis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-230",
"text": "A model trained in this setting should achieve consistently better results on the unseen user compared to the na\u00efve (from a modelling perspective) model that always predicts his/her average score."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-231",
"text": "-LOIOCV: By definition, the models trained under this setting should not violate the iid hypothesis."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-232",
"text": "We have demonstrated that the temporal dependence between instances in the train and test set can provide over-optimistic results."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-233",
"text": "A model trained on this setting should consistently outperform na\u00efve, yet competitive, baseline methods, such as the last-entered mood score predictor, the user's average mood predictor and the auto-regressive model."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-234",
"text": "Models that can be effectively applied in any of the above settings could revolutionise the mental health assessment process while providing us in an unbiased setting with great insights on the types of behaviour that affect our mental wellbeing."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-235",
"text": "On the other hand, positive results in the MIXED setting cannot guarantee model performance in a real-world setting in either LOUOCV or LOIOCV, even if they are compared against the user average baseline [4] ."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-236",
"text": "Transfer learning approaches can provide significant help in the LOUOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-237",
"text": "However, these assume that single-domain models have been effectively learned beforehand -but all of our single-user (LOIOCV ) experiments provided negative results."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-238",
"text": "Better feature engineering through latent feature representations may prove to be beneficial."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-239",
"text": "While different users exhibit different behaviours, these behaviours may follow similar patterns in a latent space."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-240",
"text": "Such representations have seen great success in recent years in the field of natural language processing [15] , where the aim is to capture latent similarities between seemingly diverse concepts and represent every feature based on its context."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-241",
"text": "Finally, working with larger datasets can help in providing more data to train on, but also in assessing the model's ability to generalise in a more realistic setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-242",
"text": "----------------------------------"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-243",
"text": "**CONCLUSION**"
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-244",
"text": "Assessing mental health with digital media is a task which could have great impact on monitoring of mental well-being and personalised health."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-245",
"text": "In the current paper, we have followed past experimental settings to evaluate the contribution of various features to the task of automatically predicting different mental health indices of an individual."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-246",
"text": "We find that under an unbiased, real-world setting, the performance of state-of-the-art models drops significantly, making the contribution of the features impossible to assess."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-247",
"text": "Crucially, this holds for both cases of creating a model that can be applied in previously unobserved users (LOUOCV ) and a personalised model that is learned for every user individually (LOIOCV )."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-248",
"text": "Our major goal for the future is to achieve positive results in the LOUOCV setting."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-249",
"text": "To overcome the problem of having only few instances from a diversely behaving small group of subjects, transfer learning techniques on latent feature representations could be beneficial."
},
{
"sent_id": "ef6f1050651a4c3ac9a53438ac1f87-C001-250",
"text": "A successful model in this setting would not only provide us with insights on what types of behaviour affect mental state, but could also be employed in a real-world system without the danger of providing false alarms to its users."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-3"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-14"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-25"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-35"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-37"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-54"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-87"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-97"
]
],
"cite_sentences": [
"ef6f1050651a4c3ac9a53438ac1f87-C001-3",
"ef6f1050651a4c3ac9a53438ac1f87-C001-14",
"ef6f1050651a4c3ac9a53438ac1f87-C001-25",
"ef6f1050651a4c3ac9a53438ac1f87-C001-35",
"ef6f1050651a4c3ac9a53438ac1f87-C001-37",
"ef6f1050651a4c3ac9a53438ac1f87-C001-54",
"ef6f1050651a4c3ac9a53438ac1f87-C001-87",
"ef6f1050651a4c3ac9a53438ac1f87-C001-97"
]
},
"@USE@": {
"gold_contexts": [
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-58"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-108"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-141"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-191"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-207"
]
],
"cite_sentences": [
"ef6f1050651a4c3ac9a53438ac1f87-C001-58",
"ef6f1050651a4c3ac9a53438ac1f87-C001-108",
"ef6f1050651a4c3ac9a53438ac1f87-C001-141",
"ef6f1050651a4c3ac9a53438ac1f87-C001-191",
"ef6f1050651a4c3ac9a53438ac1f87-C001-207"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-99"
],
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-154"
]
],
"cite_sentences": [
"ef6f1050651a4c3ac9a53438ac1f87-C001-99",
"ef6f1050651a4c3ac9a53438ac1f87-C001-154"
]
},
"@SIM@": {
"gold_contexts": [
[
"ef6f1050651a4c3ac9a53438ac1f87-C001-192"
]
],
"cite_sentences": [
"ef6f1050651a4c3ac9a53438ac1f87-C001-192"
]
}
}
},
"ABC_50d065b6b187f361f8e456df0a0bbe_5": {
"x": [
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-3",
"text": "Some of these are source language independent while others are not."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-76",
"text": "**EXPERIMENT**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-101",
"text": "ancestor."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-2",
"text": "Recently, translation scholars have made some general claims about translation properties."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-166",
"text": "We take Bulgarian as the target language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-4",
"text": "Koppel and Ordan (2011) performed empirical studies to validate both types of properties using English source texts and other texts translated into English."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-5",
"text": "Obviously, corpora of this sort, which focus on a single language, are not adequate for claiming universality of translation properties."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-6",
"text": "In this paper, we are validating both types of translation properties using original and translated texts from six European languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-9",
"text": "Even though it is content words that are semantically rich, function words also play an important role in a text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-10",
"text": "Function words are more frequent and predictable than content words."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-11",
"text": "Generally, function words carry grammatical information about content words."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-12",
"text": "High frequency function words are relatively shorter than mid/low frequency function words (Bell et al., 2008) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-13",
"text": "Due to their high frequency in texts and their grammatical role, function words also indicate authorial style (Argamon and Levitan, 2005) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-14",
"text": "These words could play an important role in translated text and in the translation process."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-15",
"text": "Source and translation classification is useful for some Natural Language Processing (NLP) applications."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-16",
"text": "Lembersky et al. (2011) have shown that a language model from translated text improves the performace of a Machine Translation (MT) system."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-17",
"text": "A source and translation classifier can be used to identify translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-18",
"text": "This application also can be used to detect plagiarism where the plagiarised text is translated from another language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-19",
"text": "From the early stage of translation studies research, translation scholars proposed different kinds of properties of source text and translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-20",
"text": "Recently, scholars in this area identified several properties of the translation process with the aid of corpora (Baker, 1993; Baker, 1996; Olohan, 2001; Laviosa, 2002; Hansen, 2003; Pym, 2005) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-21",
"text": "These properties are subsumed under four keywords: explicitation, simplification, normalization and levelling out."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-22",
"text": "They focus on the general effects of the translation process."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-23",
"text": "Toury (1995) has a different theory from these."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-24",
"text": "He stated that some interference effects will be observable in the translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-25",
"text": "That is, a translated text will carry some fingerprints of its source language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-26",
"text": "Specific properties of the English language are visible in user manuals that have been translated to other languages from English (for instance, word order) (Lzwaini, 2003) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-27",
"text": "Recently, Pastor et al. (2008) and Ilisei et al. (2009; have provided empirical evidence of simplification translation properties using a comparable corpus of Spanish."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-28",
"text": "Koppel and Ordan (2011) perform empirical studies to validate both theories, using a subcorpus extracted from the Europarl (Koehn, 2005) and IHT corpora (Koppel and Ordan, 2011) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-29",
"text": "They used a comparable corpus of original English and English translated from five other European languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-30",
"text": "In addition, original English and English translated from Greek and Korean was also used in their experiment."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-31",
"text": "They have found that a translated text contains both source language dependent and independent features."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-32",
"text": "Obviously, corpora of this sort, which focus on a single language (e.g., English), are not adequate for claiming the universal validity of translation properties."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-33",
"text": "Different languages (and language families) have different linguistic properties."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-34",
"text": "A corpus that contains original and translated texts from different source languages will be ideal for this kind of study."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-35",
"text": "In this paper, we are validating both types of translation properties using original and translated texts from six European languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-36",
"text": "As features, we used frequencies of the 100 most frequent words of each target language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-37",
"text": "The paper is organized as follows: Section 2 discusses related work, followed by an introduction of our corpus in Section 3."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-38",
"text": "The experiment and evaluation in Section 4 are followed by a discussion in Section 5."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-39",
"text": "Finally, we present conclusions and future work in Section 6."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-41",
"text": "**RELATED WORK**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-42",
"text": "Corpus-based translation studies is a recent field of research with a growing interest within the field of computational linguistics."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-43",
"text": "Baroni and Bernardini (2006) started corpus-based translation studies empirically, where they work on a corpus of geo-political journal articles."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-44",
"text": "A Support Vector Machine (SVM) was used to distinguish original and translated Italian text using n-gram based features."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-45",
"text": "According to their results, word bigrams play an important role in the classification task."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-46",
"text": "Van Halteren (2008) uses the Europarl corpus for the first time to identify the source language of text for which the source language marker was missing."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-47",
"text": "Support vector regression was the best performing method."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-48",
"text": "Pastor et al. (2008) and Ilisei et al. (2009; perform classification of Spanish original and translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-49",
"text": "The focus of their works is to investigate the simplification relation that was proposed by (Baker, 1996) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-50",
"text": "In total, 21 quantitative features (e.g. a number of different POS, average sentence length, the parse-tree depth etc.) were used where, nine (9) of them are able to grasp the simplification translation property."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-51",
"text": "Koppel and Ordan (2011) have built a classifier that can identify the correct source of the translated text (given different possible source languages)."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-52",
"text": "They have built another classifier which can identify source text and translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-53",
"text": "Furthermore, they have shown that the degree of difference between two translated texts, translated from two different languages into the same target language reflects, the degree of difference of the source languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-54",
"text": "They have gained impressive results for both of the tasks."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-55",
"text": "However, the limitation of this study is that they only used a corpus of English original text and English text translated from various European languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-56",
"text": "A list of 300 function words (Pennebaker et al., 2001 ) was used as feature vector for these classifications."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-57",
"text": "Popescu (2011) uses string kernels (Lodhi et al., 2002) to study translation properties."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-58",
"text": "A classifier was built to classify English original texts and English translated texts from French and German books that were written in the nineteenth century."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-59",
"text": "The p-spectrum normalized kernel was used for the experiment."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-60",
"text": "The system works on a character level rather than on a word level."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-61",
"text": "The system performs poorly when the source language of the training corpus is different from the one of the test corpus."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-62",
"text": "We can not compare our findings directly with Koppel and Ordan (2011) even though we use text from the same corpus and similar techniques."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-63",
"text": "The English language is not considered for this study due to unavailability of English translations for some languages included in this work."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-64",
"text": "Furthermore, instead of the list of 300 function words used by Koppel and Ordan (2011) , we used the 100 most frequent words for each candidate language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-66",
"text": "**DATA**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-67",
"text": "The field of translation studies lacks a multilingual corpus that can be used to validate translation properties proposed by translation scholars."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-68",
"text": "There are many multilingual corpora available used for different NLP applications."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-69",
"text": "A customized version of the Europarl corpus (Islam and Mehler, 2012 ) is freely available for corpus-based translation studies."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-70",
"text": "However, this corpus is not suitable for the experiment we are performing here."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-71",
"text": "We extract a suitable corpus from the Europarl corpus in a way similar to Lembersky et al. (2011) and Koppel and Ordan (2011) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-72",
"text": "Our target is to extract texts that are translated from and to the languages considered here."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-73",
"text": "We trust the source language marker that has been put by the respective translator, as did Lembersky et al.(2011) and Koppel and Ordan (2011) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-74",
"text": "To experiment with stylistic differences in translated text, a list of function words and their"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-77",
"text": "In order to validate two different kinds of translation properties mentioned in Section 1, two different experiments will be performed."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-78",
"text": "For the first experiment, our hypothesis is that texts translated into the same language from different source languages have different properties, a trained classifier will be able to classify texts based on different sources."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-79",
"text": "Our second hypothesis is that translated texts are distinguishable from source texts; a classifier can be trained to identify translated and original texts."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-80",
"text": "Note that we use the Naive Bayes multinomial classifier (Mccallum and Nigam, 1998) in WEKA (Hall et al., 2009 ) for classification."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-81",
"text": "To overcome the data over-fitting problem, we randomly generate training and test set N times and calculate the weighted average of F-Score and Ac-German Dutch French Spanish Polish Czech German -197 197 198 201 197 Dutch 197 -197 198 198 191 French 148 147 -148 149 157 Spanish 148 147 148 -148 148 Polish 151 141 149 148 -129 Czech 140 164 149 148 151 -Table 2 : Source language identification corpus (chunks)"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-82",
"text": "curacy."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-83",
"text": "In this experiment the value of N is 100."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-84",
"text": "The randomly generated training sets contain 80% of the data while the remaining data is used as a test set."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-85",
"text": "To evaluate the classification results, we use standard F-Score and Accuracy measures."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-87",
"text": "**SOURCE LANGUAGE IDENTIFICATION**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-88",
"text": "In this experiment, our goal is to validate the translation properties postulated by Toury (1995) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-89",
"text": "He stated that a translated text inherits some fingerprints from the source language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-90",
"text": "The experimental result of Koppel and Ordan (2011) shows that text translated into English holds this property."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-91",
"text": "If this characteristic also holds for text translated into other languages, then it will corroborate the claim by Toury (1995) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-92",
"text": "If it does not hold for a single language then it might be claimed that this translation property is not universal."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-93",
"text": "In order to train a classifier, we use texts translated into the same language from different source languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-94",
"text": "Table 1 shows the statistics of the corpus used for source language identification experiments."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-95",
"text": "Later, each corpus is divided into a number of chunks (see Table 2 )."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-96",
"text": "Each chunk contains at least seven sentences."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-97",
"text": "Our hypothesis is again similar to Koppel and Ordan (2011) , that is, if the classifier's accuracy is close to 20%, then we cannot say that there is an interference effect in translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-98",
"text": "If the classifier's accuracy is close to 100% then our conclusion will be that interference effects exist in translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-99",
"text": "Table 3 and Table 4 show the evaluation results."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-100",
"text": "Table 4 : Source language identification evaluation (Accuracy)"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-102",
"text": "In the vast majority of cases, members of the same language family share a considerable number of words and grammatical structures."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-103",
"text": "In the experiment, we consider three language families: Romance languages (French and Spanish), Germanic languages (German and Dutch), and Slavic languages (Polish and Czech)."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-104",
"text": "With a Romance target language, 5 the identification of other Romance and of Germanic languages as translation sources performs high, with an F-Score of between 0.86 and 0.95."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-105",
"text": "However, a noticeable drop in performance concerns the identification of the Slavic languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-106",
"text": "When we take a look at the confusion matrices for the respective classifications, we find that, for instance, most misclassifications in the French target language data are between the sources of Polish and Czech."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-107",
"text": "For Germanic target languages, the pattern repeats: when translated into German or Dutch, Polish and Czech texts are hardest to identify as the correct source."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-108",
"text": "The Slavic target languages show a different pattern."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-109",
"text": "Even in another Slavic target language, a Slavic source language cannot reliably be identified in our setting."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-110",
"text": "In addition to this, translations into Slavic are harder to distinguish from each other."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-111",
"text": "Misclassifications in this case show language family specific patterns: German is, for instance, most often misclassified as Dutch in both the Czech and the Polish data."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-112",
"text": "5 Target language refers to text translated into the language"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-114",
"text": "**SOURCE TRANSLATION CLASSIFICATION**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-115",
"text": "Translated texts have distinctive features that make them different from original or non translated text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-116",
"text": "According to Baker (1993; 1996) , Olohan (2001) , Lavisoa (2002) , Hansen (2003) , and Pym (2005) there are some general properties of translations that are responsible for the difference between these two text types."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-117",
"text": "Some of these properties are source and target language independent."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-118",
"text": "According to their findings, a translated text will be similar to another translated text but will be different from a source text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-119",
"text": "In the past, researchers have used comparable corpora to validate these translation properties (Baroni and Bernardini, 2006; Pastor et al., 2008; Ilisei et al., 2009; Ilisei et al., 2010; Koppel and Ordan, 2011) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-120",
"text": "Most of them used comparable corpora for two-class classification, distinguishing translated texts from the original texts."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-121",
"text": "Only Koppel and Ordan (Koppel and Ordan, 2011) used English texts translated from multiple source languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-122",
"text": "We perform similar experiments only for six European languages as shown in Table 1 ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-123",
"text": "In this experiment, the translated text in our training and test set will be a combination of all languages other than the target language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-124",
"text": "For example: when the original class contains original texts (source) in German, then the translation class contains texts that are translated German texts, translated from French, Dutch, Spanish, Polish, and Czech texts."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-125",
"text": "Each class contains 200 chunks of texts, where as the translated class has 40 chunks from each of the source languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-126",
"text": "The source language texts are extracted for the corresponding languages in a similar way from the Europarl corpus."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-127",
"text": "Koppel and Ordan (2011) received the highest accuracy (96.7%) among all works noted above."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-128",
"text": "The training and test data are generated in similar ways as in our previous experiment."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-129",
"text": "That is, 80% of the data is randomly extracted for training and the rest of the data is used for testing."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-130",
"text": "Expected F-Scores are calculated from 100 samples."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-131",
"text": "Table 5 shows the evaluation results."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-132",
"text": "Even though the classifier for German achieves around 99% accuracy, we cannot compare the result with Koppel and Ordan (Koppel and Ordan, 2011) as the amount of chunks for the classes are different."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-133",
"text": "The classifiers for other languages also display very high accuracy."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-134",
"text": "The result of Table 5 shows that general translation properties exist for all languages used in this experiment."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-136",
"text": "**DISCUSSION**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-137",
"text": "The results show that training a classifier based on the 100 most frequent words of a language is sufficient to obtain interpretable results."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-138",
"text": "We find our results to be compatible with Koppel and Ordan (2011) who used 300 function words."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-139",
"text": "A list of the 100 most frequent words is easily obtainable for a vast number of languages, while lists consisting strictly of function words are rare and cannot be produced without considerable additional effort."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-140",
"text": "While the 100 most frequent words of a language are sufficient to train a classifier for Germanic or Romance languages, it fails to perform equally well for Slavic languages."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-141",
"text": "Koppel and Ordan (2011) claim that Toury's (1995) findings of interference of a translation hold true; we find the assumption to be too simplistic, since for Slavic text either as a source or target language this statement cannot supported."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-142",
"text": "Although function words do exist in all the languages we examined, the language families differ in the degree to which it is necessary to use them."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-143",
"text": "For instance, French lacks a case system (Dryer and Haspelmath, 2011) , and makes instead use of prepositions."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-144",
"text": "On the other hand, Polish and Czech most extensively use (inflectional) affixes (Kulikov et al., 2006) ."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-145",
"text": "Regarding the distribution of word frequencies, for both Polish and Czech, the use of affixes causes a flatter Zipf curve."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-146",
"text": "Kwapien et al. (2010) put it so :\"...typical Polish texts have smaller \u03b1 [as exponent of the formula f (r) \u223c r \u2212\u03b1 ] than typical English texts (on average: 0.95 vs. 1.05).\" This means that on average a more frequent word does not differ as much in its frequency from a word 10 ranks further down in Polish as it does in English."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-147",
"text": "Consequently, there will be fewer instances of the 100 most frequent words in the same portion of text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-167",
"text": "The data showed that the classifier classifies Czech text no worse than Dutch or German and only slightly worse than French."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-148",
"text": "This is an obvious reason why a classifier's training must remain weaker in comparison to languages with a steeper Zipf curve."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-149",
"text": "There is a positive correlation to language family when considering the probability of finding the same strategy (e.g. prepositions vs. affixes)."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-150",
"text": "In summary, the fact that Slavic uses more affixes, or is more inflectional in linguistic terms, explains to some extent why the classifier performs worst for Slavic target text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-151",
"text": "However, for Slavic source texts, the classification results are equally unsatisfactory, which has to be explained differently."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-152",
"text": "One phenomenon contributing here could be that Romance and Germanic have a recent history of mutual loans and calques, which increases the probability of finding synonyms where one has a Romance origin and one a Germanic origin."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-153",
"text": "In the case of a translation, the translator, when confronted with such a synonym, might choose the item similar to the source language within the target language, as this minimizes the translation effort, complies thus to an economy principle and has virtually no effect on the translation."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-154",
"text": "6 Making this choice, the translator unintentionally distorts the native frequency patterns for the target language."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-155",
"text": "This could be one of the processes generating an imprint of translated text in the frequency spectrum, since function words are also subject to loaning and synonymy."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-156",
"text": "If the translator has a choice for translating a preposition/affix and neither of the possibilities is similar to the source language, nor a loanword or structurally similar, he/she will go for the predominant word or structure of the target language (since he/she is a native of the target language by translation industry standard), making the translation less different from native text."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-157",
"text": "The data can be influenced by many additional variables such as differing translation paradigms influencing the choice of structures (free translation vs. faithful translation), different industry standards, the size of the chunks, 7 the quality of the translation source marking, the native tongue of the translator(s), the time pressure for delivery, the payment, the membership of all sample languages to the European subbranch of Indo-European languages, the qualities of the lists of the most frequent 100 words, the genre of the Europarl corpus, and possibly many more."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-158",
"text": "This said, we believe the best hypothesis for the interpretation of the data is that a good classification result is reached firstly for languages with a more isolating structure, since they make less use of affixes and therefore more of function words, and should display steeper Zipf curves."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-159",
"text": "Secondly, the classification result should be better, the more instances the text contains, where the translator for one token (or for one structure) of the source language has the choice between at least two words or structures in the target language with one of those being similar to the source language, the other being different."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-160",
"text": "The number of such instances most probably correlates positively with the degree and quality of language relationship and language contact since the number of cognates, loans and calques does."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-161",
"text": "However, this number can also be \"accidentally high\" for two unrelated languages when they overlap in grammatical structure."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-162",
"text": "As has been postulated for instance by Croft (2003) , languages undergo a cyclic development from structurally more isolating towards agglutinative to inflectional and then back to isolating."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-163",
"text": "When a language is in a state of transition, which practically all languages are, they offer two structural encoding possibilities for one specific grammatical property, e.g., a genitive (for instance, inflectional (an affix) as in Peter's house and isolating (a preposition) as in the house of Peter)."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-164",
"text": "All languages should share structural properties, since there are only three types and each language has practically at least two."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-165",
"text": "Corroborating this rather complex hypothesis, we examine data on Bulgarian and Romanian."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-168",
"text": "When we replace Czech with Bulgarian and Spanish with Romanian in the German target language, the language family dependent pattern gets blurred and the identification of Polish performs quite well, that of Romanian relatively poor, while French is identified reliably."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-169",
"text": "This together with the observation that Romanian is misclassified either as Polish or Bulgarian and Bulgarian is mostly misclassified as Romanian seems to be a strong hint towards the impact of the language specific usage of function words, linguistic structure, and the importance of language contact."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-170",
"text": "Bulgarian and Romanian constitute the core of the most prominent linguistic contact zone or sprachbund ever written on the Balkans."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-171",
"text": "This suggests that Romanian and Bulgarian translators may, due to grammatical convergence of their languages make, given two equivalent structures in any target language, the same (structurally motivated) choices and hence leave a very similar imprint."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-172",
"text": "That is, sprachbund membership as well as language family could be decisive factors for a classifiers performance."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-174",
"text": "**CONCLUSION**"
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-175",
"text": "We have shown that interference as originally proposed by Toury (1995) is not supported by the data without making further assumptions."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-176",
"text": "Language family and language contact should be considered separately for each language pair as sources for possible weak results of a classifier even when operating with function words as should be general structural similarity."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-177",
"text": "As for the properties of translated text being universal, we found support for this in our data in a real n-ary validation setting."
},
{
"sent_id": "50d065b6b187f361f8e456df0a0bbe-C001-178",
"text": "We have also shown that the much more easily obtainable lists of the 100 most frequent words work almost as well for classification as do longer lists that contain only function words."
}
],
"y": {
"@MOT@": {
"gold_contexts": [
[
"50d065b6b187f361f8e456df0a0bbe-C001-2",
"50d065b6b187f361f8e456df0a0bbe-C001-3",
"50d065b6b187f361f8e456df0a0bbe-C001-4",
"50d065b6b187f361f8e456df0a0bbe-C001-5"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-20",
"50d065b6b187f361f8e456df0a0bbe-C001-21",
"50d065b6b187f361f8e456df0a0bbe-C001-22",
"50d065b6b187f361f8e456df0a0bbe-C001-23",
"50d065b6b187f361f8e456df0a0bbe-C001-25",
"50d065b6b187f361f8e456df0a0bbe-C001-27",
"50d065b6b187f361f8e456df0a0bbe-C001-28",
"50d065b6b187f361f8e456df0a0bbe-C001-29",
"50d065b6b187f361f8e456df0a0bbe-C001-30",
"50d065b6b187f361f8e456df0a0bbe-C001-31",
"50d065b6b187f361f8e456df0a0bbe-C001-32",
"50d065b6b187f361f8e456df0a0bbe-C001-33"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-51",
"50d065b6b187f361f8e456df0a0bbe-C001-52",
"50d065b6b187f361f8e456df0a0bbe-C001-53",
"50d065b6b187f361f8e456df0a0bbe-C001-54",
"50d065b6b187f361f8e456df0a0bbe-C001-55"
]
],
"cite_sentences": [
"50d065b6b187f361f8e456df0a0bbe-C001-4",
"50d065b6b187f361f8e456df0a0bbe-C001-28",
"50d065b6b187f361f8e456df0a0bbe-C001-51"
]
},
"@BACK@": {
"gold_contexts": [
[
"50d065b6b187f361f8e456df0a0bbe-C001-2",
"50d065b6b187f361f8e456df0a0bbe-C001-3",
"50d065b6b187f361f8e456df0a0bbe-C001-4",
"50d065b6b187f361f8e456df0a0bbe-C001-5"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-20",
"50d065b6b187f361f8e456df0a0bbe-C001-21",
"50d065b6b187f361f8e456df0a0bbe-C001-22",
"50d065b6b187f361f8e456df0a0bbe-C001-23",
"50d065b6b187f361f8e456df0a0bbe-C001-25",
"50d065b6b187f361f8e456df0a0bbe-C001-27",
"50d065b6b187f361f8e456df0a0bbe-C001-28",
"50d065b6b187f361f8e456df0a0bbe-C001-29",
"50d065b6b187f361f8e456df0a0bbe-C001-30",
"50d065b6b187f361f8e456df0a0bbe-C001-31",
"50d065b6b187f361f8e456df0a0bbe-C001-32",
"50d065b6b187f361f8e456df0a0bbe-C001-33"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-51",
"50d065b6b187f361f8e456df0a0bbe-C001-52",
"50d065b6b187f361f8e456df0a0bbe-C001-53",
"50d065b6b187f361f8e456df0a0bbe-C001-54",
"50d065b6b187f361f8e456df0a0bbe-C001-55"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-88",
"50d065b6b187f361f8e456df0a0bbe-C001-89",
"50d065b6b187f361f8e456df0a0bbe-C001-90"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-119",
"50d065b6b187f361f8e456df0a0bbe-C001-120",
"50d065b6b187f361f8e456df0a0bbe-C001-121",
"50d065b6b187f361f8e456df0a0bbe-C001-122"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-127"
]
],
"cite_sentences": [
"50d065b6b187f361f8e456df0a0bbe-C001-4",
"50d065b6b187f361f8e456df0a0bbe-C001-28",
"50d065b6b187f361f8e456df0a0bbe-C001-51",
"50d065b6b187f361f8e456df0a0bbe-C001-90",
"50d065b6b187f361f8e456df0a0bbe-C001-119",
"50d065b6b187f361f8e456df0a0bbe-C001-121",
"50d065b6b187f361f8e456df0a0bbe-C001-127"
]
},
"@SIM@": {
"gold_contexts": [
[
"50d065b6b187f361f8e456df0a0bbe-C001-62"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-73"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-97",
"50d065b6b187f361f8e456df0a0bbe-C001-98"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-137",
"50d065b6b187f361f8e456df0a0bbe-C001-138"
]
],
"cite_sentences": [
"50d065b6b187f361f8e456df0a0bbe-C001-62",
"50d065b6b187f361f8e456df0a0bbe-C001-73",
"50d065b6b187f361f8e456df0a0bbe-C001-97",
"50d065b6b187f361f8e456df0a0bbe-C001-138"
]
},
"@DIF@": {
"gold_contexts": [
[
"50d065b6b187f361f8e456df0a0bbe-C001-64"
],
[
"50d065b6b187f361f8e456df0a0bbe-C001-132"
]
],
"cite_sentences": [
"50d065b6b187f361f8e456df0a0bbe-C001-64",
"50d065b6b187f361f8e456df0a0bbe-C001-132"
]
},
"@USE@": {
"gold_contexts": [
[
"50d065b6b187f361f8e456df0a0bbe-C001-71",
"50d065b6b187f361f8e456df0a0bbe-C001-72"
]
],
"cite_sentences": [
"50d065b6b187f361f8e456df0a0bbe-C001-71"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"50d065b6b187f361f8e456df0a0bbe-C001-141"
]
],
"cite_sentences": [
"50d065b6b187f361f8e456df0a0bbe-C001-141"
]
}
}
},
"ABC_cb64ba694c37df9ebc1065a1deac0f_5": {
"x": [
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-2",
"text": "A number of different research subfields are concerned with the automatic assessment of student answers to comprehension questions, from language learning contexts to computer science exams."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-3",
"text": "They share the need to evaluate free-text answers but differ in task setting and grading/evaluation criteria, among others."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-4",
"text": "This paper has the intention of fostering synergy between the different research strands."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-5",
"text": "It discusses the different research strands, details the crucial differences, and explores under which circumstances systems can be compared given publicly available data."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-6",
"text": "To that end, we present results with the CoMiC-EN Content Assessment system (Meurers et al., 2011a) on the dataset published by Mohler et al. (2011) and outline what was necessary to perform this comparison."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-7",
"text": "We conclude with a general discussion on comparability and evaluation of short answer assessment systems."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-10",
"text": "Short answer assessment systems compare students' responses to questions with manually defined target responses or answer keys in order to judge the appropriateness of the responses, or in order to automatically assign a grade."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-11",
"text": "A number of approaches have emerged in recent years, each of them with different aims and different backgrounds."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-12",
"text": "In this paper, we will draw a map of the short answer assessment landscape, highlighting the similarities and differences between approaches and the data used for evaluation."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-13",
"text": "We will provide an overview of 12 systems and sketch their attributes."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-14",
"text": "Subsequently, we will zoom into the comparison of two of them, namely CoMiC-EN (Meurers et al., 2011a ) and the one which we call the Texas system (Mohler et al., 2011) and discuss the issues that arise with this endeavor. Returning to the bigger picture, we will explore how such systems could be compared in general, in the belief that meaningful comparison of approaches across research strands will be an important ingredient in advancing this relatively new research field."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-15",
"text": "2 The short answer assessment landscape"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-17",
"text": "**GENERAL ASPECTS**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-18",
"text": "Researchers from all directions have settled in the landscape of short answer assessment, each of them with different backgrounds and different goals."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-19",
"text": "In this section, we aim at providing an overview of these research villages, also hoping to construct a road network that may connect them."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-20",
"text": "Most approaches to short answer assessment are situated in an educational context."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-21",
"text": "Some focus on GCSE 1 tests, others aim at university assessment tests in the medical domain."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-22",
"text": "Another strand of approaches focuses on language teaching and learning."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-23",
"text": "All of these approaches share one theme: they assess short texts written by students."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-24",
"text": "These may be answers to questions that ask for knowledge acquired in a course, e.g., in computer science, or to reading comprehension questions in second language learning."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-25",
"text": "While thematically related, short answer assessment is different from essay grading."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-26",
"text": "Short answers are formulated by students in a much more controlled setting."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-27",
"text": "Not only are they short, they usually are supposed to contain only a few facts that answer only one question."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-28",
"text": "Another common theme of these approaches is that they compare the student answers to one or more previously defined correct answers that are either given in natural language as target answers or as a list of concepts in an answer key."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-29",
"text": "The ways of technically conducting these comparisons vary widely, as we discuss below in Section 2.2."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-30",
"text": "There also are conceptual differences between the approaches."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-31",
"text": "Some systems focus on assessing whether or not the student has properly answered the question."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-32",
"text": "They put the spot on comparing the meaning of target answers and student answers; they aim at being tolerant of form errors such as spelling or grammar errors."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-33",
"text": "Others aim at giving a grade as accurate as possible, therefore not only assessing meaning but also performing grading similar to human teachers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-34",
"text": "This can also include modules that take into account form errors."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-35",
"text": "These two views on a similar task are also reflected in the annotation of the data used in experiments: Systems performing meaning comparison usually operate with labels specifying the relations between target answers and student answers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-36",
"text": "Grading systems naturally aim at producing numerical grades."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-37",
"text": "Since labels are on a nominal scale, and grades are on an ordinal scale (or even treated as being on an interval scale), the difference between meaning comparison and grading results in a whole string of other differences in methodology."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-38",
"text": "Researchers also enter the short answer landscape from different home countries: Some projects are interested in the strategies and mechanics of meaning comparison, others aim at reducing the load and costs of large-scale assessment tests, and yet others aim at improving intelligent tutoring systems, requiring additional components that provide useful feedback to students using these systems."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-39",
"text": "Table 1 summarizes the features of the short answer assessment systems discussed hereafter."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-41",
"text": "**APPROACHES**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-42",
"text": "One of the earlier systems is WebLAS, presented by Bachman et al. (2002) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-43",
"text": "A human task creator feeds the system with scores for model answers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-44",
"text": "Regular expressions are then created automatically from these model answers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-45",
"text": "Since each regular expression is associated with a score, matching the expression against a student answer yields a score for that answer."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-46",
"text": "Bachman et al. (2002) do not provide an evaluation study based on data."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-47",
"text": "Another earlier system is CarmelTC by Ros\u00e9 et al. (2003) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-48",
"text": "It has been designed as a component in the Why2 tutorial dialogue system (VanLehn et al., 2002) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-49",
"text": "Even though Ros\u00e9 et al. (2003) position CarmelTC in the context of essay grading, it may be considered to deal with short answers: in their data, the average length of a student response is approx."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-50",
"text": "48 words."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-51",
"text": "Their system is designed to perform text classification on single sentences in the student responses, where each class of text represents one possible model response, plus an additional class for 'no match'."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-52",
"text": "They combine decision trees operating on an automatic syntactic analysis, a Naive Bayes text classifier, and a bag-of-words approach."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-53",
"text": "In a 50-fold cross validation experiment with one physics question, six classes and 126 student responses, hand-tagged by two annotators, CarmelTC reaches an F-measure value of 0.85."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-54",
"text": "They do not report on a baseline."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-55",
"text": "Concerning the quality of the gold standard, they report that conflicts in the annotation have been resolved."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-56",
"text": "C-Rater (Leacock and Chodorow, 2003 ) is based on a paraphrase recognition approach."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-57",
"text": "It employs correct answer models consisting of essential points formulated in natural language."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-58",
"text": "C-Rater aims at automatic scoring and focuses on meaning, thus tolerating form errors."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-59",
"text": "Leacock and Chodorow (2003) present two pilot studies, one of them dealing with reading comprehension."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-60",
"text": "From 16,625 student answers with an average length of 43 words, they drew a random sample of 100 answers to each of the seven questions."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-61",
"text": "This sample was scored by one human judge using a three-way scoring system (full credit, partial credit, no credit)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-62",
"text": "Their system achieved 84% agreement with the gold standard."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-63",
"text": "Information about the distribution of the scoring categories is given indirectly: A baseline system that assigns scores randomly would have achieved 47% accuracy."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-65",
"text": "**SYSTEM**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-66",
"text": "Goal Technique Domain Lang."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-67",
"text": "WebLAS (Bachman et al., 2002) Assessment of language ability"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-68",
"text": "Auto-generated regular expressions Foreign language teaching EN CarmelTC (Ros\u00e9 et al., 2003) Automatic grading Text classification Physics EN C-Rater (Leacock and Chodorow, 2003) Assessment test Paraphrase recognition Mathematics, Reading comp."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-70",
"text": "**EN**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-71",
"text": "IAT (Mitchell et al., 2003) Assessment, Automatic grading"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-72",
"text": "Information extraction w/ handwritten patterns"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-74",
"text": "**MEDICAL EN**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-75",
"text": "Oxford (Pulman and Sukkarieh, 2005) Assessment, automatic grading"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-76",
"text": "Information extraction w/ handwritten patterns GCSE exams EN Atenea (P\u00e9rez et al., 2005) Automatic grading N-gram overlap, Latent Semantic Analysis"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-77",
"text": "Computer science ES Logic-based System (Makatchev and VanLehn, 2007) Meaning comparison First-order logic, machine learning Physics EN CAM (Bailey and Meurers, 2008) , CoMiC-EN (Meurers et al., 2011a) Meaning comparison Alignment, machine learning"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-78",
"text": "Reading comp."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-79",
"text": "in foreign language EN Facets System (Nielsen et al., 2009) Information extraction templates form the core of the Intelligent Assessment Technologies system (IAT, Mitchell et al. 2003) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-80",
"text": "These templates are created manually in a special-purpose authoring tool by exploring sample responses."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-81",
"text": "They allow for syntactic variation, e.g., filling the subject slot in a sentence with different equivalent concepts."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-82",
"text": "The templates corresponding to a question are then matched against the student answer."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-83",
"text": "Unlike other systems, IAT additionally features templates for explicitly invalid answers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-84",
"text": "They tested their approach with a progress test that has to be taken by medicine students."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-85",
"text": "Approximately 800 students each plowed through 270 test items."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-86",
"text": "The automatically graded responses then were moderated: Human judges streamlined the answers to achieve a more consistent grading."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-87",
"text": "This step already had been done before with tests graded by humans."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-88",
"text": "Mitchell et al. (2003) state that their system reaches 99.4% accuracy on the full dataset after the manual adjustment of the templates via the moderation process."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-89",
"text": "Summarizing, they report an error of \"between 5 and 5.5%\" in inter-grader agreement and an error of 5.8% in automatic grading without the moderation step, though it is not entirely clear which data these statistics correspond to."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-90",
"text": "No information on the distribution of grades or a random baseline is provided."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-91",
"text": "The Oxford system (Pulman and Sukkarieh, 2005) is another one to employ an information extraction approach."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-92",
"text": "Again, templates are constructed manually."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-93",
"text": "Motivated by the necessary robustness to process language with grammar mistakes and spelling errors, they use shallow analyses in their pre-processing."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-94",
"text": "In order to overcome the hassle of manually constructing templates, they also investigated machine learning techniques."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-95",
"text": "However, the automatically generated templates were outperformed by the manually created ones."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-96",
"text": "Furthermore, they state that manually created templates can be equipped with messages provided to the student as feedback in a tutoring system."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-97",
"text": "For evaluating their system, they used factual science questions and the corresponding student answers from GCSE tests."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-98",
"text": "200 graded answers for each of nine questions served as a training set, while another 60 answers served as a test set."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-99",
"text": "They report that their system achieves an accuracy of 84%."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-100",
"text": "With inconsistencies in the human grading removed, it achieves 93%."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-101",
"text": "However, they do not report on the level of inter-grader agreement or on a random baseline."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-102",
"text": "P\u00e9rez et al. (2005) present the Atenea system, a combined approach that makes use of Latent Semantic Analysis (LSA, Landauer et al. 1998 ) and n-gram overlap."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-103",
"text": "While n-gram overlap supports comparing target responses and student responses with differing word order, it does not deal with synonyms and related terms."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-104",
"text": "Hence, they use LSA to add a component that deals with semantic relatedness in the comparison step."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-105",
"text": "As a test corpus, they collected nine different questions from computer science exams."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-106",
"text": "A tenth question \"[consists] of a set of definitions of 'Operating System' obtained from the Internet.\" Altogether, they gathered 924 student responses and 44 target responses written by teachers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-107",
"text": "Since their LSA module had been trained on English but their data were in Spanish, they chose to use Altavista Babelfish to translate the data into English."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-108",
"text": "They do not provide information about the distribution of scores and about inter-grader agreement."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-109",
"text": "Atenea achieves a Pearson's correlation of r = 0.554 with the scores in the gold standard."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-110",
"text": "The approach by Makatchev and VanLehn (2007) , which we refer to as the Logic-based System, enters the landscape from the direction of artificial intelligence."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-111",
"text": "It is related to CarmelTC and its dataset, but follows a different route: target responses are manually encoded in first-order predicate language."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-112",
"text": "Similar logic representations are constructed automatically for student answers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-113",
"text": "They explore various strategies for matching these two logic representations on the basis of 16 semantic classes."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-114",
"text": "In an evaluation experiment, they tested the system on 293 \"natural language utterances\" with ten-fold cross validation."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-115",
"text": "The test data are skewed towards the 'empty' label that indicates that none of the 16 semantic labels could be attached."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-116",
"text": "They do not report on other properties of the dataset such as number of annotators or number of questions to which the student answers were given."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-117",
"text": "Their winning configuration yields an F-measure of 0.4974."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-118",
"text": "While Makatchev and VanLehn (2007) position their approach in the context of the Why2 tutorial dialogue system, their use of semantic classes seems to make them more related to meaning comparison than to grading."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-119",
"text": "The Content Assessment Module (CAM) presented in Bailey (2008) and Bailey and Meurers (2008) utilizes an approach that is different from the systems discussed so far: Following a three-step strategy, the system first automatically generates linguistic annotations for questions, target responses and student responses."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-120",
"text": "In an alignment phase, these annotations are then used to map from elements (words, lemmas, chunks, dependency triples) in the student responses to elements in the target responses."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-121",
"text": "Finally, a machine learning classifier judges on the basis of this alignment, whether or not the student has answered the question correctly."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-122",
"text": "The data used for evaluation was made available as the Corpus of Reading Comprehension Exercises in English (CREE, Meurers et al. 2011a )."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-123",
"text": "This corpus consists of 566 responses produced by intermediate ESL learners at The Ohio State University as part of their regular assignments."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-124",
"text": "Students had access to their textbooks and typically answered questions in one to three sentences."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-125",
"text": "All responses were labelled as either appropriate or inappropriate by two independent annotators, along with a detailed diagnosis code specifying the nature of the inappropriateness (missing concept, extra concept, blend, non-answer)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-126",
"text": "In leave-one-out evaluation on the development set containing 311 responses to 47 different questions, CAM achieved 87% accuracy on the binary judgment (response correct/incorrect)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-127",
"text": "For the test set containing 255 responses to 28 questions, the approach achieved 88%."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-128",
"text": "However, the distribution of categories in the data is heavily skewed with 71% of the responses marked as correct in the development set and 84% in the test set."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-129",
"text": "The best result obtained on a balanced set with leave-one-out-testing is 78%."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-130",
"text": "Meurers et al. (2011a) relate the short answer assessment task to Recognizing Textual Entailment (RTE, Dagan et al. 2009)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-131",
"text": "In a number of friendly challenges, RTE research has spawned numerous systems that try to automatically answer the following question: Given a text and a hypothesis, is the hypothesis entailed by the text? Short answers assessment can be seen as a RTE task in which the target response corresponds to the text and the student response to the hypothesis."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-132",
"text": "Nielsen et al. (2009) base their system on what they call facets."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-133",
"text": "These facets are meaning representations of parts of sentences."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-134",
"text": "They are constructed automatically from dependency and semantic parses of the target responses."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-135",
"text": "Each facet in the target response is then looked up in the corresponding student response and equipped with one of five labels 2 ranging from unaddressed (the student did not mention the fact in this facet) to expressed (the student named the fact)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-136",
"text": "This step is taken via machine learning."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-137",
"text": "From a tutoring system in real-life operation, they gathered responses from third- to sixth-grade students answering questions for science classes."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-138",
"text": "Two annotators worked on these data, producing 142,151 facets."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-139",
"text": "Furthermore, all facets were looked up in the corresponding student responses and annotated accordingly, using the mentioned set of labels."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-140",
"text": "The best result of the Facets System is 75.5% accuracy on one of the held-out test sets."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-141",
"text": "With ten-fold cross validation on the training set, it achieves 77.1% accuracy."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-142",
"text": "The majority label baselines are 51.1% and 54.6% respectively."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-143",
"text": "Providing this more fine-grained analysis of facets that are searched for in student responses, Nielsen et al. (2009) claim to \"enable more intelligent dialogue control\" in tutoring systems."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-144",
"text": "From the point of view of grading vs. meaning comparison, their approach can be counted towards the latter, since their labels can be conflated to produce a single yes/no decision."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-145",
"text": "Another recent approach is described by Mohler et al. (2011) , hereafter referred to as the Texas system."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-146",
"text": "Student responses and target responses are annotated using a dependency parser."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-147",
"text": "Thereupon, subgraphs of the dependency structures are constructed in order to map one response to the other."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-148",
"text": "These alignments are generated using machine learning."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-149",
"text": "Dealing with subgraphs allows for variation in word order between the two responses that are to be compared."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-150",
"text": "In order to account for meaning, they combine lexical semantic similarity with the aforementioned alignment."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-151",
"text": "They make use of several WordNet-based measures and two corpus-based measures, namely Latent Semantic Analysis and Explicit Semantic Analysis (ESA, Gabrilovich and Markovitch 2007) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-152",
"text": "For evaluating their system, Mohler et al. (2011) collected student responses from an online learning environment."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-153",
"text": "80 questions from ten introductory computer science assignments spread across two exams were gathered together with 2,273 student responses."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-154",
"text": "These responses were graded by two human judges on a scale from zero to five."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-155",
"text": "The judges fully agreed in 57% of all cases, and their Pearson correlation amounts to r = 0.586."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-156",
"text": "The gold standard has been created by computing the arithmetic mean of the two judgments for each response."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-157",
"text": "The Texas system achieves r = 0.518 and a Root Mean Square Error of 0.978 as its best result."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-158",
"text": "Mohler et al. (2011) mention that \"[t]he dataset is biased towards correct answers\"."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-159",
"text": "Data are publicly available."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-160",
"text": "We used these in an evaluation experiment with the CoMiC-EN system, discussed in Section 3."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-161",
"text": "While almost all short answer assessment research has targeted answers written in English, there are two recent approaches dealing with German answers."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-162",
"text": "The CoMiC-EN reimplementation of CAM discussed above was motivated by the need for a modular architecture supporting a transfer of the system to German, resulting in its counterpart named CoMiC-DE (Meurers et al., 2011b) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-163",
"text": "The German system utilizes the same strategies as the English one, but with language-dependent processing modules being replaced."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-164",
"text": "A second approach to German data, presented by Hahn and Meurers (2012), builds on Lexical Resource Semantics (LRS, Richter and Sailer 2003)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-165",
"text": "In a first step, they create LRS representations from POS-tagged and dependency-parsed data."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-166",
"text": "These underspecified LRS representations of student responses and target responses are then aligned."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-167",
"text": "Using A* as heuristic search algorithm, a best alignment is computed and equipped with a numeric score representing the quality of the alignment of the formulae."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-168",
"text": "If this best alignment scores higher than a threshold, the system judges student response and target response to convey the same meaning."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-169",
"text": "The alignment and comparison mechanism does not utilize any linguistic representations other than the LRS semantic formulae."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-170",
"text": "These semantic representations abstract away from surface features, e.g., by treating active and passive voice equally."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-171",
"text": "Hahn and Meurers (2012) claim that \"[semantic representations] more clearly expose those distinctions which do make a difference in meaning.\" They evaluate the approach on the above-mentioned subset of CREG containing 1,032 learner responses and report an accuracy of 86.3%."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-173",
"text": "**A CONCRETE SYSTEM COMPARISON**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-174",
"text": "After discussing the broad landscape of short answer evaluation systems with their main characteristics and differences, we now turn to a comparison of two concrete systems, namely CoMiC-EN (Meurers et al., 2011a) and the Texas system (Mohler et al., 2011), to explore what is involved in such a concrete comparison of two systems from different contexts."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-175",
"text": "While CoMiC-EN was developed with meaning comparison in mind, the purpose of the Texas system is answer grading."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-176",
"text": "We pick these two systems because they constitute recent and interesting instances of their respective fields and the corresponding data are freely available."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-178",
"text": "**DATA**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-179",
"text": "In evaluating the Texas system, Mohler et al. (2011) used a corpus of ten assignments and two exams from an introductory computer science class."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-180",
"text": "In total, the Texas corpus consists of 2,442 responses, which were collected using an online learning platform."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-181",
"text": "Each response is rated by two annotators with a numerical grade on a 0-5 scale."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-182",
"text": "Annotators were not given any specific instructions besides the scale itself, which resulted in an exact agreement of 57.7%."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-183",
"text": "In order to arrive at a gold standard rating, the numerical average of the two ratings was computed."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-184",
"text": "The data exist in raw, sentence-segmented and parsed versions and are freely available for research use."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-185",
"text": "Table 2 presents a breakdown of the score counts and distribution statistics of the Texas corpus."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-186",
"text": "A bias towards correct answers can be observed, which is also mentioned by Mohler et al. (2011) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-187",
"text": "Table 2 : Details on the gold standard scores in the Texas corpus."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-188",
"text": "Non-integer scores result from averaging between raters and normalization onto the 0-5 scale."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-190",
"text": "**APPROACHES**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-191",
"text": "CoMiC-EN uses a three-step approach to meaning comparison."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-192",
"text": "Annotation uses NLP to enrich the student and target answers, as well as the question text, with linguistic information on different levels (words, chunks, dependency triples) and types of abstraction (tokens, lemmas, distributional vectors, etc.) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-193",
"text": "Alignment maps elements of the learner answer to elements of the target response using these annotations."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-194",
"text": "The global alignment solution is computed using the Traditional Marriage Algorithm (Gale and Shapley, 1962) ."
},
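The global alignment via the Traditional Marriage (Gale-Shapley) Algorithm can be sketched as follows. This is a simplified, one-sided illustration, not the CoMiC-EN implementation: both sides rank candidates by the same hypothetical score matrix, and learner-answer elements "propose" to target-answer elements:

```python
# Minimal Gale-Shapley sketch of the global alignment step.
# scores[i][j] is a hypothetical alignment score between learner
# element i and target element j; both sides prefer higher scores.
def stable_alignment(scores):
    """Return a stable one-to-one mapping learner index -> target index."""
    n_l, n_t = len(scores), len(scores[0])
    # Each learner element proposes to targets in order of descending score.
    prefs = {i: sorted(range(n_t), key=lambda j: -scores[i][j])
             for i in range(n_l)}
    next_prop = {i: 0 for i in range(n_l)}  # next target each i proposes to
    engaged = {}                            # target j -> learner i
    free = list(range(n_l))
    while free:
        i = free.pop()
        if next_prop[i] >= n_t:
            continue                        # i exhausted all targets
        j = prefs[i][next_prop[i]]
        next_prop[i] += 1
        if j not in engaged:
            engaged[j] = i
        elif scores[engaged[j]][j] < scores[i][j]:
            free.append(engaged[j])         # j prefers i; bump the rival
            engaged[j] = i
        else:
            free.append(i)                  # i rejected; will try next target
    return {i: j for j, i in engaged.items()}

scores = [[0.9, 0.1, 0.3],
          [0.8, 0.7, 0.2],
          [0.1, 0.2, 0.6]]
print(stable_alignment(scores))
```

The stability property matters here: no learner-target pair would both prefer each other over their assigned partners, so locally greedy matches cannot undermine the global solution.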
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-195",
"text": "Finally, Classification analyzes the possible alignments and labels the learner response with a binary or detailed diagnosis code."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-196",
"text": "The features used in the classification step include Token Match, the percent of token alignments that were token-identical."
            },
            {
                "sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-197",
                "text": "Similarity Match is the percent of token alignments that were similarity-resolved."
            },
            {
                "sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-198",
                "text": "Type Match is the percent of token alignments that were type-resolved."
            },
            {
                "sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-199",
                "text": "Lemma Match is the percent of token alignments that were lemma-resolved."
            },
            {
                "sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-200",
                "text": "Synonym Match is the percent of token alignments that were synonym-resolved."
            },
            {
                "sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-201",
                "text": "Variety of Match is the number of kinds (0-5) of token-level alignments. The Texas system, in turn, combines bag-of-words (BOW) features with dependency graph alignment in connection with two different machine learning approaches."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-202",
"text": "Among the BOW features are WordNet-based similarity measures such as the one by Lesk (1986) and vector space measures such as tf * idf (Salton and McGill, 1983 ) and the more advanced LSA (Landauer et al., 1998) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-203",
"text": "The dependency graph alignment approach builds on a node-to-node matching stage which computes a score for each possible match between nodes of the student and target response."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-204",
"text": "In the next stage, the optimal graph alignment is computed based on the node-to-node scores using the Hungarian algorithm."
},
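The graph-alignment stage described above can be sketched with the Hungarian algorithm as implemented in SciPy; the node-to-node score matrix here is hypothetical, standing in for the scores the system computes between dependency nodes:

```python
# Sketch: optimal one-to-one assignment between student- and target-response
# dependency nodes, given pairwise node match scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

node_scores = np.array([
    [0.9, 0.2, 0.1],   # student node 0 vs. target nodes 0..2
    [0.3, 0.8, 0.4],
    [0.1, 0.3, 0.7],
])

# linear_sum_assignment minimizes total cost, so negate to maximize score.
rows, cols = linear_sum_assignment(-node_scores)
alignment = dict(zip(rows.tolist(), cols.tolist()))
total = node_scores[rows, cols].sum()
print(alignment, total)
```

Unlike greedy matching, the Hungarian algorithm guarantees the globally score-maximal assignment, which is why it is a natural fit for this stage.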
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-205",
"text": "Mohler et al. (2011) also employ a technique they call \"question demoting\", which refers to the exclusion of words from the alignment process if they already appeared in the question string."
},
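Question demoting itself is a simple filter; a minimal sketch (the function name and example are invented, not taken from either system):

```python
# Drop response tokens that already occur in the question, so that merely
# repeating question material earns no alignment credit.
def demote_question_words(response_tokens, question_tokens):
    given = {t.lower() for t in question_tokens}
    return [t for t in response_tokens if t.lower() not in given]

q = "What does the scheduler do ?".split()
r = "The scheduler selects the next process".split()
print(demote_question_words(r, q))
```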
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-206",
"text": "Incidentally, the technique is also used in the earlier CAM system (Bailey and Meurers, 2008) , but called \"Givenness filter\" there, following the long tradition of research on givenness (Schwarzschild, 1999) as a notion of information structure investigated in formal pragmatics."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-207",
"text": "To produce the final system score, the Texas system uses two machine learning techniques based on Support Vector Machines (SVMs), SVMRank and Support Vector Regression (SVR)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-208",
"text": "Both techniques are trained with several combinations of the dependency alignment and BOW features."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-209",
"text": "While SVR trains a function to predict a score on the 0-5 scale itself, SVMRank produces a ranking of student answers rather than a 0-5 grade."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-210",
"text": "Therefore, Mohler et al. (2011) employ isotonic regression to map the ranking to the 0-5 scale."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-211",
"text": "In terms of performance, Mohler et al. (2011) report that the SVMRank system produces a better correlation measure (r = 0.518) while the SVR system yields a better RMSE (0.978)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-213",
"text": "**EVALUATION**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-214",
"text": "We now turn to the evaluation of CoMiC-EN on the Texas corpus as it is a publicly available dataset."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-215",
"text": "As mentioned before, CoMiC-EN performs meaning comparison based on a system of categories while the Texas system is a scoring approach, trying to predict a grade."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-216",
"text": "While the former is a classification task, the latter is better characterized as a regression problem because of the desired numerical outcome."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-217",
"text": "Of course, one could simply pretend that individual grades are classes and treat scoring as a classification task."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-218",
"text": "However, a classification approach has no knowledge of numerical relationships, i.e., it does not 'know' that 4 is a higher grade than 3 and a much higher grade than 1 (assuming a 0-5 scale)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-219",
"text": "As a result, if an evaluation metric such as Pearson correlation is used, classification systems are at a disadvantage because some misclassifications are punished more than others."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-220",
"text": "We discuss this point further in Section 4."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-221",
"text": "For these reasons, to obtain a more interesting comparison, we modified CoMiC-EN to perform scoring instead of meaning comparison."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-222",
"text": "This means that the memory-based learning approach CoMiC-EN had employed so far was no longer applicable and had to be replaced with a regression-capable learning strategy."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-223",
"text": "We chose Support Vector Regression (SVR) using libSVM 4 since that is one of the methods employed by Mohler et al. (2011) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-224",
"text": "However, all other parts of CoMiC-EN, such as the processing pipeline, the alignment approach, and the extracted features, remained the same."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-225",
"text": "The evaluation procedure was carried out as a 12-fold cross-validation due to the 12 assignments in the Texas corpus."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-226",
"text": "For each fold, one complete assignment was held out as test set."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-227",
"text": "Parameters for the SVR were determined using a grid search using the tools provided with libSVM."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-228",
"text": "As kernel function, we used a linear kernel as it was also used in the evaluation of the Texas system and thus constitutes a vital part of the evaluation setup."
},
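The SVR setup just described (linear kernel, parameters found by grid search) can be sketched with scikit-learn rather than the libSVM command-line tools; the feature matrix and grades below are synthetic toy data, not the Texas corpus:

```python
# Sketch: linear-kernel SVR with grid search over C and epsilon.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((60, 5))                    # 60 responses, 5 alignment features
w = np.array([1.0, 2.0, 0.5, 1.5, 0.0])   # toy weights for fake grades
y = np.clip(X @ w, 0.0, 5.0)              # toy gold grades on the 0-5 scale

grid = GridSearchCV(
    SVR(kernel="linear"),
    param_grid={"C": [0.1, 1, 10], "epsilon": [0.01, 0.1, 0.5]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```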
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-229",
"text": "In general, we designed the evaluation procedure to be as close as possible to the Texas one."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-230",
"text": "The CoMiC-EN system on the Texas data set does not quite reach the level achieved by the Texas system on their data set."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-231",
"text": "We obtained a Pearson correlation of r = 0.405 and an RMSE of 1.016 over all 12 folds."
},
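The two figures reported above, Pearson correlation and RMSE, can be computed directly; the predicted and gold scores here are toy values, not the actual experimental data:

```python
# Pearson correlation and root mean square error between gold and
# predicted scores, implemented from their textbook definitions.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

gold = [5.0, 4.5, 3.0, 2.0, 0.0]
pred = [4.5, 4.0, 3.5, 1.0, 0.5]
print(pearson_r(gold, pred), rmse(gold, pred))
```

Note the complementary nature of the two metrics: Pearson r rewards getting the ordering and relative spacing of grades right, while RMSE penalizes the absolute size of each error.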
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-232",
"text": "However, let us keep in mind that the objective of this experiment was to exemplify the process needed to directly compare two systems from different research strands on the same dataset."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-233",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-234",
"text": "**COMPARABILITY OF APPROACHES & DATASETS**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-235",
"text": "It seems clear that for systems to be comparable and results to be reproducible, datasets must be publicly available, as is the case with the Texas corpus."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-236",
"text": "However, data availability alone does not ensure meaningful comparison."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-237",
"text": "Depending on the context the corpus was drawn from, datasets will differ just like the corresponding systems:"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-238",
"text": "\u2022 Data source: Reading comprehension task in language learning setting, language tutoring context, automated grading of short answer exams"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-239",
"text": "\u2022 Language properties: Native vs. learner language, domain-specific language (e.g., computer science)"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-240",
"text": "\u2022 Assessment scheme: nominal vs. interval scale Especially the last point deserves some further discussion."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-241",
"text": "Depending on the kind of assessment scheme, which in turn is motivated by the task, different evaluation methods may be chosen."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-242",
"text": "Scoring systems are often evaluated using a correlation metric in order to capture the systems' tendency to assign grades similar, but not necessarily equal, to those of the human raters."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-243",
"text": "Conversely, with category-based schemes one usually reports accuracy, which expresses how many items were classified correctly."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-244",
"text": "The question that arises is how a system coming from one paradigm can be compared to one from the other paradigm in a meaningful way."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-245",
"text": "One might argue that the tasks are simply too different: scoring might take form errors into account while meaning comparison by definition does not."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-246",
"text": "Moreover, while classification labels say something explicit and absolute about a piece of data, grades by definition are relative to the scale they come from."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-247",
"text": "It thus seems impossible to somehow unify the two schemes as they express fundamentally different ideas."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-248",
"text": "However, the strategies systems use to tackle scoring or meaning comparison are undoubtedly similar and should be comparable, as we argue in this paper."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-249",
"text": "So in order for researchers to learn from other approaches and also compare their results to those of other systems which tackle a different task, changes to systems seem necessary and should be preferred over changes to the gold standard data."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-250",
"text": "In the case presented here, a meaning comparison system was turned into a scoring system by changing the machine learning component from classification to regression, which requires a certain level of system modularity."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-251",
"text": "Having compared the two systems using Pearson correlation and RMSE, it also makes sense to consider the relevance of these evaluation metrics."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-252",
"text": "For example, Pearson correlation assumes normally distributed data, whereas datasets like the Texas corpus are heavily skewed towards correct answers (see Table 2)."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-253",
"text": "Mohler et al. (2011) also note that in distributions with zero variance, correlation is undefined, which is not a problem as such but limits the use of correlation as evaluation metric."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-254",
"text": "Mohler et al. (2011) propose that RMSE is better suited to the task since it captures the relative error a system makes when trying to predict scores."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-255",
"text": "However, RMSE is scale-dependent and thus RMSE values across different studies cannot be compared."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-256",
"text": "We can only suggest that in order to sufficiently describe a system's performance, several metrics need to be reported."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-257",
"text": "Finally, an important point concerns the quality of gold standards."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-258",
"text": "Given the relatively low inter-annotator agreement in the Texas corpus (r = 0.586, RMSE = 0.659) it seems fair to ask whether answers without perfect agreement should be used in training and testing systems at all."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-259",
"text": "In the CREE and CREG corpora, answers with disagreement among the annotators have either been excluded from experiments or resolved by an additional judge."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-260",
"text": "This approach is also supported by recent literature (cf., e.g., )."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-261",
"text": "However, for the Texas corpus, Mohler et al. (2011) have opted to use the arithmetic mean of the two graders as gold standard."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-262",
"text": "While mathematically a viable solution, it seems questionable whether the mean is reliable with only two graders, especially if they have not operated on the grounds of explicit guidelines."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-263",
"text": "It would be interesting to see whether in this case, a system trained on more, singly annotated data would perform better than one on less, doubly annotated data, as argued for by Dligach et al. (2010) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-264",
"text": "In any case, if many disagreements occur, one should ask the question whether the annotation task is defined well enough and whether machines should really be expected to perform it consistently if humans have trouble doing so."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-265",
"text": "----------------------------------"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-266",
"text": "**CONCLUSION**"
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-267",
"text": "We discussed several issues in the comparison of short answer evaluation systems."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-268",
"text": "To that end, we gave an overview of the existing systems and picked two for a concrete comparison on the same data, the CoMiC-EN system (Meurers et al., 2011a ) and the Texas system (Mohler et al., 2011) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-269",
"text": "In comparing the two, it was necessary to turn CoMiC-EN into a scoring system because the Texas corpus as the chosen gold standard contains numeric scores assigned by humans."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-270",
"text": "Taking a step back from the concrete comparison, we gave a more general description of what is necessary to compare short answer evaluation systems."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-271",
"text": "We observed that more datasets need to be publicly available in order for performance comparisons to have meaning, a point also made earlier by Pulman and Sukkarieh (2005) ."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-272",
"text": "Moreover, we noted how datasets differ in similar aspects as systems do, such as task context and assessment scheme."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-273",
"text": "We then criticized the use of correlation measures as evaluation metrics for short answer scoring."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-274",
"text": "Finally, we discussed the importance of gold standard quality."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-275",
"text": "We conclude that it is interesting and relevant to compare short answer evaluation systems even if the concrete task they tackle, such as grading or meaning comparison, is not the same."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-276",
"text": "However, the availability and quality of the datasets will decide to what extent systems can sensibly be compared."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-277",
"text": "For progress to be made in this area, more publicly available datasets and systems are needed."
},
{
"sent_id": "cb64ba694c37df9ebc1065a1deac0f-C001-278",
"text": "The upcoming SemEval-2013 task on \"Textual entailment and paraphrasing for student input assessment\" 5 will hopefully become one important step into this direction (see also Dzikovska et al. 2012) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"cb64ba694c37df9ebc1065a1deac0f-C001-4",
"cb64ba694c37df9ebc1065a1deac0f-C001-5",
"cb64ba694c37df9ebc1065a1deac0f-C001-6"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-14"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-145",
"cb64ba694c37df9ebc1065a1deac0f-C001-146",
"cb64ba694c37df9ebc1065a1deac0f-C001-147",
"cb64ba694c37df9ebc1065a1deac0f-C001-148",
"cb64ba694c37df9ebc1065a1deac0f-C001-149",
"cb64ba694c37df9ebc1065a1deac0f-C001-150",
"cb64ba694c37df9ebc1065a1deac0f-C001-151"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-152",
"cb64ba694c37df9ebc1065a1deac0f-C001-153",
"cb64ba694c37df9ebc1065a1deac0f-C001-154",
"cb64ba694c37df9ebc1065a1deac0f-C001-155",
"cb64ba694c37df9ebc1065a1deac0f-C001-156",
"cb64ba694c37df9ebc1065a1deac0f-C001-157"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-158",
"cb64ba694c37df9ebc1065a1deac0f-C001-159",
"cb64ba694c37df9ebc1065a1deac0f-C001-160"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-174",
"cb64ba694c37df9ebc1065a1deac0f-C001-175",
"cb64ba694c37df9ebc1065a1deac0f-C001-176"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-179",
"cb64ba694c37df9ebc1065a1deac0f-C001-180",
"cb64ba694c37df9ebc1065a1deac0f-C001-181",
"cb64ba694c37df9ebc1065a1deac0f-C001-182",
"cb64ba694c37df9ebc1065a1deac0f-C001-183",
"cb64ba694c37df9ebc1065a1deac0f-C001-184",
"cb64ba694c37df9ebc1065a1deac0f-C001-185"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-186",
"cb64ba694c37df9ebc1065a1deac0f-C001-187",
"cb64ba694c37df9ebc1065a1deac0f-C001-188"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-204",
"cb64ba694c37df9ebc1065a1deac0f-C001-205"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-210",
"cb64ba694c37df9ebc1065a1deac0f-C001-211"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-253"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-254",
"cb64ba694c37df9ebc1065a1deac0f-C001-255",
"cb64ba694c37df9ebc1065a1deac0f-C001-256"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-268",
"cb64ba694c37df9ebc1065a1deac0f-C001-269",
"cb64ba694c37df9ebc1065a1deac0f-C001-270"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-207",
"cb64ba694c37df9ebc1065a1deac0f-C001-208"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-215",
"cb64ba694c37df9ebc1065a1deac0f-C001-216",
"cb64ba694c37df9ebc1065a1deac0f-C001-217",
"cb64ba694c37df9ebc1065a1deac0f-C001-218",
"cb64ba694c37df9ebc1065a1deac0f-C001-219"
]
],
"cite_sentences": [
"cb64ba694c37df9ebc1065a1deac0f-C001-6",
"cb64ba694c37df9ebc1065a1deac0f-C001-14",
"cb64ba694c37df9ebc1065a1deac0f-C001-145",
"cb64ba694c37df9ebc1065a1deac0f-C001-152",
"cb64ba694c37df9ebc1065a1deac0f-C001-174",
"cb64ba694c37df9ebc1065a1deac0f-C001-179",
"cb64ba694c37df9ebc1065a1deac0f-C001-186",
"cb64ba694c37df9ebc1065a1deac0f-C001-210",
"cb64ba694c37df9ebc1065a1deac0f-C001-211",
"cb64ba694c37df9ebc1065a1deac0f-C001-268"
]
},
"@USE@": {
"gold_contexts": [
[
"cb64ba694c37df9ebc1065a1deac0f-C001-4",
"cb64ba694c37df9ebc1065a1deac0f-C001-5",
"cb64ba694c37df9ebc1065a1deac0f-C001-6"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-158",
"cb64ba694c37df9ebc1065a1deac0f-C001-159",
"cb64ba694c37df9ebc1065a1deac0f-C001-160"
]
],
"cite_sentences": [
"cb64ba694c37df9ebc1065a1deac0f-C001-6"
]
},
"@MOT@": {
"gold_contexts": [
[
"cb64ba694c37df9ebc1065a1deac0f-C001-14"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-186",
"cb64ba694c37df9ebc1065a1deac0f-C001-187",
"cb64ba694c37df9ebc1065a1deac0f-C001-188"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-253"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-254",
"cb64ba694c37df9ebc1065a1deac0f-C001-255",
"cb64ba694c37df9ebc1065a1deac0f-C001-256"
],
[
"cb64ba694c37df9ebc1065a1deac0f-C001-215",
"cb64ba694c37df9ebc1065a1deac0f-C001-216",
"cb64ba694c37df9ebc1065a1deac0f-C001-217",
"cb64ba694c37df9ebc1065a1deac0f-C001-218",
"cb64ba694c37df9ebc1065a1deac0f-C001-219"
]
],
"cite_sentences": [
"cb64ba694c37df9ebc1065a1deac0f-C001-14",
"cb64ba694c37df9ebc1065a1deac0f-C001-186"
]
},
"@EXT@": {
"gold_contexts": [
[
"cb64ba694c37df9ebc1065a1deac0f-C001-221",
"cb64ba694c37df9ebc1065a1deac0f-C001-222",
"cb64ba694c37df9ebc1065a1deac0f-C001-223"
]
],
"cite_sentences": [
"cb64ba694c37df9ebc1065a1deac0f-C001-223"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"cb64ba694c37df9ebc1065a1deac0f-C001-261",
"cb64ba694c37df9ebc1065a1deac0f-C001-262",
"cb64ba694c37df9ebc1065a1deac0f-C001-263"
]
],
"cite_sentences": [
"cb64ba694c37df9ebc1065a1deac0f-C001-261"
]
}
}
},
"ABC_46a23364b7bc51493d83f874a824ad_5": {
"x": [
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-2",
"text": "Full recovery of argument structure information for question answering or information extraction requires that parsers can analyse long-distance dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-3",
"text": "Previous work on statistical dependency parsing has used post-processing or additional training data to tackle this complex problem."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-4",
"text": "We evaluate an alternative approach to recovering long-distance dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-5",
"text": "This approach uses a two-level parsing model to recover both grammatical dependencies, such as subject and object, and full argument structure."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-6",
"text": "We show that this two-level approach is competitive, while also providing useful semantic role information."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-9",
"text": "One of the main motivations for adopting dependency representations in the parsing and computational linguistics community is their direct expression of the lexical-semantic properties of words and their relations."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-10",
"text": "Argument structure is the representation of the argument taking properties of a predicate."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-11",
"text": "It represents those semantic properties of a predicate that are expressed grammatically."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-12",
"text": "It is usually defined as the specification of the arity of the predicate, its grammatical functions and the substantive labels of the arguments in the structure, what are usually called thematic or semantic roles."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-13",
"text": "For example the argument structure of the verb hit comprises the specification that hit is a transitive verb and that it takes an AGENT subject and a THEME object."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-14",
"text": "Constructions involving long-distance dependencies (LDDs) -such as questions, or relative clauses -are the stress test of the ability to represent argument structure, because in these constructions argument structure information is not directly reflected in the surface order of the sentence."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-15",
"text": "Despite the complexity of their representation, Rimell et al. (2009) report that these constructions cover roughly ten percent of the data in a corpus such as the PennTreebank, and therefore cannot be ignored."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-16",
"text": "LDDs are illustrated in Figure 1."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-17",
"text": "Representing argument structure in longdistance dependency constructions, thus, requires special mechanisms to deal with the divergence between the argument taking properties of the verb and the surface order of the sentence."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-18",
"text": "The most frequently used ways to encode long-distance dependencies is either by a copy mechanism, shown in Figure 1 , or by turning the tree into a directed graph, shown in Figure 2 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-19",
"text": "1 Many current statistical dependency parsers fail to represent many long-distance dependencies and their related argument structure directly, often because the relevant information, such as traces, has been stripped from the training data."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-20",
"text": "For example, most current statistical parsers do not represent directly the links drawn below the sentences in Figure 2 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-21",
"text": "Moreover, there is no attempt in these representations, to encode the full argument structure directly, as the semantic role labels are usually inferred from their correlation with the grammatical function labels, but not explicitly represented."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-22",
"text": "The argument structure of the verb spread in the first sentence in Figure 2 comprises a THEME subject in the intransitive form of the verb."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-23",
"text": "This argument structure must be inferred indirectly from the graph: first the long-distance nsubj relation must be inferred from a sequence of links typical of subject extraction from an embedded clause."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-24",
"text": "Moreover, the notion that verbs like spread take THEME subjects in some, but not all cases, is not represented, and therefore the argument structure cannot be, strictly speaking, fully recovered."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-25",
"text": "These parsers can recover the long-distance dependency only through a post-processing step, which recovers the information about predicateargument relation and the grammatical function."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-26",
"text": "The semantic role label is usually not recovered even in post-processing."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-27",
"text": "We investigate here, then, the hypothesis whether current two-level syntactic-semantic parsers can fill in for the missing information, and recover the long-distance and argument structure information during parsing without need for postprocessing and without loss in performance."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-28",
"text": "If this were possible, we would be able to produce longdistance dependencies with more direct and perspicuous representations, and also fill in some of the semantic information currently missing from argument structure representations."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-29",
"text": "It is important to recall that the reason why predicate-argument structure is considered central for NLP applications hinges on the assumption that what needs recovering is the lexical semantics content."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-30",
"text": "For example, it is likely that for information extraction, it is more useful to know which are the manner, temporal and location arguments than to know an underspecified adverbial modifier label."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-31",
"text": "In the rest of the paper, then, we will first contrast the one-level representation of long-distance dependencies to a two-level representation, where grammatical functions and argument structure are both explicitly represented."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-32",
"text": "We will then briefly recall a recently proposed two-level parsing model (Henderson et al., 2013) , and then present the main contribution of the paper: the evaluation of parsing models that parse these twolevel syntactic-semantic dependencies on longdistance dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-33",
"text": "We also compare the results to other statistical dependency parsers, investigate the usefulness and informativeness of the extracted information, discuss and conclude."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-34",
"text": "Recall that the LDD encoded in the arcs under the sentence are the LDD that must be recovered."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-35",
"text": "They are shown for expository purpose and they are not usually part of the syntactic tree."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-36",
"text": "parse tree, either as co-indexed \"traces\", such as in the Penn Treebank, as illustrated in Figure 1 , or as arcs as in a dependency representation."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-37",
"text": "In practice, current statistical parsers do not encode LDD directly, as illustrated in Figure 2 , and leave it to post-processing procedures to recover the LDD relation (Johnson, 2002; Nivre et al., 2010) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-38",
"text": "These approaches exploit the very strong constraints that govern long-distance relations syntactically, and ignore the full or partial recovery of the semantic roles entirely."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-39",
"text": "Consider, for example, the representations for subject embeddings (first tree) and object reduced relatives (second tree) in Figure 2 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-40",
"text": "This figure illustrates the Stanford dependency representation that was used in Rimmel et al. (2009), and Nivre et al. (2010) , indicating below the sentence the long distance dependency that needs to be recovered, but that is not in the representation."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-41",
"text": "The first tree encodes the subject relation between ac- Figure 3 : LDDs represented as a syntactic dependency tree above the sentence (in blue) and argument structure labels under the sentence (in green)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-42",
"text": "The label A0 stands for AGENT and A1 stands for THEME."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-43",
"text": "The prefix AM indicates a modifier argument."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-44",
"text": "Recall that the LDDs encoded in the arcs under the sentence (in red) are the LDDs that must be recovered."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-45",
"text": "They are shown for expository purpose and they are neither part of the syntactic tree nor of the semantic graph."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-46",
"text": "tions and spreading as a sequence of two arcs rcmod(actions, saw) and xcomp(saw, spreading)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-47",
"text": "This sequence indicates a dependency relation in the opposite direction from the one needed to correctly recover the argument structure of the verb spread, and does not explicitly indicate the grammatical function,SUBJECT, nor the semantic role relation, THEME."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-48",
"text": "The label rcmod is the same label used to indicate the relationship between do and things in the second sentence, but in this case the relation is an object relation, so the distinction between subject-oriented and object-oriented relative clauses is encoded very indirectly."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-49",
"text": "This kind of encoding of argument structure and longdistance dependency is indirect and potentially lacking in perspicuity."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-50",
"text": "In a dependency formalism, two-level representations have been proposed to represent the syntactic and argument structures of a sentence in terms of dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-51",
"text": "Consider the representations in Figure 3 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-52",
"text": "The syntactic representation is the same as in the previous figures, but LDDs and argument structures are represented directly."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-53",
"text": "For example, the verb saw has two arguments, an AGENT and a THEME, while the verb spread has a long-distance dependency with the word actions, which is its THEME subject."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-54",
"text": "2 The verb do in the second sentence has a long-distance THEME object."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-55",
"text": "Therefore, the overall complex graph that represents both the syntax and the underlying argument structure of the sentences comprises two half graphs, sharing all vertices, the words."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-56",
"text": "They are indicated by the blue and green arcs, respectively, in Figure 3 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-57",
"text": "These representations factor the syntactic parse tree information from the argument structure information and provide, overall, more labelling information."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-77",
"text": "**PARSING TWO-LEVEL REPRESENTATIONS**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-58",
"text": "The parse tree is needed to provide a connected graph, to provide information about constituency/dependency relations for grammatical correctness (agreement, for example, is triggered in environments defined by grammatical functions, and not by semantic relations) and grammatical functions."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-59",
"text": "Argument structures are represented separately, for each predicate in the sentence and give explicit labels to the arguments."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-60",
"text": "While these labels are correlated to the grammatical functions, it is a well-established fact that they are not coextensive, for instance not all subjects are Agents as shown in Figure 3 , and therefore are not redundant."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-61",
"text": "From a linguistic point of view, these representations are related to many grammar formalisms that invoke the need to represent both grammatical functional level and argument structure level, such as tectogrammatical dependency representations (Hajic, 1998) , or early versions of transformational grammar."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-62",
"text": "From a graph-theoretic and parsing point of view, the complete graph of both the syntax and the semantics of the sentences is composed of two half graphs, which share all their vertices, namely the words."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-63",
"text": "Internally, these two half graphs exhibit different properties."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-64",
"text": "The syntactic graph is a single connected tree."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-65",
"text": "The semantic graph is a forest of one-level treelets, one for each proposition, which may be disconnected and may share children."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-66",
"text": "In both graphs, it is not generally appropriate to assume independence across the different treelets in the structure."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-67",
"text": "In the semantic graph, linguistic evidence that propositions are not independent of each other comes from constructions such as coordinations where some of the arguments are shared and semantically parallel."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-68",
"text": "Arcs in the semantic graph do not correspond one-to-one to arcs in the syntactic graph, indicating that a rather flexible framework is needed to capture the correlations between graphs."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-69",
"text": "The challenge, then, arises in developing models of these two-level representations."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-70",
"text": "These models must find an effective way of communicating the necessary information between the syntax and the argument structure representation."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-71",
"text": "From the practical point of view of existing resources, one version of these representations results from the merging of widely used and carefully annotated linguistic resources, PennTreebank (Marcus et al., 1993) and PropBank (Palmer et al., 2005) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-72",
"text": "They are PennTreebank-derived dependency representations that have been stripped of long-distance dependencies, and merged with PropBank encoding of argument structures."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-73",
"text": "But PropBank encodings are often based on the traceenriched PennTreeBank representations as a starting point."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-74",
"text": "Hence, these representations encode all LDDs, enriched with substantive semantic role labels, according to the PropBank labelling scheme."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-75",
"text": "3 They could also be constructed from other resources, for example by augmenting the current Universal dependency annotation scheme with extra semantic annotations (de Marneffe et al., 2014 )."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-78",
"text": "Developing models to learn these two-level analyses of syntax and argument structure raises several interesting questions regarding the design of the interface between the syntactic and the argument structure representations and how to learn these complex representations (Merlo and Musillo, 2008; Surdeanu et al., 2008) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-79",
"text": "4 A model that can parse these two leveldependencies is proposed in Henderson et al. (2013) and we adopt it here without modifications."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-80",
"text": "We choose this model for our evaluation of 3 These representations are the same, in practice, as the encoding used in some recent shared tasks (CoNLL 2008 and CoNLL 2009 (Surdeanu et al., 2008 Haji\u010d et al., 2009 )) for syntactic-semantic dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-81",
"text": "4 Joint syntactic-semantic dependency parsing was the theme of two CoNLL shared tasks."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-82",
"text": "CoNLL 2008 explored syntactic-semantic parsing for English, CoNLL 2009 extended the task to several languages."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-83",
"text": "Only four truly joint models were developed, and most of the multi-lingual models were fine-tuned specifically for each language."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-84",
"text": "long-distance dependencies as the best performing among those approaches that have attempted to model jointly the relationship between argument structure and surface syntax Surdeanu et al., 2008) and developments of this model have shown good performance on several languages (Gesmundo et al., 2009 ), without any language-specific tailoring."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-85",
"text": "These results suggest that this model can capture abstract linguistic regularities in a single parsing architecture."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-86",
"text": "5 We describe this model here very briefly."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-87",
"text": "For more detail on the parser and the model, we refer the interested reader to Henderson et al. (2013) and references therein."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-88",
"text": "The crucial intuitions behind the two-level approach is that the parsing mechanism must correlate the two half-graphs, but allow them to be constructed separately as they have very different properties."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-89",
"text": "The derivations for both syntactic dependency trees are based on a standard transition-based, shift-reduce style parser (Nivre et al., 2006) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-90",
"text": "The derivations for argument structure dependency graphs use virtually the same set of actions, but are augmented with a Swap action, that swaps the two words at the top of the stack."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-91",
"text": "The Swap action is inspired by the planarisation algorithm described in Hajicova et al.(2004) , where non-planar trees are transformed into planar ones by recursively rearranging their sub-trees to find a linear order of the words for which the tree is planar."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-92",
"text": "The probability model to determine which action to pursue is a joint generative model of syntactic and argument structure dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-93",
"text": "The two dependency structures are specified as the synchronised sequences of actions for a shift-reduce parser that operates on two different stacks."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-94",
"text": "By synchronising parsing for both the syntactic and the argument structure representations, a probabilistic model is learnt which maximises the joint probability of the syntactic and semantic dependencies and thereby guarantees that the output structure is globally coherent, while at the same time building the two structures separately."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-95",
"text": "The probabilistic estimation is based on Incremental Sigmoid Belief Networks (ISBNs)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-96",
"text": "The use of latent variables allows ISBNs to induce their fea-(3) Each must match Wisman's pie with the fragment that they carry with him."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-97",
"text": "(4) Five things you can do for 15,000 dollars or less."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-98",
"text": "(5) They will remain on a lower-priority list that includes 17 other countries."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-99",
"text": "(6) How he felt ready for the many actions he saw spreading out before him."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-100",
"text": "(7) What you see are self-help projects."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-101",
"text": "(8) What effect does a prism have on light?"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-102",
"text": "(9) The men were at first puzzled then angered by the aimless tacking."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-104",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-105",
"text": "In this section we assess how well the two-level parser performs on constructions involving longdistance dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-106",
"text": "In so doing, we verify that these two-level models of syntactic and argument structure representations can be learnt even in difficult cases, while also producing an output that is richer than what statistical parsers usually produce."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-107",
"text": "To confirm this statement, we expect to see that the syntactic dependency parsing performance is not degraded, compared to more standard statistical parsing architectures on long-distance dependencies, while also producing semantic role labels on these difficult constructions."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-109",
"text": "**THE TEST DATA**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-110",
"text": "To test the performance on LDDs, we use the test suites developed by Rimell et al. (2009) for English."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-111",
"text": "They comprise 560 test sentences, 80 for each type of construction."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-112",
"text": "Half of them are extracted from the Penn Treebank, half of them from the Brown corpus, balanced across construction types."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-113",
"text": "None of these sentences is included in the training set of the parser."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-114",
"text": "These sentences cover seven types of long-distance relations, illustrated in Figure 4 : object extraction from relative clauses (ORC) in (3) or from reduced relative clauses (ORed) in (4), subject extraction from relative clauses (SRC) in (5) or from an embedded clause (SEmb) in (6), free relatives (Free) in (7), object-oriented questions (OQ) in (8), and right node raising constructions (RNR) in (9)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-115",
"text": "Compared to the other statistical dependency parsers, questions (OQ) are not well represented in our training data, since they do not include the additional QB data (Nivre et al., 2010) used to improve the performance of MSTParser and MaltParser."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-117",
"text": "**PARSING SET UP**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-118",
"text": "Like the dependency parser in Nivre et al. (2010) , the parser was not trained on the same data or tree representations as those used in the test data."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-119",
"text": "The parser is trained on the data derived by merging a dependency transformation of the Penn Treebank with Propbank and Nombank (Surdeanu et al., 2008 )."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-120",
"text": "An illustrative example of the kind of labelled structures that we need to parse was given in Figure 3 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-121",
"text": "Training and development data follow the usual partition as sections 02-21, 24 of the Penn Treebank."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-122",
"text": "More details and references on the data, and the conversion of the Penn Treebank format to dependencies are given in Surdeanu et al. (2008) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-123",
"text": "Like for standard statistical and dependency parsers, the syntactic representation used by the two-level parser has been stripped of all traces."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-124",
"text": "The predicates of the argument structures and their locations are not provided at testing, unlike some of the CONLL shared tasks."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-125",
"text": "Unlike Nivre et al. (2010) , we did not use an external part-of-speech tagger to annotate the data of the development set."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-126",
"text": "To minimize preprocessing of the data, we choose to have part-ofspeech tagging as an internal part of the parsing model, which therefore, takes raw input."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-127",
"text": "In order for our results to be comparable to those reported in previous evaluations (Rimell et al., 2009; Nivre et al., 2010) , we ran the parser \"out of the box\" directly on the test sentences, without using the development sentences to finetune."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-128",
"text": "We were able to parse all the sentences in the test suites without any adjustments to the parser."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-129",
"text": "6"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-131",
"text": "**EVALUATION METHODOLOGY**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-132",
"text": "Like in previous papers (Rimell et al., 2009; Nivre et al., 2010) , we evaluate the parser on its ability to recover LDDs."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-133",
"text": "Two evaluations were done."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-134",
"text": "The first one was semi-automatic, performed with a modified version of the evaluation script developed in Rimell et al. (2009) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-135",
"text": "An independent manual evaluation was also performed."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-136",
"text": "A dependency is considered correctly recovered if a dependency in the gold data is found in the output."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-137",
"text": "A dependency is a triple comprising three items: the nodes connected by the arc in the graph and the label of the arc."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-138",
"text": "In principle, a dependency is considered correct if all three elements of the triple are correct."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-139",
"text": "However, in this evaluation the representations vary across models and exact matches would not allow a fair assessment."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-140",
"text": "Both previous evaluation exercises (Rimell et al., 2009; Nivre et al., 2010) suggest some avenues to relax the matching conditions, and define equivalence classes of representations."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-142",
"text": "**EQUIVALENCE CLASSES OF ARCS**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-143",
"text": "To relax the requirement of exact match on the definition of arc, a set of equivalence classes between single arcs and paths connecting two nodes indirectly is precisely defined in the post-processing scheme of Nivre et al. (2010) , which applies to the Stanford labelling scheme."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-144",
"text": "In Nivre et al. (2010) , the encoding of long-distance dependencies in a dependency parser is categorised as simple, complex, and indirect."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-145",
"text": "In the simple case, the LDD coincides with an arc in a tree."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-146",
"text": "In the complex case, the LDD is represented by a path of arcs."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-147",
"text": "In the indirect case, the dependency is not directly encoded in a path in the tree, but it must be inferred from a larger portion of the tree using heuristics."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-148",
"text": "The two last cases require post-processing of the tree."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-149",
"text": "In Rimell et al. (2009) , two dependencies are considered equivalent if they differ only in their definition of what counts as head."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-150",
"text": "For example, in some dependency schemes the preposition is the head of a prepositional phrase, while in others it is the noun."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-151",
"text": "We develop a definition of equivalence classes of arcs inspired by both these approaches."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-152",
"text": "Following Nivre et al. (2010) , we define a longdistance dependency as simple or complex."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-153",
"text": "In the simple case, the LDD coincides with an arc in a tree."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-154",
"text": "A complex dependency is defined as a path of at most two simple dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-155",
"text": "Unlike singlelevel statistical parsers, our two-level representation could create more than one path to connect two nodes, since two nodes could be connected both by a syntactic arc and by a semantic arc."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-156",
"text": "Following Rimell et al. (2009) , we define which path of two arcs is considered correct by allowing some flexibility in the definition of the head in very specific predefined cases, such as prepositional phrases."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-157",
"text": "The head can be either the word in the position indicated in the gold annotation, or its parent."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-158",
"text": "This definition applies, for example, to extraction from prepositional phrases which in our case are related to the semantic head, while in Rimell et al.'s scheme they are connected to the preposition."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-159",
"text": "This relaxed definition is triggered in 31 cases of semantic matches and 40 cases of syntactic matches, over a total of 398 matches."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-160",
"text": "The evaluation script was also augmented with a construction-specific rule to capture complex dependencies with be-constructions."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-161",
"text": "Sentence (10) is an example of a be-construction, where the gold dependency in (10a) corresponds to a path of two dependencies in (10b)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-162",
"text": "The latter consists of the subject dependency between the copula is, the head, and its subject childhood, and the predicative dependency between the head is and the predicative what."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-163",
"text": "For a complex dependency of this type to be counted correct, the end points of the path have to match the endpoints of the longdistance dependency in the gold and the labels have to be exactly as indicated, sbj and prd."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-164",
"text": "This specific rule adds seven correct cases to the total."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-166",
"text": "**EQUIVALENCE CLASSES OF LABELS**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-167",
"text": "The evaluation in Rimell et al. (2009) is largely done manually, and equivalences are decided by the authors."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-168",
"text": "Different labelling schemes are considered correct, as long as they can make the distinction between subject, object, indirect object and adjunct modifier."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-169",
"text": "We establish a correspondence of labels."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-170",
"text": "In our two-level representation, labels are the grammatical functions of the syntactic dependencies, and the semantic role labels, taken from PropBank."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-171",
"text": "7 Core arguments nsubj A0, A1, SBJ obj,dobj, pobj OBJ, A1, PMOD passive subj A1 obj2 A1 Other labels advmod LOC,TMP,MNR amod MNR,NMOD aux MOD,VC nn NAME, DEP partmod MOD Figure 5 : Gold data and two-level output label equivalences."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-172",
"text": "Our equivalences might depend only on the labels or on the labels in the context of the sentence type."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-173",
"text": "For example, the subject of a passive is an A1, that is a THEME."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-174",
"text": "In some cases, direct inspection of the predicate was necessary: A1 corresponds to subjects for some verbs even in the active voice."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-175",
"text": "A simple rule was applied to decide what verbs can exhibit an A1 subject, based on PropBank's framesets: If the frameset allowed A1 as a subject, in the appropriate sense of the verb, then the correspondence was accepted."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-176",
"text": "This decision rule applied to 33 cases (the (nsubj, A1) cell in Table 2 )."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-177",
"text": "The label equivalences are given in detail in Figure 5 : the grammatical function labels of the gold data are shown on the left and labels of the two-level parser are shown on the right."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-178",
"text": "The confusion matrix by labels is provided in Table 2 ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-179",
"text": "Manual evaluation The evaluation was also done manually by a judge, a trained linguist, who had not developed the initial script."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-180",
"text": "We used a visualisation tool (Tred) (Pajas andSt\u0207p\u00e1nek, 2008) , adapted to our output, to facilitate the inspection of the two-level representations and avoid mistakes."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-181",
"text": "In the manual evaluation, a dependency is correctly recovered if an arc and its syntactic/semantic label (see Figure 4) are correct."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-182",
"text": "Three different constructions need to be mentioned, because they have special chracteristics that had to be taken into account: coordination, right node raising and small clauses."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-183",
"text": "A dependency may be found directly, as a single arc, or by coordination."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-184",
"text": "Regarding coordina- (1993) 's semantic propositions of alternating verbs."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-185",
"text": "PropBank propositions have been shown to be closely related to grammatical functions (Merlo and van der Plas, 2009 )."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-186",
"text": "So we can assume that grammatical functions can also be inferred from PropBank relations in most cases."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-187",
"text": "tion, we follow the Stanford scheme, according to which an argument or adjunct must be attached to the first conjunct to indicate that it belongs to both conjuncts."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-188",
"text": "Right node raising is too difficult to evaluate automatically."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-189",
"text": "In Rimell et al. (2009) 's definition, right node raising is represented by two arcs."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-190",
"text": "It is considered correctly recovered if one of the arcs was correct and the other was found either directly or by coordination."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-191",
"text": "We evaluate right node raising by hand, in the same way: either the dependency was found directly or by coordination, either in the syntax or in the argument structure."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-192",
"text": "Small clauses are rare, complex dependencies that were evaluated by hand."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-193",
"text": "Sentence (11) is an example of a small clause construction, where the nsubj dependency of the gold data (11a) corresponds to two dependencies (11b): one between the head called and its object/theme horses, and one between called and the object predicative Dogs."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-194",
"text": "We found only five cases of this construction."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-195",
"text": "However, these five dependencies do make a difference, because they all appear in SEmb, which has a low percent recall, as shown in Table 1 . (11)"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-197",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-198",
"text": "Automatic and manual results (percent recall) are shown in Table 1 , where we compare our results to the relevant ones of those reported in previous evaluations (Rimell et al., 2009; Nivre et al., 2010; Nguyen et al., 2012) ."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-199",
"text": "8 These papers compare several statistical parsers."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-219",
"text": "Link errors related to relative clauses (indirect dependencies) are classified as Sem errors."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-200",
"text": "Some parsers like Nguyen, the C&C parser (Clark and Curran, 2007) and Enju (Miyao and Tsujii, 2005) are based on rich grammatical formalisms, and others others are representative of statistical dependency parsers (MST, MALT, (McDonald, 2006; Nivre et al., 2006) These last two parsers constitute the relevant comparison for our approach."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-201",
"text": "9 Like the other parsers discussed in Rimell et al. (2009) and Nivre et al. (2010) , the overall performance on these long-distance constructions is much lower than the overall scores for this parser."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-202",
"text": "However, the parser recovers long-distance dependencies at least as well as standard statistical dependency parsers that use a post-processing step, and better than standard statistical parsers."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-203",
"text": "10 The differences in recall between manual and automatic evaluation in Table 1 show that the automatic evaluation is sometimes too strict and sometimes too lenient."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-204",
"text": "The former cases arise primarily in small clause dependencies and dependency recovery by coordination across all LDD constructions, which were taken into account in the manual evaluation, but not in the automatic evaluation, because, as indicated above, scoring coordination automatically is too difficult."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-205",
"text": "This explains the recall difference between the two evaluation methods in SRC and SEmb."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-206",
"text": "The latter case is due to the stricter definition of head in the manual evaluation."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-207",
"text": "This is the main reason why ORed and OQ have lower recall in this evaluation."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-208",
"text": "Table 2 reports some of the labelled error counts of the most frequent labels."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-209",
"text": "In general, the confusion matrix shows that the labelled correspondence is accurate, and that it corresponds to meaningful generalisations."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-210",
"text": "As can also be observed, a single grammatical function label corresponds to several different semantic relations and vice versa."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-211",
"text": "Full recovery of argument structure, then, requires both grammatical syntactic relations and semantic role labelling."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-213",
"text": "**ERROR ANALYSIS OF DEVELOPMENT SETS**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-214",
"text": "We classify the errors made by our parser on the development set based on Nivre et al. (2010) one which occurs when the parser fails to assign the correct functional relation (e.g., subject, object), while a Sem error is one in which the parser fails to assign the correct semantic relation (e.g., A1, A2)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-215",
"text": "Nivre et al's Link error is one where the parser fails to find a dependency by coordination in the case of right node raising."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-216",
"text": "Our restrictive modifications follow the constraints indicated above on what counts as a correct dependency."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-217",
"text": "In particular, we only count as correct two types of dependencies: simple, in which the dependency is represented as a single arc in the parse tree; and complex, where a gold dependency corresponds to a path of only two direct dependencies, such as in the case of predicative constructions and prepositional phrases discussed above."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-218",
"text": "Our definition of complex dependencies is stricter than Nivre et al.'s, and we do not count indirect dependencies."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-220",
"text": "11 Table 3 shows the frequency of the error types 11 Nivre et al.'s Link errors also include cases where the parser fails to find the crucial Link relations rcmod in ORed, ORC, SRC, and SEmb."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-221",
"text": "This type of Link error is not relevant for us."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-222",
"text": "for our parser in the seven development sets."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-223",
"text": "Global errors are most frequent for OQ, ORC and SRC."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-224",
"text": "Questions (OQ) are not well represented in our training data, since they do not include the additional QB data (Nivre et al., 2010) used to improve the performance of MSTParser and MaltParser (see Table 4 for comparison of number of errors for each parser)."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-225",
"text": "With respect to ORC and SRC, most Global errors are related to part-ofspeech tagging errors and wrong head assignment of complex NPs which are modified by the relevant relative clause."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-226",
"text": "In particular there seems to be a strong recency preference, which assigns the relative clause to the closest noun head in a complex NP."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-227",
"text": "A closer look at Arg errors shows that, in ORed, ORC and OQ, the most frequent errors are because the parser fails to find the Arg relation between a preposition and its argument in cases of preposition stranding."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-228",
"text": "Based on the comparison of errors of other statistical dependency parsers on the development set, shown in Table 4 , we can conclude that the trends of errors by constructions are the same in all three parsers."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-229",
"text": "----------------------------------"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-230",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-231",
"text": "In this paper, we have evaluated an approach to learn two-level long-distance representations that encode argument structure information directly, as a particularly difficult test case, and shown that we can learn these difficult constructions as well as dependency parsers augmented with a dedicated long-distance dependency post-processing step."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-232",
"text": "This work also shows that resources and methods to recover these richer representations already exist."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-233",
"text": "It is important to recall that the predicateargument structure of a clause is considered central for NLP applications because it represents the grammatically relevant lexical semantic content of the clause."
},
{
"sent_id": "46a23364b7bc51493d83f874a824ad-C001-234",
"text": "The two-level parser described in this paper can recover this information, while purely syntactic parsers, whether they recover long-distance dependencies or not, would still need further enhancements."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"46a23364b7bc51493d83f874a824ad-C001-37",
"46a23364b7bc51493d83f874a824ad-C001-38"
],
[
"46a23364b7bc51493d83f874a824ad-C001-40"
],
[
"46a23364b7bc51493d83f874a824ad-C001-115"
],
[
"46a23364b7bc51493d83f874a824ad-C001-140"
],
[
"46a23364b7bc51493d83f874a824ad-C001-143"
],
[
"46a23364b7bc51493d83f874a824ad-C001-144",
"46a23364b7bc51493d83f874a824ad-C001-145",
"46a23364b7bc51493d83f874a824ad-C001-146",
"46a23364b7bc51493d83f874a824ad-C001-147",
"46a23364b7bc51493d83f874a824ad-C001-148"
],
[
"46a23364b7bc51493d83f874a824ad-C001-198"
]
],
"cite_sentences": [
"46a23364b7bc51493d83f874a824ad-C001-37",
"46a23364b7bc51493d83f874a824ad-C001-40",
"46a23364b7bc51493d83f874a824ad-C001-115",
"46a23364b7bc51493d83f874a824ad-C001-140",
"46a23364b7bc51493d83f874a824ad-C001-143",
"46a23364b7bc51493d83f874a824ad-C001-144",
"46a23364b7bc51493d83f874a824ad-C001-198"
]
},
"@MOT@": {
"gold_contexts": [
[
"46a23364b7bc51493d83f874a824ad-C001-37",
"46a23364b7bc51493d83f874a824ad-C001-38"
]
],
"cite_sentences": [
"46a23364b7bc51493d83f874a824ad-C001-37"
]
},
"@DIF@": {
"gold_contexts": [
[
"46a23364b7bc51493d83f874a824ad-C001-115"
],
[
"46a23364b7bc51493d83f874a824ad-C001-125",
"46a23364b7bc51493d83f874a824ad-C001-126"
],
[
"46a23364b7bc51493d83f874a824ad-C001-201",
"46a23364b7bc51493d83f874a824ad-C001-202"
],
[
"46a23364b7bc51493d83f874a824ad-C001-224"
]
],
"cite_sentences": [
"46a23364b7bc51493d83f874a824ad-C001-115",
"46a23364b7bc51493d83f874a824ad-C001-125",
"46a23364b7bc51493d83f874a824ad-C001-201",
"46a23364b7bc51493d83f874a824ad-C001-224"
]
},
"@SIM@": {
"gold_contexts": [
[
"46a23364b7bc51493d83f874a824ad-C001-118"
],
[
"46a23364b7bc51493d83f874a824ad-C001-132",
"46a23364b7bc51493d83f874a824ad-C001-133",
"46a23364b7bc51493d83f874a824ad-C001-134",
"46a23364b7bc51493d83f874a824ad-C001-135"
],
[
"46a23364b7bc51493d83f874a824ad-C001-152",
"46a23364b7bc51493d83f874a824ad-C001-153",
"46a23364b7bc51493d83f874a824ad-C001-154"
],
[
"46a23364b7bc51493d83f874a824ad-C001-201",
"46a23364b7bc51493d83f874a824ad-C001-202"
]
],
"cite_sentences": [
"46a23364b7bc51493d83f874a824ad-C001-118",
"46a23364b7bc51493d83f874a824ad-C001-132",
"46a23364b7bc51493d83f874a824ad-C001-152",
"46a23364b7bc51493d83f874a824ad-C001-201"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"46a23364b7bc51493d83f874a824ad-C001-127"
]
],
"cite_sentences": [
"46a23364b7bc51493d83f874a824ad-C001-127"
]
},
"@USE@": {
"gold_contexts": [
[
"46a23364b7bc51493d83f874a824ad-C001-214"
]
],
"cite_sentences": [
"46a23364b7bc51493d83f874a824ad-C001-214"
]
}
}
},
"ABC_5a2cd80d7c57e06a51457e53169b49_5": {
"x": [
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-79",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-80",
"text": "Dataset and Setup."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-106",
"text": "**MUSE**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-107",
"text": "back translation between Adv-C and our method on MUSE."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-2",
"text": "Unsupervised bilingual lexicon induction naturally exhibits duality, which results from symmetry in back-translation."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-3",
"text": "For example, EN-IT and IT-EN induction can be mutually primal and dual problems."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-50",
"text": "logP Dy (src = 1|y j )"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-4",
"text": "Current state-ofthe-art methods, however, consider the two tasks independently."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-5",
"text": "In this paper, we propose to train primal and dual models jointly, using regularizers to encourage consistency in back translation cycles."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-6",
"text": "Experiments across 6 language pairs show that the proposed method significantly outperforms competitive baselines, obtaining the best published results on a standard benchmark."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-9",
"text": "Unsupervised bilingual lexicon induction (UBLI) has been shown to benefit NLP tasks for low resource languages, including unsupervised NMT (Artetxe et al., 2018b,c; Lample et al., 2018a,b) , information retrieval (Vuli\u0107 and Moens, 2015; Litschko et al., 2018) , dependency parsing (Guo et al., 2015) , and named entity recognition (Mayhew et al., 2017; Xie et al., 2018) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-10",
"text": "Recent research has attempted to induce unsupervised bilingual lexicons by aligning monolingual word vector spaces (Zhang et al., 2017a; Conneau et al., 2018; Aldarmaki et al., 2018; Artetxe et al., 2018a; Alvarez-Melis and Jaakkola, 2018; Mukherjee et al., 2018) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-11",
"text": "Given a pair of languages, their word alignment is inherently a bi-directional problem (e.g. EnglishItalian vs Italian-English)."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-12",
"text": "However, most existing research considers mapping from one language to another without making use of symmetry."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-13",
"text": "Our experiments show that separately learned UBLI models are not always consistent in opposite directions."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-14",
"text": "As shown in Figure 1a , when the model of Conneau et al. (2018) is applied to English and Italian, the primal model maps the word \"three\" to the Italian word \"tre\", but the dual model maps \"tre\" to \"two\" instead of \"three\"."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-15",
"text": "We propose to address this issue by exploiting duality, encouraging forward and backward mappings to form a closed loop (Figure 1b )."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-16",
"text": "In particular, we extend the model of Conneau et al. (2018) by using a cycle consistency loss (Zhou et al., 2016) to regularize two models in opposite directions."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-17",
"text": "Experiments on two benchmark datasets show that the simple method of enforcing consistency gives better results in both directions."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-18",
"text": "Our model significantly outperforms competitive baselines, obtaining the best published results."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-19",
"text": "We release our code at xxx."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-21",
"text": "**RELATED WORK**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-22",
"text": "UBLI."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-23",
"text": "A typical line of work uses adversarial training (Miceli Barone, 2016; Zhang et al., 2017a,b; Conneau et al., 2018) , matching the distributions of source and target word embeddings through generative adversarial networks (Goodfellow et al., 2014) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-24",
"text": "Non-adversarial approaches have also been explored."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-25",
"text": "For instance, Mukherjee et al. (2018) use squared-loss mutual information to search for optimal cross-lingual word pairing."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-51",
"text": "logP Dy (src = 0|F(x i ))."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-26",
"text": "Artetxe et al. (2018a) and Hoshen and Wolf (2018) exploit the structural similarity of word embedding spaces to learn word mappings."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-27",
"text": "In this paper, we choose Conneau et al. (2018) as our baseline as it is theoretically attractive and gives strong results on large-scale datasets."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-28",
"text": "Cycle Consistency."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-29",
"text": "Forward-backward consistency has been used to discover the correspondence between unpaired images (Zhu et al., 2017; Kim et al., 2017) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-30",
"text": "In machine translation, similar ideas were exploited, He et al. (2016) , Xia et al. (2017) and use dual learning to train two opposite language translators by minimizing the reconstruction loss."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-31",
"text": "Sennrich et al. (2016) consider back-translation, where a backward model is used to build synthetic parallel corpus and a forward model learns to generate genuine text based on the synthetic output."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-32",
"text": "Closer to our method, Chandar et al. (2014) jointly train two autoencoders to learn supervised bilingual word embeddings."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-33",
"text": "use sinkhorn distance (Cuturi, 2013) and backtranslation to align word embeddings."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-34",
"text": "However, they cannot perform fully unsupervised training, relying on WGAN (Arjovsky et al., 2017) for providing initial mappings."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-35",
"text": "Concurrent with our work, Mohiuddin and Joty (2019) build a adversarial autoencoder with cycle consistency loss and post-cycle reconstruction loss."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-36",
"text": "In contrast to these works, our method is fully unsupervised, simpler, and empirically more effective."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-38",
"text": "**APPROACH**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-39",
"text": "We take Conneau et al. (2018) as our baseline, introducing a novel regularizer to enforce cycle consistency."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-40",
"text": "Let X = {x 1 , ..., x n } and Y = {y 1 , ..., y m } be two sets of n and m word embeddings for a source and a target language, respectively."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-41",
"text": "The primal UBLI task aims to learn a linear mapping F : X \u2192 Y such that for each x i , F(x i ) corresponds to its translation in Y ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-42",
"text": "Similarly, a linear mapping G : Y \u2192 X is defined for the dual task."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-43",
"text": "In addition, we introduce two language discriminators D x and D y , which are trained to discriminate between the mapped word embeddings and the original word embeddings."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-44",
"text": "Conneau et al. (2018) align two word embedding spaces through generative adversarial networks, in which two networks are trained simultaneously."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-45",
"text": "Specifically, take the primal UBLI task as an example, the linear mapping F tries to generate \"fake\" word embeddings F(x) that look similar to word embeddings from Y , while the discriminator D y aims to distinguish between \"fake\" and real word embeddings from Y ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-46",
"text": "Formally, this"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-48",
"text": "**BASELINE ADVERSARIAL MODEL**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-49",
"text": "idea can be expressed as the minmax game min F max Dy adv (F, D y , X, Y ), where"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-52",
"text": "( 1) P Dy (src|y j ) is a model probability from D y to distinguish whether word embedding y j is coming from the target language (src = 1) or the primal mapping F (src = 0)."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-53",
"text": "Similarly, the dual UBLI problem can be formulated as"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-54",
"text": ", where G is the dual mapping, and D x is a source discriminator."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-55",
"text": "Theoretically, a unique solution for above minmax game exists, with the mapping and the discriminator reaching a nash equilibrium."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-56",
"text": "Since the adversarial training happens at the distribution level, no cross-lingual supervision is required."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-58",
"text": "**REGULARIZERS FOR DUAL MODELS**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-59",
"text": "We train F and G jointly and introduce two regularizers."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-60",
"text": "Formally, we hope that G(F(X)) is similar to X and F(G(Y )) is similar to Y ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-61",
"text": "We implement this constraint as a cycle consistency loss."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-62",
"text": "As a result, the proposed model has two learning objectives: i) an adversarial loss ( adv ) for each model as in the baseline."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-63",
"text": "ii) a cycle consistency loss ( cycle ) on each side to avoid F and G from contradicting each other."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-64",
"text": "The overall architecture of our model is illustrated in Figure 2 ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-65",
"text": "Cycle Consistency Loss."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-66",
"text": "We introduce"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-67",
"text": "where \u2206 denotes the discrepancy criterion, which is set as the average cosine similarity in our model."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-69",
"text": "**MODEL SELECTION**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-70",
"text": "We follow Conneau et al. (2018) , using an unsupervised criterion to perform model selection."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-71",
"text": "In preliminary experiments, we find in adversarial training that the single-direction criterion S(F, X, Y ) by Conneau et al. (2018) does not always work well."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-72",
"text": "To address this, we make a simple extension by calculating the weighted average of forward and backward scores:"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-73",
"text": "Where \u03bb is a hyperparameter to control the importance of the two objectives."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-74",
"text": "1 Here S first generates bilingual lexicons by learned mappings, and then computes the average cosine similarity of these translations."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-76",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-77",
"text": "We perform two sets of experiments, to investigate the effectiveness of our duality regularization in isolation (Section 4.2) and to compare our final models with the state-of-the-art methods in the literature (Section 4.3), respectively."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-81",
"text": "Our datasets includes: (i) The Multilingual Unsupervised and Supervised Embeddings (MUSE) dataset released by Conneau et al. (2018) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-82",
"text": "(ii) the more challenging Vecmap dataset from Dinu et al. (2015) and the extensions of Artetxe et al. (2017) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-83",
"text": "We follow the evaluation setups of Conneau et al. (2018) , utilizing cross-domain similarity local scaling (CSLS) for retrieving the translation of given source words."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-84",
"text": "Following a standard evaluation practice (Vuli\u0107 and Moens, 2013; Mikolov et al., 2013; Conneau et al., 2018) , we report precision at 1 scores (P@1)."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-85",
"text": "Given the instability of existing methods, we follow Artetxe et al. (2018a) to perform 10 runs for each method and report the best and the average accuracies."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-87",
"text": "**THE EFFECTIVENESS OF DUAL LEARNING**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-88",
"text": "We compare our method with Conneau et al. (2018) (Adv-C) under the same settings."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-89",
"text": "As 1 We find that \u03bb = 0.5 generally works well."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-91",
"text": "**SETTING**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-94",
"text": "Adv-C Ours three-tre-two three-tre-three neck-collo-ribcage neck-collo-neck door-finestrino-window door-portiera-door second-terzo-third second-terzo-second before-prima-first before-dopo-after shown in Table 1 , our model outperforms Adv-C on both MUSE and Vecmap for all language pairs (except ES-EN)."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-95",
"text": "In addition, the proposed approach is less sensitive to initialization, and thus more stable than Adv-C over multiple runs."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-96",
"text": "These results demonstrate the effectiveness of dual learning."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-97",
"text": "Our method is also superior to Adv-C for the low-resource language pairs English \u2194 Malay (MS) and English \u2194 English-Esperanto (EO)."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-98",
"text": "Adv-C gives low performances on ES-EN, DE-EN, but much better results on the opposite directions on Vecmap."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-99",
"text": "This is likely because the separate models are highly under-constrained, and thus easy to get stuck in poor local optima."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-100",
"text": "In contrast, our method gives comparable results on both directions for the two languages, thanks to the use of information symmetry."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-101",
"text": "Table 4 : Accuracy (P@1) on Vecmap."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-102",
"text": "The best results are bolded."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-103",
"text": "\u2020Results as reported in the original paper."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-104",
"text": "For unsupervised methods, we report the average accuracy across 10 runs."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-108",
"text": "Compared with Adv-C, our model significantly reduces the inconsistency rates on all language pairs, which explains the overall improvement in Table 1."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-109",
"text": "Table 3 gives several word translation examples."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-110",
"text": "In the first three cases, our regularizer successfully fixes back translation errors."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-111",
"text": "In the fourth case, ensuring cycle consistency does not lead to the correct translation, which explains some errors by our system."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-112",
"text": "In the fifth case, our model finds a related word but not the same word in the back translation, due to the use of cosine similarity for regularization."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-114",
"text": "**COMPARISON WITH THE STATE-OF-THE-ART**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-115",
"text": "In this section, we compare our model with state-of-the-art systems, including those with different degrees of supervision."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-116",
"text": "The baselines include: (1) Procrustes (Conneau et al., 2018) , which learns a linear mapping through Procrustes Analysis (Sch\u00f6nemann, 1966) ."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-117",
"text": "(2) GPA (Kementchedjhieva et al., 2018) , an extension of Procrustes Analysis."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-118",
"text": "(3) GeoMM (Jawanpuria et al., 2018), a geometric approach which learn a Mahalanobis metric to refine the notion of similarity."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-119",
"text": "(4) GeoMM semi , iterative GeoMM with weak supervision. (5) Adv-C-Procrustes (Conneau et al., 2018) , which refines the mapping learned by Adv-C with iterative Procrustes, which learns the new mapping matrix by constructing a bilingual lexicon iteratively."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-120",
"text": "(6) Unsup-SL (Artetxe et al., 2018a) , which integrates a weak unsupervised mapping with a robust selflearning."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-121",
"text": "(7) Sinkhorn-BT , which combines sinkhorn distance (Cuturi, 2013) and back-translation."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-122",
"text": "For fair comparison, we integrate our model with two iterative refinement methods (Procrustes and GeoMM semi )."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-123",
"text": "Table 4 shows the final results on Vecmap."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-124",
"text": "3 We first compare our model with the stateof-the-art unsupervised methods."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-125",
"text": "Our model based on procrustes (Ours-Procrustes) outperforms Sinkhorn-BT on all test language pairs, and shows better performance than Adv-C-Procrustes on most language pairs."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-126",
"text": "Adv-C-Procrustes gives very low precision on DE-EN, FI-EN and ES-EN, while Ours-Procrustes obtains reasonable results consistently."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-127",
"text": "A possible explanation is that dual learning is helpful for providing good initiations, so that the procrustes solution is not likely to fall in poor local optima."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-128",
"text": "The reason why Unsup-SL gives strong results on all language pairs is that it uses a robust self-learning framework, which contains several techniques to avoid poor local optima."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-129",
"text": "Additionally, we observe that our unsupervised method performs competitively and even better compared with strong supervised and semisupervised approaches."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-130",
"text": "Ours-Procrustes obtains comparable results with Procrustes on EN-IT and gives strong results on EN-DE, EN-FI, EN-ES and the opposite directions."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-131",
"text": "Ours-GeoMM semi obtains the state-of-the-art results on all tested language pairs except EN-FI, with the additional advantage of being fully unsupervised."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-133",
"text": "**CONCLUSION**"
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-134",
"text": "We investigated a regularization method to enhance unsupervised bilingual lexicon induction, by encouraging symmetry in lexical mapping between a pair of word embedding spaces."
},
{
"sent_id": "5a2cd80d7c57e06a51457e53169b49-C001-135",
"text": "Results show that strengthening bi-directional mapping consistency significantly improves the effectiveness over the state-of-the-art method, leading to the best results on a standard benchmark."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-10",
"5a2cd80d7c57e06a51457e53169b49-C001-11",
"5a2cd80d7c57e06a51457e53169b49-C001-12"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-14",
"5a2cd80d7c57e06a51457e53169b49-C001-15"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-23"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-44"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-115",
"5a2cd80d7c57e06a51457e53169b49-C001-116"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-119"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-10",
"5a2cd80d7c57e06a51457e53169b49-C001-14",
"5a2cd80d7c57e06a51457e53169b49-C001-23",
"5a2cd80d7c57e06a51457e53169b49-C001-116",
"5a2cd80d7c57e06a51457e53169b49-C001-119"
]
},
"@MOT@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-10",
"5a2cd80d7c57e06a51457e53169b49-C001-11",
"5a2cd80d7c57e06a51457e53169b49-C001-12"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-14",
"5a2cd80d7c57e06a51457e53169b49-C001-15"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-16",
"5a2cd80d7c57e06a51457e53169b49-C001-18"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-71",
"5a2cd80d7c57e06a51457e53169b49-C001-72",
"5a2cd80d7c57e06a51457e53169b49-C001-73",
"5a2cd80d7c57e06a51457e53169b49-C001-74"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-10",
"5a2cd80d7c57e06a51457e53169b49-C001-14",
"5a2cd80d7c57e06a51457e53169b49-C001-16",
"5a2cd80d7c57e06a51457e53169b49-C001-71"
]
},
"@SIM@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-16",
"5a2cd80d7c57e06a51457e53169b49-C001-18"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-27"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-16",
"5a2cd80d7c57e06a51457e53169b49-C001-27"
]
},
"@DIF@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-16",
"5a2cd80d7c57e06a51457e53169b49-C001-18"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-16"
]
},
"@USE@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-39",
"5a2cd80d7c57e06a51457e53169b49-C001-40",
"5a2cd80d7c57e06a51457e53169b49-C001-41",
"5a2cd80d7c57e06a51457e53169b49-C001-42",
"5a2cd80d7c57e06a51457e53169b49-C001-43"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-70"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-81"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-83"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-84"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-39",
"5a2cd80d7c57e06a51457e53169b49-C001-70",
"5a2cd80d7c57e06a51457e53169b49-C001-81",
"5a2cd80d7c57e06a51457e53169b49-C001-83",
"5a2cd80d7c57e06a51457e53169b49-C001-84"
]
},
"@EXT@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-39",
"5a2cd80d7c57e06a51457e53169b49-C001-40",
"5a2cd80d7c57e06a51457e53169b49-C001-41",
"5a2cd80d7c57e06a51457e53169b49-C001-42",
"5a2cd80d7c57e06a51457e53169b49-C001-43"
],
[
"5a2cd80d7c57e06a51457e53169b49-C001-71",
"5a2cd80d7c57e06a51457e53169b49-C001-72",
"5a2cd80d7c57e06a51457e53169b49-C001-73",
"5a2cd80d7c57e06a51457e53169b49-C001-74"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-39",
"5a2cd80d7c57e06a51457e53169b49-C001-71"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"5a2cd80d7c57e06a51457e53169b49-C001-88"
]
],
"cite_sentences": [
"5a2cd80d7c57e06a51457e53169b49-C001-88"
]
}
}
},
"ABC_d8e73e9c00acffc34ade1331709d92_5": {
"x": [
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-2",
"text": "This paper presents a novel approach to improve reordering in phrase-based machine translation by using richer, syntactic representations of units of bilingual language models (BiLMs)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-3",
"text": "Our method to include syntactic information is simple in implementation and requires minimal changes in the decoding algorithm."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-4",
"text": "The approach is evaluated in a series of ArabicEnglish and Chinese-English translation experiments."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-5",
"text": "The best models demonstrate significant improvements in BLEU and TER over the phrase-based baseline, as well as over the lexicalized BiLM by Niehues et al. (2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-6",
"text": "Further improvements of up to 0.45 BLEU for ArabicEnglish and up to 0.59 BLEU for ChineseEnglish are obtained by combining our dependency BiLM with a lexicalized BiLM."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-7",
"text": "An improvement of 0.98 BLEU is obtained for Chinese-English in the setting of an increased distortion limit."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-10",
"text": "In statistical machine translation (SMT) reordering (also called distortion) refers to the order in which source words are translated to generate the translation in the target language."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-11",
"text": "Word orders can differ significantly across languages."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-12",
"text": "For instance, Arabic declarative sentences can be verbinitial, while the corresponding English translation should realize the verb after the subject, hence requiring a reordering."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-13",
"text": "Determining the correct reordering during decoding is a major challenge for SMT."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-14",
"text": "This problem has received a lot of attention in the literature (see, e.g., Tillmann (2004) , Zens and Ney (2003) , Al-Onaizan and Papineni (2006) ), as choosing the correct reordering improves readability of the translation and can have a substantial impact on translation quality (Birch, 2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-15",
"text": "In this paper, we only consider those approaches that include a reordering feature function into the loglinear interpolation used during decoding."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-16",
"text": "The simplest reordering model is linear distortion (Koehn et al., 2003) which scores the distance between phrases translated at steps t and t + 1 of the derivation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-17",
"text": "This model ignores any contextual information, as the distance between translated phrases is its only parameter."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-18",
"text": "Lexical distortion modeling (Tillmann, 2004) conditions reordering probabilities on the phrase pairs translated at the current and previous steps."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-19",
"text": "Unlike linear distortion, it characterizes reordering not in terms of distance but type: monotone, swap, or discontinuous."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-20",
"text": "In this paper, we base our approach to reordering on bilingual language models (Marino et al., 2006; Niehues et al., 2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-21",
"text": "Instead of directly characterizing reordering, they model sequences of elementary translation events as a Markov process."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-22",
"text": "1 Originally, Marino et al. (2006) used this kind of model as the translation model, while more recently it has been used as an additional model in PBSMT systems (Niehues et al., 2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-23",
"text": "We adopt and generalize the approach of Niehues et al. (2011) to investigate several variations of bilingual language models."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-24",
"text": "Our method consists of labeling elementary translation events (tokens of bilingual LMs) with their different contextual properties."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-25",
"text": "What kind of contextual information should be incorporated in a reordering model?"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-26",
"text": "Lexical information has been used by Tillmann (2004) but is known to suffer from data sparsity (Galley and Manning, 2008) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-27",
"text": "Also previous contributions to bilingual language modeling (Marino et al., 2006; Niehues et al., 2011) have mostly used lexical information, although Crego and Yvon (2010a) and Crego and Yvon (2010b) label bilingual to-kens with a rich set of POS tags."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-28",
"text": "But in general, reordering is considered to be a syntactic phenomenon and thus the relevant features are syntactic (Fox, 2002; Cherry, 2008) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-29",
"text": "Syntactic information is incorporated in tree-based approaches in SMT, allowing one to provide a more detailed definition of translation events and to redefine decoding as parsing of a source string (Liu et al., 2006; Huang et al., 2006; Marton and Resnik, 2008) , of a target string (Shen et al., 2008) , or both (Chiang, 2007; Chiang, 2010) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-30",
"text": "Reordering is a result of a given derivation, and CYK-based decoding used in tree-based approaches is more syntax-aware than the simple PBSMT decoding algorithm."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-31",
"text": "Although tree-based approaches potentially offer a more accurate model of translation, they are also a lot more complex and requiring more intricate optimization and estimation techniques (Huang and Mi, 2010) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-32",
"text": "Our idea is to keep the simplicity of PBSMT but move towards the expressiveness typical of treebased models."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-33",
"text": "We incrementally build up the syntactic representation of a translation during decoding by adding precomputed fragments from the source parse tree."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-34",
"text": "The idea to combine the merits of the two SMT paradigms has been proposed before, where Huang and Mi (2010) introduce incremental decoding for a tree-based model."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-35",
"text": "On a very general level, our approach is similar to theirs in that it keeps track of a sequence of source syntactic subtrees that are being translated at consecutive decoding steps."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-36",
"text": "An important difference is that they keep track of whether the visited subtrees have been fully translated, while in our approach, once a syntactic structural unit has been added to the history, it is not updated anymore."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-37",
"text": "In this paper, we focus on source syntactic information."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-38",
"text": "During decoding we have full access to the source sentence, which allows us to obtain a better syntactic analysis (than for a partial sentence) and to precompute the units that the model operates with."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-39",
"text": "We investigate the following research questions: How well can we capture reordering regularities of a language pair by incorporating source syntactic parameters into the units of a bilingual language model? What kind of source syntactic parameters are necessary and sufficient?"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-40",
"text": "Our contributions can be summarized as follows: We argue that the contextual information used in the original bilingual models (Niehues et al., 2011) is insufficient and introduce a simple model that exploits source-side syntax to improve reordering (Sections 2 and 3)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-41",
"text": "We perform a thorough comparison between different variants of our general model and compare them to the original approach."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-42",
"text": "We carry out translation experiments on multiple test sets, two language pairs (ArabicEnglish and Chinese-English), and with respect to two metrics (BLEU and TER)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-43",
"text": "Finally, we present a preliminary analysis of the reorderings resulting from the proposed models (Section 4)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-45",
"text": "**MOTIVATION**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-46",
"text": "In this section, we elaborate on our research questions and provide background for our approach."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-47",
"text": "We also discuss existing bilingual n-gram models and argue that they are often not expressive enough to differentiate between alternative reorderings."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-48",
"text": "We should first note that the most commonly used n-gram model to distinguish between reorderings is a target language model, which does not take translation correspondence into account and just models target-side fluency."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-49",
"text": "Al-Onaizan and Papineni (2006) show that target language models by themselves are not sufficient to correctly characterize reordering."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-50",
"text": "In what follows we only discuss bilingual models."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-51",
"text": "The word-aligned sentence pair in Figure 1 .a 2 demonstrates a common Arabic-English reordering."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-52",
"text": "As stated in the introduction, bilingual language models capture reordering regularities as a sequence of elementary translation events 3 ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-53",
"text": "In the given example, one could decompose the sequential process of translation as follows: First translate the first word Alwzyr as the minister, then ArjE as attributed, then ArtfAE as the increase and so on."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-54",
"text": "The sequence of elementary translation events is modeled as an n-gram model (Equation 1, where t i is a translation event)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-55",
"text": "There are numerous ways in which t i can be defined."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-56",
"text": "Below we first discuss how they have been defined within previous approaches, and then introduce our definition."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-58",
"text": "**LEXICALIZED BILINGUAL LMS**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-59",
"text": "By including both source and target information into the representation of translation events we ob- tain a bilingual LM."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-60",
"text": "The richer representation allows for a finer distinction between reorderings."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-61",
"text": "For example, Arabic has a morphological marker of definiteness on both nouns and adjectives."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-62",
"text": "If we first translate a definite adjective and then an indefinite noun, it will probably not be a likely sequence according to the translation model."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-63",
"text": "This kind of intuition underlies the model of Niehues et al. (2011) , a bilingual LM (BiLM), which defines elementary translation events t 1 , ..., t n as follows:"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-64",
"text": "where e i is the i-th target word and A : E \u2192 P(F ) is an alignment function, E and F referring to target and source sentences, and P(\u00b7) is the powerset function."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-65",
"text": "In other words, the i-th translation event consists of the i-th target word and all source words aligned to it."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-66",
"text": "Niehues et al. (2011) refer to the defined translation events t i as bilingual tokens and we adopt this terminology."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-67",
"text": "There are alternative definitions of bilingual language models."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-68",
"text": "Our choice of the above definition is supported by the fact that it produces an unambiguous segmentation of a parallel sentence into tokens."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-69",
"text": "Ambiguous segmentation is undesirable because it increases the token vocabulary, and thus the model sparsity."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-70",
"text": "Another disadvantage comes from the fact that we want to compare permutations of the same set of elements."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-71",
"text": "For example, the two different segmentations of ba into [ba] and [b] [a] still represent the same permutation of the sequence ab."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-72",
"text": "In Figure 1 one can produce a segmentation of (AsEAr Albtrwl, oil prices) into (Albtrwl, oil) and (AsEAr, prices) or leave it as is. If we allow for both segmentations, the learnt probability parameters may be different for the sum of (Albtrwl, oil) and (AsEAr, prices) and for the unsegmented phrase."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-73",
"text": "Durrani et al. (2011) introduce an alternative method for unambiguous bilingual segmentation where tokens are defined as minimal phrases, called minimal translation units (MTUs)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-74",
"text": "Figure 1 compares the BiLM and MTU tokenization for a specific example."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-75",
"text": "Since Niehues et al. (2011) have shown their model to work successfully as an additional feature in combination with commonly used standard phrase-based features, we use their approach as the main point of reference and base our approach on their segmentation method."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-76",
"text": "In the rest of the text we refer to Niehues et al. (2011) as the original BiLM."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-77",
"text": "4 At the same time, we do not see any specific obstacles for combining our work with MTUs."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-79",
"text": "**SUITABILITY OF LEXICALIZED BILM TO MODEL REORDERING**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-80",
"text": "As mentioned in the introduction, lexical information is not very well-suited to capture reordering regularities."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-81",
"text": "Consider Figure 2 .a."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-82",
"text": "The extracted sequence of bilingual tokens is produced by aligning source words with respect to target words (so that they are in the same order), as demonstrated by the shaded part of the picture."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-83",
"text": "If we substituted the Arabic translation of Egyptian for the Arabic translation of Israeli, the reordering should remain the same. What matters for reordering is the syntactic role or context of a word."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-84",
"text": "By using unnecessarily fine-grained categories we risk running into sparsity issues."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-85",
"text": "Niehues et al. (2011) also described an alternative variant of the original BiLM, where words are substituted by their POS tags (Figure 2 .a, shaded part)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-86",
"text": "Also, however, POS information by itself may be insufficiently expressive to separate cor- , it still is a likely sequence."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-87",
"text": "Indeed, the log-probabilities of the two sequences with respect to a 4-gram BiLM model 5 result in a higher probability of \u221210.25 for the incorrect reordering than for the correct one (\u221210.39)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-88",
"text": "Since fully lexicalized bilingual tokens suffer from data sparsity and POS-based bilingual tokens are insufficiently expressive, the question is which level of syntactic information strikes the right balance between expressiveness and generality."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-89",
"text": "5 Section 4 contains details about data and software setup."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-91",
"text": "**BILM WITH DEPENDENCY INFORMATION**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-92",
"text": "Dependency grammar is commonly used in NLP to formalize role-based relations between words."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-93",
"text": "The intuitive notion of syntactic modification is captured by the primitive binary relation of dependence."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-94",
"text": "Dependency relations do not change with the linear order of words ( Figure 2 ) and therefore can provide a characterization of a word's syntactic class that invariant under reordering."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-95",
"text": "If we incorporate dependency relations into the representation of bilingual tokens, the incorrect reordering in Figure 2 .b will produce a highly unlikely sequence."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-96",
"text": "For example, we can substitute each source word with its POS tag and its parent's POS tag (Figure 3 )."
},
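This substitution of each source word by its own POS tag and its parent's POS tag can be sketched as follows. The sketch is ours, not the paper's code; the head-index representation of the parse and the "parent->child" label format are illustrative assumptions (the arrow direction follows the notation introduced later, from parent to child).

```python
# Illustrative sketch: replace each source word by its parent's POS tag plus
# its own POS tag, as in the Pos->Pos variant. The parse is assumed to be a
# list of head indices (-1 for the root) parallel to the POS-tag list.

ROOT = "ROOT"  # placeholder tag we chose for the parent of the root word

def pos_parent_labels(pos_tags, heads):
    labels = []
    for i, tag in enumerate(pos_tags):
        parent = ROOT if heads[i] < 0 else pos_tags[heads[i]]
        labels.append(f"{parent}->{tag}")
    return labels

# Toy example: "oil prices rose" with "rose" as the root.
tags = ["NN", "NNS", "VBD"]
heads = [1, 2, -1]   # oil -> prices, prices -> rose, rose -> ROOT
print(pos_parent_labels(tags, heads))
# ['NNS->NN', 'VBD->NNS', 'ROOT->VBD']
```

Because the head indices do not change when the words are permuted, the labels themselves are order-invariant, which is exactly the property exploited in the reordering example.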
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-97",
"text": "Again, we computed 4-gram log-probabilities for the corresponding sequences: the correct reordering results in a substantially higher probability of \u221210.58 than the incorrect one (\u221213.48)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-98",
"text": "We may consider situations where more fine-grained distinctions are required."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-99",
"text": "In the next section, we explore different representations based on source dependency trees."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-101",
"text": "**DEPENDENCY-BASED BILM**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-102",
"text": "In this section, we introduce our model which combines the BiLM from Niehues et al. (2011) with source dependency information."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-103",
"text": "We further give details on how the proposed models are trained and integrated into a phrase-based decoder."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-105",
"text": "**THE GENERAL FRAMEWORK**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-106",
"text": "In the previous section we outlined our framework as composed of two steps: First, a parallel sentence is tokenized according to the BiLM model (Niehues et al., 2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-107",
"text": "Next, words in the bilingual tokens are substituted with their contextual properties."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-108",
"text": "It is thus convenient to use the following generalized definition for a token sequence t 1 ...t n in our framework:"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-109",
"text": "where e i is the i-th target word, A : E \u2192 P(F ) is an alignment function, F and E are source and target sentences, and ContE and ContF are target and source contextual functions, respectively."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-110",
"text": "A contextual function returns a word's contextual property, based on its sentential context (source or target)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-111",
"text": "See Figure 4 for an example of a sequence of BiLM tokens with a ContF defined as returning the POS tag of the source word combined with the POS tags of its parent, grandparent and siblings, and ContE defined as an identity function (see Section 3.2 for a detailed explanation of the functions and notation)."
},
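The generalized definition can be sketched with pluggable contextual functions. This is a sketch under our own assumptions, not the authors' code: contextual functions are modeled as callables over sentence positions (so they can in principle inspect any sentential context), and the concrete `pos_f`/`lex_e` functions below are illustrative stand-ins.

```python
# Illustrative sketch of the generalized framework: a bilingual token pairs
# ContE applied to the target word with ContF applied to each aligned source
# word. cont_f / cont_e take positions, so they may consult arbitrary
# sentential context (parse, tags, neighbors).

def generalized_tokens(source, target, alignment, cont_f, cont_e):
    aligned = {i: [] for i in range(len(target))}
    for s, t in sorted(alignment):
        aligned[t].append(s)
    return [(cont_e(i), tuple(cont_f(s) for s in aligned[i]))
            for i in range(len(target))]

src = ["AsEAr", "Albtrwl"]
src_pos = ["NN", "NN"]
tgt = ["oil", "prices"]
align = [(1, 0), (0, 1)]

lex_e = lambda i: tgt[i]        # identity contextual function on the target side
pos_f = lambda s: src_pos[s]    # POS contextual function on the source side
print(generalized_tokens(src, tgt, align, pos_f, lex_e))
# [('oil', ('NN',)), ('prices', ('NN',))]
```

Plugging in the identity function on both sides recovers the original lexicalized BiLM, while richer source functions yield the dependency-based variants discussed next.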
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-112",
"text": "In this work we focus on source contextual functions (ContF )."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-113",
"text": "We also exploit some very simple target contextual functions, but do not go into an in-depth exploration."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-115",
"text": "**DEPENDENCY-BASED CONTEXTUAL FUNCTIONS**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-116",
"text": "In NLP approaches exploiting dependency structure, two kinds of relations are of special importance: the parent-child relation and the sibling relation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-117",
"text": "Shen et al. (2008) work with two wellformed dependency structures, both of which are defined in such a way that there is one common parent and a set of siblings."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-118",
"text": "Li et al. (2012) characterize rules in hierarchical SMT by labeling them with the POS tags of the parents of the words inside the rule."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-119",
"text": "Lerner and Petrov (2013) model reordering as a sequence of classification steps based on a dependency parse of a sentence."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-120",
"text": "Their model first decides how a word is reordered with respect to its parent and then how it is reordered with respect to its siblings."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-121",
"text": "Based on these previous approaches, we propose to characterize contextual syntactic roles of a word in terms of POS tags of the words themselves and their relatives in a dependency tree."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-122",
"text": "It is straightforward to incorporate parent information since each node has a unique parent."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-123",
"text": "As for siblings information, we incorporate POS tags of the closest sibling to the left and the closest to the right."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-124",
"text": "We do not include all of the siblings to avoid overfitting."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-125",
"text": "In addition to these basic syntactic relations, we consider the grandparent relation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-126",
"text": "The following list is a summary of the source contextual functions that we use."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-127",
"text": "We describe a function with respect to the kind of contextual property of a word it returns: (i) the word itself (Lex); (ii) POS label of the word (Pos); (iii) POS label of the word's parent; (iv) POS of the word's closest sibling to the left, concatenated with the POS tag of the closest sibling to the right; (v) the POS label of the word's grandparent."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-128",
"text": "We use target-side contextual functions returning: (i) an empty string, (ii) POS of the word, (iii) the word itself."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-129",
"text": "Notation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-130",
"text": "We do not use the above functions separately to define individual BiLM models, but use combinations of these functions."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-131",
"text": "We use the following notation for function combinations: \"\u2022\" horizontally connects source (on the left) and target (on the right) contextual functions for a given model."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-132",
"text": "For example, Lex\u2022Lex refers to the original (lexicalized) BiLM."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-133",
"text": "We use arrows (\u2192) to designate parental information (the arrow goes from parent to child)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-134",
"text": "Pos\u2192Pos refers to a combination of a function returning the POS of a word and the POS of its parent (as in Figure 3 )."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-135",
"text": "Pos\u2192Pos\u2192Pos is a combination of the previous with the function returning the grandparent's POS."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-136",
"text": "Finally, we use +sibl to indicate the use of the sibling function described above: For example, Pos\u2192Pos+sibl is a source function that returns the word's POS, its parent's POS and the POS labels of the closest siblings to left and right."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-137",
"text": "6 Pos+sibl\u2192Pos is a source function returning the word's own POS, the POS of a word's parent, and the POS tags of the parent's siblings (left-and right-adjacent)."
},
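The +sibl property (the POS of the closest sibling to the left concatenated with the POS of the closest sibling to the right) can be sketched as below. This is our illustration, not the paper's code; the head-index parse representation and the "-" placeholder for a missing sibling are assumptions we introduce for the example.

```python
# Illustrative sketch of the +sibl contextual property: for each word, the
# POS of its closest left sibling concatenated with the POS of its closest
# right sibling ("-" is our placeholder when a sibling is missing).

def sibling_labels(pos_tags, heads):
    # group children under each head, in surface order
    children = {}
    for i, h in enumerate(heads):
        children.setdefault(h, []).append(i)
    labels = []
    for i in range(len(pos_tags)):
        sibs = children[heads[i]]
        k = sibs.index(i)
        left = pos_tags[sibs[k - 1]] if k > 0 else "-"
        right = pos_tags[sibs[k + 1]] if k + 1 < len(sibs) else "-"
        labels.append(f"{left}+{right}")
    return labels

# Toy example: "the old man slept"; "the" and "old" attach to "man",
# "man" attaches to the root "slept".
tags = ["DT", "JJ", "NN", "VBD"]
heads = [2, 2, 3, -1]
print(sibling_labels(tags, heads))
# ['-+JJ', 'DT+-', '-+-', '-+-']
```

Restricting the property to the two adjacent siblings, rather than all of them, keeps the label vocabulary small, matching the overfitting argument above.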
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-138",
"text": "Figure 4 represents the sentence from Figure 2 during decoding in a system with an integrated Pos\u2192Pos\u2192Pos+sibl\u2022Lex feature."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-139",
"text": "It shows the sequence of produced bilingual tokens and corresponding labels in the introduced notation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-141",
"text": "**TRAINING**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-142",
"text": "Training of dependency-based BiLMs consists of a sequence of extraction steps: After having produced word-alignments for a bitext (Section 4), sentences are segmented according to Equation 3."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-143",
"text": "We produce a dependency parse of a source sentence and a POS-tag labeling of a target sentence."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-144",
"text": "For Chinese, we use the Stanford dependency parser (Chang et al., 2009 )."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-145",
"text": "For Arabic a dependency parser is not available for public use, so we produce a constituency parse with the Stanford parser (Green and Manning, 2010) and extract dependencies based on the rules in Collins (1999) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-146",
"text": "For English POS-tagging, we use the Stanford POS-tagger (Toutanova et al., 2003) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-147",
"text": "After having produced a labeled sequence of tokens, we learn a 5-gram model using SRILM (Stolcke et al., 2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-148",
"text": "Kneyser-Ney smoothing is used for all model variations except for Pos\u2022Pos where Witten-Bell smoothing is used due to zero countof-counts."
},
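The extraction steps above reduce to writing the labeled token stream out as plain text and handing it to SRILM. The following is a minimal sketch, not the authors' pipeline: the label format `VBD->NN_oil` and the file paths are illustrative, and the SRILM call is shown but left commented out since it requires `ngram-count` to be installed.

```python
# Illustrative sketch of BiLM training: serialize labeled bilingual tokens
# (one sentence per line) and train an n-gram model over the stream with
# SRILM's ngram-count. Label format and paths are made up for the example.

import os
import tempfile

def write_token_stream(labeled_sentences, path):
    with open(path, "w", encoding="utf-8") as f:
        for tokens in labeled_sentences:
            f.write(" ".join(tokens) + "\n")

sentences = [["VBD->NN_oil", "ROOT->VBD_rose"]]   # hypothetical labeled tokens
stream = os.path.join(tempfile.mkdtemp(), "bilm.txt")
write_token_stream(sentences, stream)

# 5-gram model with (modified) Kneser-Ney smoothing; -wbdiscount would
# select Witten-Bell instead, as used for the Pos*Pos variant:
cmd = ["ngram-count", "-order", "5", "-kndiscount", "-interpolate",
       "-text", stream, "-lm", "bilm.5gram.lm"]
# import subprocess; subprocess.run(cmd, check=True)  # requires SRILM

print(open(stream, encoding="utf-8").read().strip())
```

Treating the bilingual tokens as ordinary string tokens is what later allows the model to be queried like any n-gram LM during decoding.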
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-150",
"text": "**DECODER INTEGRATION**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-151",
"text": "Dependency-based BiLMs are integrated into our phrase-based SMT decoder as follows: Before translating a sentence, we produce its dependency parse."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-152",
"text": "Phrase-internal word-alignments, needed to segment the translation hypothesis into tokens, are stored in the phrase table, based on the most frequent internal alignment observed during training."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-153",
"text": "Likewise, we store the most likely target-side POS-labeling for each phrase pair."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-154",
"text": "The decoding algorithm is augmented with one additional feature function and one additional, corresponding feature weight."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-155",
"text": "At each step of the derivation, as a new phrase pair is added to the partial translation hypothesis, this function segments the new phrase into bilingual tokens (given the internal alignment information) and substitutes the words in the phrase pair with syntactic labels (given the source parse and the target POS labeling associated with the phrase)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-156",
"text": "The new syntactified bilingual tokens are added to the stack of preceding n\u22121 tokens, and the feature function computes the weighted updated model probability."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-157",
"text": "During decoding, the probabilities of the BiLMs are computed in a stream-based fashion, with bilingual tokens as string tokens, and not in a class-based fashion, with syntactic source-side representations emitting the corresponding target words (Bisazza and Monz, 2014) ."
},
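The incremental scoring described above (keep the last n-1 syntactified tokens, score each newly added token against that history) can be sketched as a small feature-function class. This is our sketch, not the decoder's actual code: the backing model here is a toy dictionary of n-gram log-probabilities with a fixed unknown-event penalty, standing in for the trained BiLM.

```python
# Illustrative sketch of decoder integration: a feature function that
# maintains the stack of the preceding n-1 bilingual tokens and returns the
# incremental log-probability when a new phrase pair's tokens are appended.

from collections import deque

class BiLMFeature:
    def __init__(self, logprobs, order=5, unk=-10.0):
        # logprobs: dict mapping n-gram tuples (context + token) to log-probs
        self.logprobs, self.order, self.unk = logprobs, order, unk
        self.history = deque(maxlen=order - 1)

    def score_phrase(self, new_tokens):
        """Score each new syntactified token given its truncated history."""
        total = 0.0
        for tok in new_tokens:
            ngram = tuple(self.history) + (tok,)
            total += self.logprobs.get(ngram, self.unk)
            self.history.append(tok)
        return total

# Toy bigram model: P(a) and P(b | a) only.
model = {("a",): -1.0, ("a", "b"): -0.5}
f = BiLMFeature(model, order=2)
print(f.score_phrase(["a", "b"]))   # -1.0 + (-0.5) = -1.5
```

The returned increment would be multiplied by the tuned feature weight and added to the hypothesis score, like any other log-linear feature.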
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-159",
"text": "**EXPERIMENTS 4.1 SETUP**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-160",
"text": "We conduct translation experiments with a baseline PBSMT system with additionally one of the dependency-based BiLM feature functions specified in Section 3."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-161",
"text": "We compare the translation performance to a baseline PBSMT system and to a baseline augmented with the original BiLMs from (Niehues et al., 2011) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-162",
"text": "Word-alignment is produced with GIZA++ (Och and Ney, 2003) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-163",
"text": "We use an in-house implementation of a PBSMT system similar to Moses (Koehn et al., 2007) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-164",
"text": "Our baseline contains all standard PBSMT features including language model, lexical weighting, and lexicalized reordering."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-165",
"text": "The distortion limit is set to 5."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-166",
"text": "A 5-gram LM is trained on the English Gigaword corpus (1.6B tokens) using SRILM with modified Kneyser-Ney smoothing and interpolation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-167",
"text": "The BiLMs were trained as described in Section 3.3."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-168",
"text": "Information about the parallel data used for training the Arabic-English 7 and Chinese-English systems 8 is 7 The following Arabic-English parallel corpora were used: LDC2006E25, LDC2004T18, several gale corpora, LDC2004T17, LDC2005E46, LDC2007T08, LDC2004E13."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-169",
"text": "8 The following Chinese-English parallel corpora were used: LDC2002E18, LDC2002L27, LDC2003E07, LDC2003E14, LDC2005T06, LDC2005T10, LDC2005T34, Table 3 : Different combinations of a target contextual function with the Pos\u2192Pos source contextual function for Arabic-English."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-170",
"text": "See Table 2 for the notation regarding statistical significance."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-171",
"text": "shown in Table 1 ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-172",
"text": "The feature weights were tuned by using pairwise ranking optimization (Hopkins and May, 2011) on the MT04 benchmark (for both language pairs)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-173",
"text": "During tuning, 14 PRO parameter estimation runs are performed in parallel on different samples of the n-best list after each decoder iteration."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-174",
"text": "The weights of the individual PRO runs are then averaged and passed on to the next decoding iteration."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-175",
"text": "Performing weight estimation independently for a number of samples corrects for some of the instability that can be caused by individual samples."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-176",
"text": "For testing, we used MT08 and MT09 for Arabic, and MT06 and MT08 for Chinese."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-177",
"text": "We use approximate randomization (Noreen, 1989; Riezler and Maxwell, 2005) to test for statistically significant differences."
},
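Approximate randomization, as cited here, tests whether the observed score difference could arise by chance under random swaps of per-sentence outputs between the two systems. The sketch below is our illustration under simplifying assumptions: it shuffles plain per-sentence scores, whereas a faithful BLEU/TER test would shuffle sentence-level sufficient statistics before recomputing the corpus metric.

```python
# Illustrative sketch of approximate randomization: randomly swap the two
# systems' per-sentence scores and count how often the shuffled difference
# is at least as large as the observed one (two-sided).

import random

def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        sa = sb = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this sentence's outputs
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # smoothed p-value

a = [0.30, 0.35, 0.28, 0.40, 0.33]   # made-up per-sentence scores
b = [0.25, 0.31, 0.27, 0.36, 0.30]
print(approx_randomization(a, b, trials=2000))
```

A small p-value indicates the systems' difference is unlikely under the null hypothesis that their outputs are exchangeable.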
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-178",
"text": "In the next two subsections we discuss the general results for Arabic and Chinese, where we use case-insensitive BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) as evaluation metrics."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-179",
"text": "This is followed by a preliminary analysis of observed reorderings where we compare 4-gram precision results and conduct experiments with an increased distortion limit."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-181",
"text": "**ARABIC-ENGLISH TRANSLATION EXPERIMENTS**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-182",
"text": "We are interested in how a translation system with an integrated dependency-based BiLM feaand several gale corpora."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-183",
"text": "ture performs as compared to the standard PB-SMT baseline and, more importantly, to the original BiLM model."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-184",
"text": "We consider two variants of BiLM discussed by Niehues et al. (2011) : the standard one, Lex\u2022Lex, and the simplest syntactic one, Pos\u2022Pos."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-185",
"text": "Results for the experiments can be found in Table 2 ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-186",
"text": "In the discussion below we mostly focus on the experimental results for the large, combined test set MT08+MT09."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-187",
"text": "Table 2 .a-b compares the performance of the baseline and original BiLM systems."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-188",
"text": "Lex\u2022Lex yields strongly significant improvements over the baseline for BLEU and weakly significant improvements for TER."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-189",
"text": "Therefore, for the rest of the experiments we are interested in obtaining further improvements over Lex\u2022Lex."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-190",
"text": "Pos\u2192Pos\u2022Pos (Table 2 .c) demonstrates the effect of adding minimal dependency information to a BiLM."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-191",
"text": "9 It results in strongly significant improvements over the baseline and weak improvements over Lex\u2022Lex in terms of BLEU."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-192",
"text": "We additionally ran experiments with the different target functions (Table 3 )."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-193",
"text": "\u2022Pos shows the highest results, and \u2022 the lowest ones: this implies that a rather expressive source syntactic representation alone still benefits from target-side syntactic information."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-194",
"text": "Below, our dependency-based systems only use \u2022Pos."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-195",
"text": "Next, we tested the effect of adding more source 9 Additional significance testing, which is not shown in Table 2 , shows a strongly significant improvement over the original syntactic BiLM Pos\u2022Pos."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-196",
"text": "(Table 2 .e) shows the best results overall for BLEU, although it must be pointed out that the difference with Pos\u2192Pos\u2022Pos is very small."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-197",
"text": "With respect to TER, Pos\u2192Pos\u2022Pos outperforms the grandparent variant."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-198",
"text": "So far, we can conclude that source parent information helps improve translation performance."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-199",
"text": "Increased specificity of a parent (parent specified by a grandparent) tends to further improve performance."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-200",
"text": "Up to now, we have only used syntactic information and obtained considerable improvements over Pos\u2022Pos, surpassing the improvement provided by Lex\u2022Lex."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-201",
"text": "Can we gain further improvements by also adding lexical information?"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-202",
"text": "To this end, we conduct experiments combining the best performing dependency-based BiLM (Pos\u2192Pos\u2192Pos\u2022Pos) and the lexicalized BiLM (Lex\u2022Lex)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-203",
"text": "We hypothesize that the two models improve different aspects of translation: Lex\u2022Lex is biased towards improving lexical choice and Pos\u2192Pos\u2192Pos\u2022Pos towards improving reordering."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-204",
"text": "Combining these two models, we may improve both aspects."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-205",
"text": "The metric results for the combined set indeed support this hypothesis (Table 2 .f)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-206",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-207",
"text": "**CHINESE-ENGLISH TRANSLATION EXPERIMENTS**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-208",
"text": "The results of the Chinese-English experiments are shown in Table 4 ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-209",
"text": "In the discussion below we mostly focus on the experimental results for the large, combined test set MT06+MT08."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-210",
"text": "We observe the same general pattern for the Pos\u2192Pos source function (Table 4 .c) as for Arabic-English: the system with the \u2022Pos target function has the highest scores (Table 5 )."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-211",
"text": "All of the Pos\u2192Pos\u2022 configurations show statistically significant improvements over the PBSMT baseline."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-212",
"text": "For TER, two of the three Pos\u2192Pos\u2022 variants significantly outperform Lex\u2022Lex."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-213",
"text": "The system with sibling information (Table 4 .d) obtains quite low BLEU results, just as in the Arabic experiments."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-214",
"text": "On the other hand, its TER results are the highest overall."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-215",
"text": "The system with the Pos\u2192Pos\u2192Pos\u2022Pos function (Table 4 .e) achieves the best results among dependency-based BiLMs for BLEU."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-216",
"text": "Finally, combining Pos\u2192Pos\u2192Pos\u2022Pos and Lex\u2022Lex results in the largest and significant improvements over all competing systems for BLEU."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-218",
"text": "**PRELIMINARY ANALYSIS OF REORDERING IN TRANSLATION EXPERIMENTS**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-219",
"text": "In general, the experimental results show that using source dependency information yields consistent improvements for translating from Arabic and Chinese into English."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-220",
"text": "On the other hand, we have pointed out some discrepancies between the two metrics employed, suggesting that different system configurations may improve different aspects of translation."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-221",
"text": "To this end, we conducted some additional evaluations to understand how reordering is affected by the proposed features."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-222",
"text": "We use 4-gram precision as a metric of how much of the reference set word order is preserved."
},
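The 4-gram precision used here (the fraction of hypothesis 4-grams that also occur in the reference, with clipped counts) can be computed as follows. This is a minimal sketch of the standard n-gram precision computation, not the authors' evaluation script.

```python
# Illustrative sketch of clipped 4-gram precision, used here as a proxy for
# how much of the reference word order a hypothesis preserves.

from collections import Counter

def ngram_precision(hyp, ref, n=4):
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0

ref = "oil prices rose sharply on monday".split()
hyp_good = "oil prices rose sharply on friday".split()
hyp_reordered = "prices oil rose on sharply friday".split()
print(ngram_precision(hyp_good, ref))        # 2/3: word order largely kept
print(ngram_precision(hyp_reordered, ref))   # 0.0: order destroyed
```

Because a single reordering breaks every 4-gram spanning it, this measure is more sensitive to word-order errors than unigram-level metrics, which motivates its use in this analysis.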
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-223",
"text": "Table 6 shows the corresponding results for both languages."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-224",
"text": "Just as in the previous two sections, configurations with parental information produce the best results."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-225",
"text": "For Arabic, all of the dependency configurations outperform Lex\u2022Lex."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-226",
"text": "But the system with two feature functions, one of which is Lex\u2022Lex, still obtains the best results, which may suggest that the lexicalized BiLM also helps to differentiate between word orders."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-227",
"text": "For Chinese, Pos\u2192Pos\u2192Pos\u2022Pos and the system combining the latter and Lex\u2022Lex also obtain the best results."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-228",
"text": "However, other dependency-based configurations do not outperform Lex\u2022Lex."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-229",
"text": "All the experiments so far were run with a distortion limit of 5. But both of the languages, especially Chinese, often require reorderings over a longer distance."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-230",
"text": "We performed additional experiments with a distortion limit of 10 for the Lex\u2022Lex and Pos\u2192Pos\u2192Pos\u2022Pos systems (Tables 7 and 8) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-231",
"text": "It is more difficult to translate with a higher distortion limit (Green et al., 2010) as the set of permutations grows larger thereby making it more difficult to differentiate between correct and incorrect continuations of the current hypothesis."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-232",
"text": "It has also been noted that higher distortion limits are more likely to result in improvements for Chinese rather than Arabic to English translation (Chiang, 2007; Green et al., 2010) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-233",
"text": "We compared performance of fixed BiLM models at distortion lengths of 5 and 10."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-234",
"text": "ArabicEnglish results did not reveal statistically significant differences between the two distortion limits for Pos\u2192Pos\u2192Pos\u2022Pos."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-235",
"text": "On the other hand, for Lex\u2022Lex BLEU decreases when using a distortion limit of 10 compared to a limit of 5."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-236",
"text": "This implies that the dependency BiLM is more robust in the more challenging reordering setting than the lexicalized BiLM."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-237",
"text": "Chinese-English results for Pos\u2192Pos\u2192Pos\u2022Pos do show significant improvements over the distortion limit of 5 (up to 0.49 BLEU higher than the best result in Table 4 )."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-238",
"text": "This indicates that the dependency-based BiLM is better capable to take advantage of the increased distortion limit and discriminate between correct and incorrect reordering choices."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-239",
"text": "Comparing the results for Pos\u2192Pos\u2192Pos\u2022Pos and Lex\u2022Lex at a distortion limit of 10, we obtain strongly significant improvements for all metrics."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-240",
"text": "For Chinese, a larger distortion limit helps for both configurations, but more so for our dependency BiLM, yielding an improvement of 0.98 BLEU over the original, lexicalized BiLM (Table 8) ."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-241",
"text": "----------------------------------"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-242",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-243",
"text": "In this paper, we have introduced a simple, yet effective way to include syntactic information into phrase-based SMT."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-244",
"text": "Our method consists of enriching the representation of units of a bilingual language model (BiLM)."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-245",
"text": "We argued that the very limited contextual information used in the original bilingual models (Niehues et al., 2011) can capture reorderings only to a limited degree and proposed a method to incorporate information from a source dependency tree in bilingual units."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-246",
"text": "In a series of translation experiments we performed a thorough comparison between various syntacticallyenriched BiLMs and competing models."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-247",
"text": "The results demonstrated that adding syntactic information from a source dependency tree to the representations of bilingual tokens in an n-gram model can yield statistically significant improvements over the competing systems."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-248",
"text": "A number of additional evaluations provided an indication for better modeling of reordering phenomena."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-249",
"text": "The proposed dependency-based BiLMs resulted in an increase in 4-gram precision and provided further significant improvements over all considered metrics in experiments with an increased distortion limit."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-250",
"text": "In this paper, we have focused on rather elementary dependency relations, which we are planning to expand on in future work."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-251",
"text": "Our current approach is still strictly tied to the number of target tokens."
},
{
"sent_id": "d8e73e9c00acffc34ade1331709d92-C001-252",
"text": "In particular, we are interested in exploring ways to better capture the notion of syntactic cohesion in translation (Fox, 2002; Cherry, 2008)"
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d8e73e9c00acffc34ade1331709d92-C001-2",
"d8e73e9c00acffc34ade1331709d92-C001-5"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-20",
"d8e73e9c00acffc34ade1331709d92-C001-21"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-22"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-27",
"d8e73e9c00acffc34ade1331709d92-C001-28"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-40"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-60",
"d8e73e9c00acffc34ade1331709d92-C001-61",
"d8e73e9c00acffc34ade1331709d92-C001-62",
"d8e73e9c00acffc34ade1331709d92-C001-63",
"d8e73e9c00acffc34ade1331709d92-C001-64",
"d8e73e9c00acffc34ade1331709d92-C001-65"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-66",
"d8e73e9c00acffc34ade1331709d92-C001-68"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-85",
"d8e73e9c00acffc34ade1331709d92-C001-86",
"d8e73e9c00acffc34ade1331709d92-C001-87",
"d8e73e9c00acffc34ade1331709d92-C001-88"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-102",
"d8e73e9c00acffc34ade1331709d92-C001-103"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-184",
"d8e73e9c00acffc34ade1331709d92-C001-185"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-245"
]
],
"cite_sentences": [
"d8e73e9c00acffc34ade1331709d92-C001-5",
"d8e73e9c00acffc34ade1331709d92-C001-20",
"d8e73e9c00acffc34ade1331709d92-C001-22",
"d8e73e9c00acffc34ade1331709d92-C001-27",
"d8e73e9c00acffc34ade1331709d92-C001-40",
"d8e73e9c00acffc34ade1331709d92-C001-63",
"d8e73e9c00acffc34ade1331709d92-C001-66",
"d8e73e9c00acffc34ade1331709d92-C001-85",
"d8e73e9c00acffc34ade1331709d92-C001-102",
"d8e73e9c00acffc34ade1331709d92-C001-184",
"d8e73e9c00acffc34ade1331709d92-C001-245"
]
},
"@SIM@": {
"gold_contexts": [
[
"d8e73e9c00acffc34ade1331709d92-C001-20",
"d8e73e9c00acffc34ade1331709d92-C001-21"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-66",
"d8e73e9c00acffc34ade1331709d92-C001-68"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-102",
"d8e73e9c00acffc34ade1331709d92-C001-103"
]
],
"cite_sentences": [
"d8e73e9c00acffc34ade1331709d92-C001-20",
"d8e73e9c00acffc34ade1331709d92-C001-66",
"d8e73e9c00acffc34ade1331709d92-C001-102"
]
},
"@USE@": {
"gold_contexts": [
[
"d8e73e9c00acffc34ade1331709d92-C001-23",
"d8e73e9c00acffc34ade1331709d92-C001-24"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-75"
]
],
"cite_sentences": [
"d8e73e9c00acffc34ade1331709d92-C001-23",
"d8e73e9c00acffc34ade1331709d92-C001-75"
]
},
"@EXT@": {
"gold_contexts": [
[
"d8e73e9c00acffc34ade1331709d92-C001-23",
"d8e73e9c00acffc34ade1331709d92-C001-24"
]
],
"cite_sentences": [
"d8e73e9c00acffc34ade1331709d92-C001-23"
]
},
"@MOT@": {
"gold_contexts": [
[
"d8e73e9c00acffc34ade1331709d92-C001-27",
"d8e73e9c00acffc34ade1331709d92-C001-28"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-40"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-85",
"d8e73e9c00acffc34ade1331709d92-C001-86",
"d8e73e9c00acffc34ade1331709d92-C001-87",
"d8e73e9c00acffc34ade1331709d92-C001-88"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-245"
]
],
"cite_sentences": [
"d8e73e9c00acffc34ade1331709d92-C001-27",
"d8e73e9c00acffc34ade1331709d92-C001-40",
"d8e73e9c00acffc34ade1331709d92-C001-85",
"d8e73e9c00acffc34ade1331709d92-C001-245"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"d8e73e9c00acffc34ade1331709d92-C001-76"
],
[
"d8e73e9c00acffc34ade1331709d92-C001-161"
]
],
"cite_sentences": [
"d8e73e9c00acffc34ade1331709d92-C001-76",
"d8e73e9c00acffc34ade1331709d92-C001-161"
]
}
}
},
"ABC_3dbdf61d07a3e35ac1b6ecc7ab3999_5": {
"x": [
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-199",
"text": "We applied this DE classifier to the Chinese sentences of MT data, and we also reordered the constructions that required reordering to better match their English translations."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-200",
"text": "The MT experiments showed our preprocessing gave significant BLEU and TER score gains over the baselines."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-201",
"text": "Based on our classification and MT experiments, we found that not only do we have better rules for deciding what to reorder, but the syntactic, semantic, and discourse information that we capture in the Chinese sentence allows us to give hints to the MT system which allows better translations to be chosen."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-2",
"text": "Linking constructions involving (DE) are ubiquitous in Chinese, and can be translated into English in many different ways."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-3",
"text": "This is a major source of machine translation error, even when syntaxsensitive translation models are used."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-4",
"text": "This paper explores how getting more information about the syntactic, semantic, and discourse context of uses of (DE) can facilitate producing an appropriate English translation strategy."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-5",
"text": "We describe a finergrained classification of (DE) constructions in Chinese NPs, construct a corpus of annotated examples, and then train a log-linear classifier, which contains linguistically inspired features."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-6",
"text": "We use the DE classifier to preprocess MT data by explicitly labeling (DE) constructions, as well as reordering phrases, and show that our approach provides significant BLEU point gains on MT02 (+1.24), MT03 (+0.88) and MT05 (+1.49) on a phrasedbased system."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-7",
"text": "The improvement persists when a hierarchical reordering model is applied."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-10",
"text": "Machine translation (MT) from Chinese to English has been a difficult problem: structural differences between Chinese and English, such as the different orderings of head nouns and relative clauses, cause BLEU scores to be consistently lower than for other difficult language pairs like Arabic-English."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-11",
"text": "Many of these structural differences are related to the ubiquitous Chinese (DE) construction, used for a wide range of noun modification constructions (both single word and clausal) and other uses."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-12",
"text": "Part of the solution to dealing with these ordering issues is hierarchical decoding, such as the Hiero system (Chiang, 2005) , a method motivated by (DE) examples like the one in Figure 1 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-13",
"text": "In this case, the translation goal is to rotate the noun head and the preceding relative clause around (DE) , so that we can translate to \"[one of few countries] [have diplomatic relations with North Korea]\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-14",
"text": "Hiero can learn this kind of lexicalized synchronous grammar rule."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-15",
"text": "But use of hierarchical decoders has not solved the DE construction translation problem."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-16",
"text": "We analyzed the errors of three state-of-the-art systems (the 3 DARPA GALE phase 2 teams' systems), and even though all three use some kind of hierarchical system, we found many remaining errors related to reordering."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-17",
"text": "One is shown here: None of the teams reordered \"bad reputation\" and \"middle school\" around the ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-18",
"text": "We argue that this is because it is not sufficient to have a formalism which supports phrasal reordering, but it is also necessary to have sufficient linguistic modeling that the system knows when and how much to rearrange."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-19",
"text": "An alternative way of dealing with structural differences is to reorder source language sentences to minimize structural divergence with the target language, (Xia and McCord, 2004; Collins et al., 2005; Wang et al., 2007) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-20",
"text": "For example Wang et al. (2007) introduced a set of rules to decide if a (DE) construction should be reordered or not before translating to English:"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-21",
"text": "\u2022 For DNPs (consisting of\"XP+DEG\"):"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-22",
"text": "-Reorder if XP is PP or LCP; -Reorder if XP is a non-pronominal NP \u2022 For CPs (typically formed by \"IP+DEC\"):"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-23",
"text": "-Reorder to align with the \"that+clause\" structure of English."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-24",
"text": "Although this and previous reordering work has led to significant improvements, errors still remain."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-25",
"text": "Indeed, Wang et al. (2007) found that the precision of their NP rules is only about 54.6% on a small human-judged set."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-26",
"text": "One possible reason the (DE) construction remains unsolved is that previous work has paid insufficient attention to the many ways the (DE) construction can be translated and the rich structural cues to the translation."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-27",
"text": "Wang et al. (2007) (Chiang, 2005) classes. But our investigation shows that there are many strategies for translating Chinese [A B] phrases into English, including the patterns in Table 1, only some involving reversal."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-28",
"text": "Notice that the presence of reordering is only one part of the rich structure of these examples."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-29",
"text": "Some reorderings are relative clauses, while others involve prepositional phrases, but not all prepositional phrase uses involve reorderings."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-30",
"text": "These examples suggest that capturing finer-grained translation patterns could help achieve higher accuracy both in reordering and in lexical choice."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-31",
"text": "In this work, we propose to use a statistical classifier trained on various features to predict for a given Chinese (DE) construction both whether it will reorder in English and which construction it will translate to in English."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-32",
"text": "We suggest that the necessary classificatory features can be extracted from Chinese, rather than English."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-33",
"text": "The (DE) in Chinese has a unified meaning of 'noun modification', and the choice of reordering and construction realization are mainly a consequence of facts of English noun modification."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-34",
"text": "Nevertheless, most of the features that determine the choice of a felicitous translation are available in the Chinese source."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-35",
"text": "Noun modification realization has been widely studied in English (e.g., (Rosenbach, 2003) ), and many of the important determinative properties (e.g., topicality, animacy, prototypicality) can be detected working in the source language."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-36",
"text": "We first present some corpus analysis characterizing different DE constructions based on how they get translated into English (Section 2)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-37",
"text": "We then train a classifier to label DEs into the 5 different categories that we define (Section 3)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-38",
"text": "The fine-grained DEs, together with reordering, are then used as input to a statistical MT system (Section 4)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-39",
"text": "We find that classifying DEs into finergrained tokens helps MT performance, usually at least twice as much as just doing phrasal reordering."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-41",
"text": "**DE CLASSIFICATION**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-42",
"text": "The Chinese character DE serves many different purposes."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-43",
"text": "According to the Chinese Treebank tagging guidelines (Xia, 2000) , the character can be tagged as DEC, DEG, DEV, SP, DER, or AS."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-44",
"text": "Similar to (Wang et al., 2007) , we only consider the majority case when the phrase with (DE) is a noun phrase modifier."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-45",
"text": "The DEs in NPs have a part-of-speech tag of DEC (a complementizer or a nominalizer) or DEG (a genitive marker or an associative marker)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-47",
"text": "**CLASS DEFINITION**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-48",
"text": "The way we categorize the DEs is based on their behavior when translated into English."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-49",
"text": "This is implicitly done in the work of Wang et al. (2007) where they use rules to decide if a certain DE and the words next to it will need to be reordered."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-50",
"text": "Some NPs are translated into a hybrid of these categories, or just don't fit into one of the five categories, for instance, involving an adjectival premodifier and a relative clause."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-51",
"text": "In those cases, they are put into an \"other\" category."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-52",
"text": "1"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-54",
"text": "**DATA ANNOTATION OF DE CLASSES**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-55",
"text": "In order to train a classifier and test its performance, we use the Chinese Treebank 6.0 (LDC2007T36) and the English Chinese Translation Treebank 1.0 (LDC2007T02 Table 2 : 5-class and 2-class classification accuracy."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-56",
"text": "\"baseline\" is the heuristic rules in (Wang et al., 2007) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-57",
"text": "Others are various features added to the log-linear classifier."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-58",
"text": "Chinese sentences with the DE annotation and extract parse-related features from there."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-60",
"text": "**EXPERIMENTAL SETTING**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-61",
"text": "For the classification experiment, we exclude the \"other\" class and only use the 2882 examples that fall into the five pre-defined classes."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-62",
"text": "To evaluate the classification performance and understand what features are useful, we compute the accuracy by averaging five 10-fold cross-validations."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-63",
"text": "2 As a baseline, we use the rules introduced in Wang et al. (2007) to decide if the DEs require reordering or not."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-64",
"text": "However, since their rules only decide if there is reordering in an NP with DE, their classification result only has two classes."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-65",
"text": "So, in order to compare our classifier's performance with the rules in Wang et al. (2007) , we have to map our five-class results into two classes."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-66",
"text": "We mapped our five-class results into two classes."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-67",
"text": "So we mapped B preposition A and relative clause into the class \"reordered\", and the other three classes into \"not-reordered\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-69",
"text": "**FEATURE ENGINEERING**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-70",
"text": "To understand which features are useful for DE classification, we list our feature engineering steps and results in Table 2 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-71",
"text": "In Table 2 , the 5-class accuracy is defined by:"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-72",
"text": "(number of correctly labeled DEs) (number of all DEs) \u00d7 100"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-73",
"text": "The 2-class accuracy is defined similarly, but it is evaluated on the 2-class \"reordered\" and \"notreordered\" after mapping from the 5 classes."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-74",
"text": "The DEs we are classifying are within an NP; we refer to them as [A B] NP ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-75",
"text": "A includes all the words in the NP before ; B includes all the words in the NP after ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-76",
"text": "To illustrate, we will use the following NP: to show examples of each feature."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-77",
"text": "The parse structure of the NP is listed in Figure 2 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-78",
"text": "Figure 2: The parse tree of the Chinese NP."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-80",
"text": "**DEPOS: PART-OF-SPEECH TAG OF DE**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-81",
"text": "Since the part-of-speech tag of DE indicates its syntactic function, it is the first obvious feature to add."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-82",
"text": "The NP in Figure 2 will have the feature \"DEC\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-83",
"text": "This basic feature will be referred to as DEPOS."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-84",
"text": "Note that since we are only classifying DEs in NPs, ideally the part-of-speech tag of DE will either be DEC or DEG as described in Section 2."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-85",
"text": "However, since we are using automatic parses instead of gold-standard ones, the DEPOS feature might have other values than just DEC and DEG."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-86",
"text": "From Table 2 , we can see that with this simple feature, the 5-class accuracy is low but at least better than simply guessing the majority class (47.92%)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-87",
"text": "The 2-class accuracy is still lower than using the heuristic rules in (Wang et al., 2007) , which is reasonable because their rules encode more information than just the POS tags of DEs."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-88",
"text": "A-pattern: Chinese syntactic patterns appearing before"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-89",
"text": "Secondly, we want to incorporate the rules in (Wang et al., 2007) as features in the log-linear classifier."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-90",
"text": "We added features for certain indicative patterns in the parse tree (listed in Table 3 )."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-92",
"text": "**A IS ADJP:**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-93",
"text": "true if A+DE is a DNP which is in the form of \"ADJP+DEG\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-95",
"text": "**A IS QP:**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-96",
"text": "true if A+DE is a DNP which is in the form of \"QP+DEG\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-97",
"text": "3. A is pronoun: true if A+DE is a DNP which is in the form of \"NP+DEG\", and the NP is a pronoun."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-98",
"text": "4. A ends with VA: true if A+DE is a CP which is in the form of \"IP+DEC\", and the IP ends with a VP that's either just a VA or a VP preceded by a ADVP."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-99",
"text": "Features 1-3 are inspired by the rules in (Wang et al., 2007) , and the fourth rule is based on the observation that even though the predicative adjective VA acts as a verb, it actually corresponds to adjectives in English as described in (Xia, 2000) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-100",
"text": "3 We call these four features A-pattern."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-101",
"text": "Our example NP in Figure 2 will have the fourth feature \"A ends with VA\" in Table 3 , but not the other three features."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-102",
"text": "In Table 2 we can see that after adding A-pattern, the 2-class accuracy is already much higher than the baseline."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-103",
"text": "We attribute this to the fourth rule and also to the fact that the classifier can learn weights for each feature."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-104",
"text": "4"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-106",
"text": "**POS-NGRAM: UNIGRAMS AND BIGRAMS OF POS TAGS**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-107",
"text": "The POS-ngram feature adds all unigrams and bigrams in A and B. Since A and B have different influences on the choice of DE class, we distinguish their ngrams into two sets of features."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-108",
"text": "We also include the bigram pair across DE which gets another feature name for itself."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-109",
"text": "The example NP in Figure 2 will have these features (we use b to indicate boundaries):"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-110",
"text": "\u2022 POS unigrams in A: \"NR\", \"AD\", \"VA\" The part-of-speech ngram features add 4.24% accuracy to the 5-class classifier."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-112",
"text": "**LEXICAL: LEXICAL FEATURES**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-113",
"text": "In addition to part-of-speech features, we also tried to use features from the words themselves. But since using full word identity resulted in a sparsity issue, 5 we take the one-character suffix of each word and extract suffix unigram and bigram features from them."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-114",
"text": "The argument for using suffixes is that it often captures the larger category of the word (Tseng et al., 2005) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-115",
"text": "For example, (China) and (Korea) share the same suffix , which means \"country\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-116",
"text": "These suffix ngram features will result in these features for the NP in Figure 2: \u2022 suffix unigrams: \" \", \" \", \" \", \" \", \" \", \" \" \u2022 suffix bigrams: \"b-\", \" -\", \" -\", \" -\", \" -\", \" -\", \" -b\""
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-117",
"text": "Other than the suffix ngram, we also add three other lexical features: first, if the word before DE is a noun, we add a feature that is the conjunction of POS and suffix unigram."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-118",
"text": "Secondly, an \"NR only\" feature will fire when A only consists of one or more NRs."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-119",
"text": "Thirdly, we normalize different forms of \"percentage\" representation, and add a feature if they exist."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-120",
"text": "This includes words that start with \" \" or ends with the percentage sign \"%\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-121",
"text": "The first two features are inspired by the fact that a noun and its type can help decide \"B prep A\" versus \"A B\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-122",
"text": "Here we use the suffix of the noun and the NR (proper noun) tag to help capture its animacy, which is useful in choosing between the s-genitive (the boy's mother) and the of-genitive (the mother of the boy) in English (Rosenbach, 2003) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-123",
"text": "The third feature is added because many of the cases in the \"A preposition B\" class have a percentage number in A. We call these sets of features Lexical."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-124",
"text": "Together they provide 2.73% accuracy improvement over the previous setting."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-125",
"text": "SemClass: semantic class of words We also use a Chinese thesaurus, CiLin, to look up the semantic classes of the words in [A B] and use them as features."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-126",
"text": "CiLin is a Chinese thesaurus published in 1984 (Mei et al., 1984) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-127",
"text": "CiLin is organized in a conceptual hierarchy with five levels."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-128",
"text": "We use the level-1 tags which includes 12 categories."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-129",
"text": "6 This feature fires when a word we look up has one level-1 tag in CiLin."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-130",
"text": "This kind of feature is referred to as SemClass in Table 2 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-131",
"text": "For the example in Figure 2 , two words have a single level-1 tag: \" \"(most) has a level-1 tag K 7 and \" \"(investment) has a level-1 tag H 8 . \" \" and \""
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-132",
"text": "\" are not listed in CiLin, and \" \" has multiple entries."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-133",
"text": "Therefore, the SemClass features are: (i) before DE: \"K\"; (ii) after DE: \"H\" Topicality: re-occurrence of nouns The last feature we add is a Topicality feature, which is also useful for disambiguating s-genitive and of-genitive."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-134",
"text": "We approximate the feature by caching the nouns in the previous two sentences, and fire a topicality feature when the noun appears in the cache."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-135",
"text": "Take this NP in MT06 as an example:"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-136",
"text": "\" \""
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-137",
"text": "For this NP, all words before DE and after DE appeared in the previous sentence."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-138",
"text": "Therefore the topicality features \"cache-before-DE\" and \"cacheafter-DE\" both fire."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-139",
"text": "After all the feature engineering above, the best accuracy on the 5-class classifier we have is 75.4%, which maps into a 2-class accuracy of 86.9%."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-140",
"text": "Comparing the 2-class accuracy to the (Wang et al., 2007) baseline, we have a 10.9% absolute improvement."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-141",
"text": "The 5-class accuracy and confusion matrix is listed in Table 4 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-142",
"text": "\"A preposition B\" is a small category and is the most confusing. \"A 's B\" also has lower accuracy, and is mostly confused with \"B preposition A\"."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-143",
"text": "This could be due to the fact that there are some cases where the translation is correct both ways, but also could be because the features we added have not captured the difference well enough."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-145",
"text": "**MACHINE TRANSLATION EXPERIMENTS 4.1 EXPERIMENTAL SETTING**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-146",
"text": "For our MT experiments, we used a reimplementation of Moses (Koehn et al., 2003) , a state-of-the-art phrase-based system."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-147",
"text": "The alignment is done by the Berkeley word aligner (Liang et al., 2006 ) and then we symmetrized the word alignment using the grow-diag heuristic."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-148",
"text": "For features, we incorporate Moses' standard eight features as well as the lexicalized reordering model."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-149",
"text": "Parameter tuning is done with Minimum Error Rate Training (MERT) (Och, 2003) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-150",
"text": "The tuning set for MERT is the NIST MT06 data set, which includes 1664 sentences."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-151",
"text": "We evaluate the result with MT02 (878 sentences), MT03 (919 sentences), and MT05 (1082 sentences)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-152",
"text": "Our MT training corpus contains 1,560,071 sentence pairs from various parallel corpora from LDC."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-153",
"text": "9 There are 12,259,997 words on the English side."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-154",
"text": "Chinese word segmentation is done by the Stanford Chinese segmenter (Chang et al., 2008) ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-155",
"text": "After segmentation, there are 11,061,792 words on the Chinese side."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-156",
"text": "We use a 5-gram language model trained on the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) and also the English side of all the LDC parallel data permissible under the NIST08 rules."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-157",
"text": "Documents of Gigaword released during the epochs of MT02, MT03, MT05, and MT06 were removed."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-158",
"text": "To run the DE classifier, we also need to parse the Chinese texts."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-159",
"text": "We use the Stanford Chinese parser (Levy and Manning, 2003) to parse the Chinese side of the MT training data and the tuning and test sets."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-160",
"text": "9 LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2006E26, LDC2006E85, LDC2006E85, LDC2005T34, and LDC2005T34"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-162",
"text": "**BASELINE EXPERIMENTS**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-163",
"text": "We have two different settings as baseline experiments."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-164",
"text": "The first is without reordering or DE annotation on the Chinese side; we simply align the parallel texts, extract phrases and tune parameters."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-165",
"text": "This experiment is referred to as BASELINE."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-166",
"text": "Also, we reorder the training data, the tuning and the test sets with the NP rules in (Wang et al., 2007) and compare our results with this second baseline (WANG-NP)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-167",
"text": "The NP reordering preprocessing (WANG-NP) showed consistent improvement in Table 5 on all test sets, with BLEU point gains ranging from 0.15 to 0.40."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-168",
"text": "This confirms that having reordering around DEs in NP helps Chinese-English MT."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-170",
"text": "**EXPERIMENTS WITH 5-CLASS DE ANNOTATION**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-171",
"text": "We use the best setting of the DE classifier described in Section 3 to annotate DEs in NPs in the MT training data as well as the NIST tuning and test sets."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-172",
"text": "10 After this preprocessing, we restart the whole MT pipeline -align the preprocessed data, extract phrases, run MERT and evaluate."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-173",
"text": "This setting is referred to as DE-Annotated in Table 5 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-175",
"text": "**HIERARCHICAL PHRASE REORDERING MODEL**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-176",
"text": "To demonstrate that the technique presented here is effective even with a hierarchical decoder, we Table 5 : MT experiments of different settings on various NIST MT evaluation datasets."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-177",
"text": "We used both the BLEU and TER metrics for evaluation."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-178",
"text": "All differences between DE-Annotated and BASELINE are significant at the level of 0.05 with the approximate randomization test in (Riezler and Maxwell, 2005) conduct additional experiments with a hierarchical phrase reordering model introduced by )."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-179",
"text": "This shows that our preprocessing affects the majority of the sentences and thus it is not surprising that preprocessing based on the DE construction can make a significant difference."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-181",
"text": "**EXAMPLE: HOW DE ANNOTATION AFFECTS TRANSLATION**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-182",
"text": "Our approach DE-Annotated reorders the Chinese sentence, which is similar to the approach proposed by Wang et al. (2007) (WANG-NP)."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-183",
"text": "However, our focus is on the annotation on DEs and how this can improve translation quality."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-184",
"text": "Table 7 shows an example that contains a DE construction that translates into a relative clause in English."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-185",
"text": "12 The automatic parse tree of the sentence is listed in Figure 3 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-186",
"text": "The reordered sentences of WANG-NP and DE-Annotated appear on the top and bottom in Figure 4 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-187",
"text": "For this example, both systems decide to reorder, but DE-Annotated had the extra information that this is a relc ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-188",
"text": "In Figure 4 we can see that in WANG-NP, \" \" is being translated as \"for\", and the translation afterwards is not grammatically correct."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-189",
"text": "On the other hand, the bottom of Figure 4 shows that with the DE-Annotated preprocessing, now \" relc \" is translated into \"which was\" and well connected with the later translation."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-190",
"text": "This shows that disambiguating helps in choosing a better English translation."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-192",
"text": "Figure 3: The parse tree of the Chinese sentence in Table 7 . 12 In this example, all four references agreed on the relative clause translation."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-193",
"text": "Sometimes DE constructions have multiple appropriate translations, which is one of the reasons why certain classes are more confusable in Table 7 ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-194",
"text": "The bottom one is from DE-Annotated."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-195",
"text": "In this example, both systems reordered the NP, but DE-Annotated has an annotation on the ."
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-197",
"text": "**CONCLUSION**"
},
{
"sent_id": "3dbdf61d07a3e35ac1b6ecc7ab3999-C001-198",
"text": "In this paper, we presented a classification of Chinese (DE) constructions in NPs according to how they are translated into English."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-10",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-15",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-19"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-20",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-21",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-22",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-23",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-24",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-25",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-26"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-44",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-45"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-48",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-49",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-50"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-163",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-166",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-167",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-168"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-19",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-20",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-25",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-44",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-49",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-166"
]
},
"@MOT@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-20",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-21",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-22",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-23",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-24",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-25",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-26"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-20",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-25"
]
},
"@SIM@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-44",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-45"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-48",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-49",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-50"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-99"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-182",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-183",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-184"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-44",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-49",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-99",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-182"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-56"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-86",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-87"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-140"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-56",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-87",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-140"
]
},
"@EXT@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-63",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-64",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-65",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-66",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-67"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-63",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-65"
]
},
"@USE@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-89"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-99"
],
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-163",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-166",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-167",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-168"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-89",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-99",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-166"
]
},
"@DIF@": {
"gold_contexts": [
[
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-182",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-183",
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-184"
]
],
"cite_sentences": [
"3dbdf61d07a3e35ac1b6ecc7ab3999-C001-182"
]
}
}
},
"ABC_d785838888358a711fbf07c9dcf430_5": {
"x": [
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-2",
"text": "In this work we learn clusters of contextual annotations for non-terminals in the Penn Treebank."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-3",
"text": "Perhaps the best way to think about this problem is to contrast our work with that of Klein and Manning (2003) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-4",
"text": "That research used treetransformations to create various grammars with different contextual annotations on the non-terminals."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-5",
"text": "These grammars were then used in conjunction with a CKY parser."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-6",
"text": "The authors explored the space of different annotation combinations by hand."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-7",
"text": "Here we try to automate the process -to learn the \"right\" combination automatically."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-8",
"text": "Our results are not quite as good as those carefully created by hand, but they are close (84.8 vs 85.7)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-10",
"text": "**INTRODUCTION AND PREVIOUS RESEARCH**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-11",
"text": "It is by now commonplace knowledge that accurate syntactic parsing is not possible given only a context-free grammar with standard Penn Treebank (Marcus et al., 1993) labels (e.g., S, N P , etc.) (Charniak, 1996) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-12",
"text": "Instead researchers condition parsing decisions on many other features, such as parent phrase-marker, and, famously, the lexical-head of the phrase (Magerman, 1995; Collins, 1996; Collins, 1997; Johnson, 1998; Charniak, 2000; Henderson, 2003; Klein and Manning, 2003; Matsuzaki et al., 2005) (and others) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-13",
"text": "One particularly perspicuous way to view the use of extra conditioning information is that of tree-transformation (Johnson, 1998; Klein and Manning, 2003) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-14",
"text": "Rather than imagining the parser roaming around the tree for picking up the information it needs, we rather relabel the nodes to directly encode this information."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-15",
"text": "Thus rather than have the parser \"look\" to find out that, say, the parent of some N P is an S, we simply relabel the N P as an N P [S] ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-16",
"text": "This viewpoint is even more compelling if one does not intend to smooth the probabilities."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-17",
"text": "For example, consider p(N P \u2192 P RN | N P [S]) If we have no intention of backing off this probability to p(N P \u2192 P RN | N P ) we can treat N P [S] as an uninterpreted phrasal category and run all of the standard PCFG algorithms without change."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-18",
"text": "The result is a vastly simplified parser."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-19",
"text": "This is exactly what is done by Klein and Manning (2003) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-20",
"text": "Thus the \"phrasal categories\" of our title refer to these new, hybrid categories, such as N P [S] ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-21",
"text": "We hope to learn which of these categories work best given that they cannot be made too specific because that would create sparse data problems."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-22",
"text": "The Klein and Manning (2003) parser is an unlexicalized PCFG with various carefully selected context annotations."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-23",
"text": "Their model uses some parent annotations, and marks nodes which initiate or in certain cases conclude unary productions."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-24",
"text": "They also propose linguistically motivated annotations for several tags, including V P , IN , CC,N P and S. This results in a reasonably accurate unlexicalized PCFG parser."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-25",
"text": "The downside of this approach is that their features are very specific, applying different annotations to different treebank nonterminals."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-26",
"text": "For instance, they mark right-recursive N P s and not V P s (i.e., an N P which is the right-most child of another N P )."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-27",
"text": "This is because data sparsity issues preclude annotating the nodes in the treebank too liberally."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-28",
"text": "The goal of our work is to automate the process a bit, by annotating with more general features that apply broadly, and by learning clus-ters of these annotations."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-29",
"text": "Mohri and Roark (2006) tackle this problem by searching for what they call \"structural zeros\"or sets of events which are individually very likely, but are unlikely to coincide."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-30",
"text": "This is to be contrasted with sets of events that do not appear together simply because of sparse data."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-31",
"text": "They consider a variety of statistical tests to decide whether a joint event is a structural zero."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-32",
"text": "They mark the highest scoring nonterminals that are part of these joint events in the treebank, and use the resulting PCFG."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-33",
"text": "Coming to this problem from the standpoint of tree transformation, we naturally view our work as a descendent of Johnson (1998) and Klein and Manning (2003) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-34",
"text": "In retrospect, however, there are perhaps even greater similarities to that of (Magerman, 1995; Henderson, 2003; Matsuzaki et al., 2005) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-35",
"text": "Consider the approach of Matsuzaki et al. (2005) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-36",
"text": "They posit a series of latent annotations for each nonterminal, and learn a grammar using an EM algorithm similar to the inside-outside algorithm."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-37",
"text": "Their approach, however, requires the number of annotations to be specified ahead of time, and assigns the same number of annotations to each treebank nonterminal."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-38",
"text": "We would like to infer the number of annotations for each nonterminal automatically."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-39",
"text": "However, again in retrospect, it is in the work of Magerman (1995) that we see the greatest similarity."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-40",
"text": "Rather than talking about clustering nodes, as we do, Magerman creates a decision tree, but the differences between clustering and decision trees are small."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-41",
"text": "Perhaps a more substantial difference is that by not casting his problem as one of learning phrasal categories Magerman loses all of the free PCFG technology that we can leverage."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-42",
"text": "For instance, Magerman must use heuristic search to find his parses and incurs search errors because of it."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-43",
"text": "We use an efficient CKY algorithm to do exhaustive search in reasonable time."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-44",
"text": "Belz (2002) considers the problem in a manner more similar to our approach."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-45",
"text": "Beginning with both a non-annotated grammar and a parent annotated grammar, using a beam search they search the space of grammars which can be attained via merging nonterminals."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-46",
"text": "They guide the search using the performance on parsing (and several other tasks) of the grammar at each stage in the search."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-47",
"text": "In contrast, our approach explores the space of grammars by starting with few nonterminals and splitting them."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-48",
"text": "We also consider a much wider range of contextual information than just parent phrase-markers."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-50",
"text": "**BACKGROUND**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-51",
"text": "where V is a set of terminal symbols; M = {\u00b5 i } is a set of nonterminal symbols; \u00b5 0 is a start or root symbol; R is a set of productions of the form \u00b5 i \u2192 \u03c1, where \u03c1 is a sequence of terminals and nonterminals; and q is a family of probability distributions over rules conditioned on each rule's left-hand side."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-52",
"text": "As in (Johnson, 1998) and (Klein and Manning, 2003) , we annotate the Penn treebank nonterminals with various context information."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-53",
"text": "Suppose \u00b5 is a Treebank non-terminal."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-54",
"text": "Let \u03bb = \u00b5[\u03b1] denote the non-terminal category annotated with a vector of context features \u03b1."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-55",
"text": "A PCFG is derived from the trees in the usual manner, with production rules taken directly from the annotated trees, and the probability of an annotated rule q(\u03bb \u2192 \u03c1) = C(\u03bb\u2192\u03c1) C(\u03bb) where C(\u03bb \u2192 \u03c1) and C(\u03bb) are the number of observations of the production and its left hand side, respectively."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-56",
"text": "We refer to the grammar resulting from extracting annotated productions directly out of the treebank as the base grammar."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-57",
"text": "Our goal is to partition the set of annotated nonterminals into clusters \u03a6 = {\u03c6 i }."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-58",
"text": "Each possible clustering corresponds to a PCFG, with the set of non-terminals corresponding to the set of clusters."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-59",
"text": "The probability of a production under this PCFG is"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-60",
"text": "where \u03c6s \u2208 \u03a6 are clusters of annotated nonterminals and where:"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-61",
"text": "We refer to the PCFG of some clustering as the clustered grammar."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-63",
"text": "**FEATURES**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-64",
"text": "Most of the features we use are fairly standard."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-65",
"text": "These include the label of the parent and grandparent of a node, its lexical head, and the part of speech of the head."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-66",
"text": "Klein and Manning (2003) find marking nonterminals which have unary rewrites to be helpful."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-67",
"text": "They also find useful annotating two preterminals (DT ,RB) if they are the product of a unary production."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-68",
"text": "We generalize this via two width features: the first marking a node with the number of nonterminals to which it rewrites; the second marking each preterminal with the width of its parent."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-69",
"text": "Another feature is the span of a nonterminal, or the number of terminals it dominates, which we normalize by dividing by the length of the sentence."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-70",
"text": "Hence preterminals have normalized spans of 1/(length of the sentence), while the root has a normalized span of 1."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-71",
"text": "Extending on the notion of a Base NP, introduced by Collins (1996) , we mark any nonterminal that dominates only preterminals as Base."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-72",
"text": "Collins inserts a unary NP over any base NPs without NP parents."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-73",
"text": "However, Klein and Manning (2003) find that this hurts performance relative to just marking the NPs, and so our Base feature does not insert."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-74",
"text": "We have two features describing a node's position in the expansion of its parent."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-75",
"text": "The first, which we call the inside position, specifies the nonterminal's position relative to the heir of its parent's head, (to the left or right) or whether the nonterminal is the heir."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-76",
"text": "(By \"heir\" we mean the constituent donates its head, e.g. the heir of an S is typically the V P under the S.) The second feature, outside position, specifies the nonterminal's position relative to the boundary of the constituent: it is the leftmost child, the rightmost child, or neither."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-77",
"text": "Related to this, we further noticed that several of Klein & Manning's (2003) features, such as marking N P s as right recursive or possessive have the property of annotating with the label of the rightmost child (when they are NP and POS respectively)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-78",
"text": "We generalize this by marking all nodes both with their rightmost child and (an analogous feature) leftmost child."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-79",
"text": "We also mark whether or not a node borders the end of a sentence, save for ending punctuation."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-80",
"text": "(For instance, in this sentence, all the constituents with the second \"marked\" rightmost in their span would be marked)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-81",
"text": "Another Klein and Manning (2003) feature we try includes the temporal NP feature, where TMP markings in the treebank are retained, and propagated down the head inheritance path of the tree."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-82",
"text": "It is worth mentioning that all the features here come directly from the treebank."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-83",
"text": "For instance, the part of speech of the head feature has values only from the raw treebank tag set."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-84",
"text": "When a preterminal cluster is split, this assignment does not change the value of this feature."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-86",
"text": "**CLUSTERING**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-87",
"text": "The input to the clusterer is a set of annotated grammar productions and counts."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-88",
"text": "Our clustering algorithm is a divisive one reminiscent of (Martin et al., 1995) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-89",
"text": "We start with a single cluster for each Treebank nonterminal and one additional cluster for intermediate nodes, which are described in section 3.2."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-90",
"text": "The clustering method has two interleaved parts: one in which candidate splits are generated, and one in which we choose a candidate split to enact."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-91",
"text": "For each of the initial clusters, we generate a candidate split, and place that split in a priority queue."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-92",
"text": "The priority queue is ordered by the Bayesian Information Criterion (BIC), e.g. (Hastie et al., 2003) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-93",
"text": "The BIC of a model M is defined as -2*(log likelihood of the data according to M ) +d M *(log number of observations)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-94",
"text": "d M is the number of degrees of freedom in the model, which for a PCFG is the number of productions minus the number of nonterminals."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-95",
"text": "Thus in this context BIC can be thought of as optimizing the likelihood, but with a penalty against grammars with many rules."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-96",
"text": "While the queue is nonempty, we remove a candidate split to reevaluate."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-97",
"text": "Reevaluation is necessary because, if there is a delay between when a split is proposed and when a split is enacted, the grammar used to score the split will have changed."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-98",
"text": "However, we suppose that the old score is close enough to be a reasonable ordering measure for the priority queue."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-99",
"text": "If the reevaluated candidate is no longer better than the second candidate on the queue, we reinsert it and continue."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-100",
"text": "However, if it is still the best on the queue, and it improves the model, we enact the split; otherwise it is discarded."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-101",
"text": "When a split is enacted, the old cluster is removed from the set of nonterminals, and is replaced with the two new nonterminals of the split."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-102",
"text": "A candidate split for each of the two new clusters is generated, and placed on the priority queue."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-103",
"text": "This process of reevaluation, enacting splits, and generating new candidates continues until the priority queue is empty of potential splits."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-104",
"text": "We select a candidate split of a particular cluster as follows."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-105",
"text": "a potential nominee split."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-106",
"text": "To do this we first partition randomly the values for the feature into two buckets."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-107",
"text": "We then repeatedly try to move values from one bucket to the other."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-108",
"text": "If doing so results in an improvement to the likelihood of the training data, we keep the change, otherwise we reject it."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-109",
"text": "The swapping continues until moving no individual value results in an improvement in likelihood."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-110",
"text": "Suppose we have a grammar derived from a corpus of a single tree, whose nodes have been annotated with their parent as in Figure 1 ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-111",
"text": "The base productions for this corpus are:"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-112",
"text": "Suppose we are in the initial state, with a single cluster for each treebank nonterminal."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-113",
"text": "Consider a potential split of the N P cluster on the parent feature, which in this example has three values: S, V P , and N P ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-114",
"text": "If the S and V P values are grouped together in the left bucket, and the N P value is alone in the right bucket, we get cluster nonterminals"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-115",
"text": "The resulting grammar rules and their probabilities are:"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-116",
"text": "If however, V P is swapped to the right bucket with N P , the rules become:"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-117",
"text": "The likelihood of the tree in Figure 1 is 1/4 under the first grammar, but only 4/27 under the second."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-118",
"text": "Hence in this case we would reject the swap of V P from the right to the left buckets."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-119",
"text": "The process of swapping continues until no improvement can be made by swapping a single value."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-120",
"text": "The likelihood of the training data according to the clustered grammar is r\u2208R p(r) C(r) for R the set of observed productions r = \u03c6 i \u2192 \u03c6 j . . . in the clustered grammar."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-121",
"text": "Notice that when we are looking to split a cluster \u03c6, only productions that contain the nonterminal \u03c6 will have probabilities that change."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-122",
"text": "To evaluate whether a change increases the likelihood, we consider the ratio between the likelihood of the new model, and the likelihood of the old model."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-123",
"text": "Furthermore, when we move a value from one bucket to another, only a fraction of the rules will have their counts change."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-124",
"text": "Suppose we are moving value x from the left bucket to the right when splitting \u03c6 i . Let \u03c6 x \u2286 \u03c6 i be the set of base nonterminals in \u03c6 i that have value x for the feature being split upon."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-125",
"text": "Only clustered rules that contain base grammar rules which use nonterminals in \u03c6 x will have their probability change."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-126",
"text": "These observations allow us to process only a relatively small number of base grammar rules."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-127",
"text": "Once we have generated a potential nominee split for each feature, we select the partitioning which leads to the greatest improvement in the BIC as the candidate split of this cluster."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-128",
"text": "This candidate is placed on the priority queue."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-129",
"text": "One odd thing about the above is that in the local search phase of the clustering we use likelihood, while in the candidate selection phase we use BIC."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-130",
"text": "We tried both measures in each phase, but found that this hybrid measure outperformed using only one or the other."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-132",
"text": "**MODEL SELECTION**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-133",
"text": "Unfortunately, the grammar that results at the end of the clustering process seems to overfit the training data."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-134",
"text": "We resolve this by simply noting periodically the intermediate state of the grammar, and using this grammar to parse a small tuning set (we use the first 400 sentences of WSJ section 24, and parse this every 50 times we enact a split)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-135",
"text": "At the conclusion of clustering, we select the grammar with the highest f-score on this tuning set as the final model."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-137",
"text": "**BINARIZATION**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-138",
"text": "Since our experiments make use of a CKY (Kasami, 1965 ) parser 1 we must modify the treebank derived rules so that each expands to at most two labels."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-139",
"text": "We perform this in a manner similar to Klein and Manning (2003) and Matsuzaki et al. (2005) Our mechanism lays out the unmarkovized intermediate rules in the same way, but we mostly use our clustering scheme to reduce sparsity."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-140",
"text": "We do so by aligning the labels contained in the intermediate nodes in the order in which they would be added when increasing the markovization hori-1 The implementation we use was created by Mark Johnson and used for the research in (Johnson, 1998) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-141",
"text": "It is available at his homepage."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-142",
"text": "zon from zero to three."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-143",
"text": "We also always keep the heir label as a feature, following Klein and Manning (2003 (D, F, E, D, \u2212) , where the first item is the heir of the parent's head."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-144",
"text": "The \"-\" indicates that the fourth item to be expanded is here non-existent."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-145",
"text": "The clusterer would consider each of these five features as for a single possible split."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-146",
"text": "We also incorporate our other features into the intermediate nodes in two ways."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-147",
"text": "Some features, such as the parent or grandparent, will be the same for all the labels in the intermediate node, and hence only need to be included once."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-148",
"text": "Others, such as the part of speech of the head, may be different for each label."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-149",
"text": "These features we align with those of corresponding label in the Markov ordering."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-150",
"text": "In our running example, suppose each child node N has part of speech of its head P N , and we have a parent feature."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-151",
"text": "Our aligned intermediate feature vectors then become (A, D, C, P C , F, P F , E, P E , D, P D ) and (A, D, F, P F , E, P E , D, P D , \u2212, \u2212)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-152",
"text": "As these are somewhat complicated, let us explain them by unpacking the first, the vector for [C D EF ]."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-153",
"text": "Consulting Figure 2 we see that its parent is A. We have chosen to put parents first in the vector, thus explaining (A, ...)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-154",
"text": "Next comes the heir of the constituent, D. This is followed by the first constituent that is to be unpacked from the binarized version, C, which in turn is followed by its head part-of-speech P C , giving us (A, D, C, P C , ...)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-155",
"text": "We follow with the next non-terminal to be unpacked from the binarized node and its head partof-speech, etc."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-156",
"text": "It might be fairly objected that this formulation of binarization loses the information of whether a label is to the left, right, or is the heir of the parent's head."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-157",
"text": "This is solved by the inside position feature, described in Section 2.1 which contains exactly this information."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-159",
"text": "**SMOOTHING**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-160",
"text": "In order to ease comparison between our work and that of Klein and Manning (2003) , we follow their lead in smoothing no production probabilities save those going from preterminal to nonterminal."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-161",
"text": "Our smoothing mechanism runs roughly along the lines of theirs."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-162",
"text": "Table 2 : Parsing results for grammars generated using clusterer with different random seeds."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-163",
"text": "All numbers here are on the development test set (Section 22)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-164",
"text": "Preterminal rules are smoothed as follows."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-165",
"text": "We consider several classes of unknown words, based on capitalization, the presence of digits or hyphens, and the suffix."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-166",
"text": "We estimate the probability of a tag T given a word (or unknown class)"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-167",
"text": ", where p(T | unk) = C(T, unk)/C(unk) is the probability of the tag given any unknown word class."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-168",
"text": "In order to estimate counts of unknown classes,we let the clusterer see every tree twice: once unmodified, and once with the unknown class replacing each word seen less than five times."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-169",
"text": "The production probability p(W | T ) is then p(T | W )p(W )/p(T ) where p(W ) and p(T ) are the respective empirical distributions."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-170",
"text": "The clusterer does not use smoothed probabilities in allocating annotated preterminals to clusters, but simply the maximum likelihood estimates as it does elsewhere."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-171",
"text": "Smoothing is only used in the parser."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-173",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-174",
"text": "We trained our model on sections 2-21 of the Penn Wall Street Journal Treebank."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-175",
"text": "We used the first 400 sentences of section 24 for model selection."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-176",
"text": "Section 22 was used for testing during development, while section 23 was used for the final evaluation."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-178",
"text": "**DISCUSSION**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-179",
"text": "Our results are shown in Table 1 ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-180",
"text": "The first three columns show the labeled precision, recall and fmeasure, respectively."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-181",
"text": "The remaining two show the number of crossing brackets per sentence, and the percentage of sentences with no crossing brackets."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-182",
"text": "Unfortunately, our model does not perform quite as well as those of Klein and Manning (2003) or Matsuzaki et al. (2005) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-183",
"text": "It is worth noting that Matsuzaki's grammar uses a different parse evaluation scheme than Klein & Manning or we do."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-184",
"text": "We select the parse with the highest probability according to the annotated grammar."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-185",
"text": "Matsuzaki, on the other hand, argues that the proper thing to do is to find the most likely unannotated parse."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-186",
"text": "The probability of this parse is the sum over the probabilities of all annotated parses that reduce to that unannotated parse."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-187",
"text": "Since calculating the parse that maximizes this quantity is NP hard, they try several approximations."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-188",
"text": "One is what Klein & Manning and we do."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-189",
"text": "However, they have a better performing approximation which is used in their reported score."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-190",
"text": "They do not report their score on section 23 using the most-probable-annotatedparse method."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-191",
"text": "They do however compare the performance of different methods using development data, and find that their better approximation gives an absolute improvement in f-measure in the .5-1 percent range."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-192",
"text": "Hence it is probable that even with their better method our grammar would not outperform theirs."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-193",
"text": "Table 2 shows the results on the development test set (Section 22) for four different initial random seeds."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-194",
"text": "Recall that when splitting a cluster, the initial partition of the base grammar nonterminals is made randomly."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-195",
"text": "The model from the second run was used for parsing the final test set (Section 23) in Table 1 ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-196",
"text": "One interesting thing our method allows is for us to examine which features turn out to be useful in which contexts."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-197",
"text": "We noted for each trereebank nonterminal, and for each feature, how many times that nonterminal was split on that feature, for the grammar selected in the model selection stage."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-198",
"text": "We ran the clustering with these four different random seeds."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-199",
"text": "We find that in particular, the clusterer only found the head feature to be useful in very specific circumstances."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-200",
"text": "It was used quite a bit to split preterminals; but for phrasals it was only used to split ADJP ,ADV P ,N P ,P P ,V P ,QP , and SBAR."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-201",
"text": "The part of speech of the head was only used to split N P and V P ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-202",
"text": "Furthermore, the grandparent tag appears to be of importance primarily for V P and P P nonter-minals, though it is used once out of the four runs for N P s."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-203",
"text": "This indicates that perhaps lexical parsers might be able to make do by only using lexical head and grandparent information in very specific instances, thereby shrinking the sizes of their models, and speeding parsing."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-204",
"text": "This warrants further investigation."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-206",
"text": "**CONCLUSION**"
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-207",
"text": "We have presented a scheme for automatically discovering phrasal categories for parsing with a standard CKY parser."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-208",
"text": "The parser achieves 84.8% precision-recall f-measure on the standard testsection of the Penn WSJ-Treebank (section 23)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-209",
"text": "While this is not as accurate as the hand-tailored grammar of Klein and Manning (2003) , it is close, and we believe there is room for improvement."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-210",
"text": "For starters, the particular clustering scheme is only one of many."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-211",
"text": "Our algorithm splits clusters along particular features (e.g., parent, headpart-of-speech, etc.)."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-212",
"text": "One alternative would be to cluster simultaneously on all the features."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-213",
"text": "It is not obvious which scheme should be better, and they could be quite different."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-214",
"text": "Decisions like this abound, and are worth exploring."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-215",
"text": "More radically, it is also possible to grow many decision trees, and thus many alternative grammars."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-216",
"text": "We have been impressed by the success of random-forest methods in language modeling (Xu and Jelinek, 2004) ."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-217",
"text": "In these methods many trees (the forest) are grown, each trying to predict the next word."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-218",
"text": "The multiple trees together are much more powerful than any one individually."
},
{
"sent_id": "d785838888358a711fbf07c9dcf430-C001-219",
"text": "The same might be true for grammars."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-2",
"d785838888358a711fbf07c9dcf430-C001-3",
"d785838888358a711fbf07c9dcf430-C001-4",
"d785838888358a711fbf07c9dcf430-C001-5",
"d785838888358a711fbf07c9dcf430-C001-6",
"d785838888358a711fbf07c9dcf430-C001-7",
"d785838888358a711fbf07c9dcf430-C001-8"
],
[
"d785838888358a711fbf07c9dcf430-C001-12"
],
[
"d785838888358a711fbf07c9dcf430-C001-13",
"d785838888358a711fbf07c9dcf430-C001-14"
],
[
"d785838888358a711fbf07c9dcf430-C001-15",
"d785838888358a711fbf07c9dcf430-C001-16",
"d785838888358a711fbf07c9dcf430-C001-17",
"d785838888358a711fbf07c9dcf430-C001-18",
"d785838888358a711fbf07c9dcf430-C001-19"
],
[
"d785838888358a711fbf07c9dcf430-C001-22",
"d785838888358a711fbf07c9dcf430-C001-23",
"d785838888358a711fbf07c9dcf430-C001-24",
"d785838888358a711fbf07c9dcf430-C001-25",
"d785838888358a711fbf07c9dcf430-C001-26",
"d785838888358a711fbf07c9dcf430-C001-27",
"d785838888358a711fbf07c9dcf430-C001-28"
],
[
"d785838888358a711fbf07c9dcf430-C001-33"
],
[
"d785838888358a711fbf07c9dcf430-C001-77",
"d785838888358a711fbf07c9dcf430-C001-78",
"d785838888358a711fbf07c9dcf430-C001-79",
"d785838888358a711fbf07c9dcf430-C001-80"
],
[
"d785838888358a711fbf07c9dcf430-C001-81",
"d785838888358a711fbf07c9dcf430-C001-82",
"d785838888358a711fbf07c9dcf430-C001-83",
"d785838888358a711fbf07c9dcf430-C001-84"
],
[
"d785838888358a711fbf07c9dcf430-C001-143",
"d785838888358a711fbf07c9dcf430-C001-144",
"d785838888358a711fbf07c9dcf430-C001-145"
],
[
"d785838888358a711fbf07c9dcf430-C001-207",
"d785838888358a711fbf07c9dcf430-C001-208",
"d785838888358a711fbf07c9dcf430-C001-209"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-3",
"d785838888358a711fbf07c9dcf430-C001-12",
"d785838888358a711fbf07c9dcf430-C001-13",
"d785838888358a711fbf07c9dcf430-C001-19",
"d785838888358a711fbf07c9dcf430-C001-22",
"d785838888358a711fbf07c9dcf430-C001-33",
"d785838888358a711fbf07c9dcf430-C001-77",
"d785838888358a711fbf07c9dcf430-C001-81",
"d785838888358a711fbf07c9dcf430-C001-143",
"d785838888358a711fbf07c9dcf430-C001-209"
]
},
"@DIF@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-2",
"d785838888358a711fbf07c9dcf430-C001-3",
"d785838888358a711fbf07c9dcf430-C001-4",
"d785838888358a711fbf07c9dcf430-C001-5",
"d785838888358a711fbf07c9dcf430-C001-6",
"d785838888358a711fbf07c9dcf430-C001-7",
"d785838888358a711fbf07c9dcf430-C001-8"
],
[
"d785838888358a711fbf07c9dcf430-C001-13",
"d785838888358a711fbf07c9dcf430-C001-14"
],
[
"d785838888358a711fbf07c9dcf430-C001-22",
"d785838888358a711fbf07c9dcf430-C001-23",
"d785838888358a711fbf07c9dcf430-C001-24",
"d785838888358a711fbf07c9dcf430-C001-25",
"d785838888358a711fbf07c9dcf430-C001-26",
"d785838888358a711fbf07c9dcf430-C001-27",
"d785838888358a711fbf07c9dcf430-C001-28"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-3",
"d785838888358a711fbf07c9dcf430-C001-13",
"d785838888358a711fbf07c9dcf430-C001-22"
]
},
"@MOT@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-15",
"d785838888358a711fbf07c9dcf430-C001-16",
"d785838888358a711fbf07c9dcf430-C001-17",
"d785838888358a711fbf07c9dcf430-C001-18",
"d785838888358a711fbf07c9dcf430-C001-19"
],
[
"d785838888358a711fbf07c9dcf430-C001-22",
"d785838888358a711fbf07c9dcf430-C001-23",
"d785838888358a711fbf07c9dcf430-C001-24",
"d785838888358a711fbf07c9dcf430-C001-25",
"d785838888358a711fbf07c9dcf430-C001-26",
"d785838888358a711fbf07c9dcf430-C001-27",
"d785838888358a711fbf07c9dcf430-C001-28"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-19",
"d785838888358a711fbf07c9dcf430-C001-22"
]
},
"@SIM@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-33"
],
[
"d785838888358a711fbf07c9dcf430-C001-73"
],
[
"d785838888358a711fbf07c9dcf430-C001-139",
"d785838888358a711fbf07c9dcf430-C001-140"
],
[
"d785838888358a711fbf07c9dcf430-C001-160",
"d785838888358a711fbf07c9dcf430-C001-161"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-33",
"d785838888358a711fbf07c9dcf430-C001-73",
"d785838888358a711fbf07c9dcf430-C001-139",
"d785838888358a711fbf07c9dcf430-C001-160"
]
},
"@USE@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-52"
],
[
"d785838888358a711fbf07c9dcf430-C001-81",
"d785838888358a711fbf07c9dcf430-C001-82",
"d785838888358a711fbf07c9dcf430-C001-83",
"d785838888358a711fbf07c9dcf430-C001-84"
],
[
"d785838888358a711fbf07c9dcf430-C001-143",
"d785838888358a711fbf07c9dcf430-C001-144",
"d785838888358a711fbf07c9dcf430-C001-145"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-52",
"d785838888358a711fbf07c9dcf430-C001-81",
"d785838888358a711fbf07c9dcf430-C001-143"
]
},
"@EXT@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-77",
"d785838888358a711fbf07c9dcf430-C001-78",
"d785838888358a711fbf07c9dcf430-C001-79",
"d785838888358a711fbf07c9dcf430-C001-80"
],
[
"d785838888358a711fbf07c9dcf430-C001-139",
"d785838888358a711fbf07c9dcf430-C001-140"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-77",
"d785838888358a711fbf07c9dcf430-C001-139"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"d785838888358a711fbf07c9dcf430-C001-182"
]
],
"cite_sentences": [
"d785838888358a711fbf07c9dcf430-C001-182"
]
}
}
},
"ABC_0a7710557d020087035f4a94b5661c_5": {
"x": [
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-99",
"text": "Figure 4 : First four steps taken by E-GNPPA."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-41",
"text": "Due to this, they report that the parser suffers from local optimization during training."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-68",
"text": "The parser does a greedy search over all the possible relations and picks the one with the highest score at each stage."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-69",
"text": "This process is repeated until parents for all the nodes that do not belong to R are chosen."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-70",
"text": "Algorithm 1 lists the outline of the greedy nondirectional partial parsing algorithm (GNPPA)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-71",
"text": "builtPPs maintains a list of all the partial parses that have been built."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-67",
"text": "Given the sentence W and the set of unconnected nodes R, the parser follows a non-directional greedy approach to establish relations in a bottom up manner."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-2",
"text": "Recent work has shown how a parallel corpus can be leveraged to build syntactic parser for a target language by projecting automatic source parse onto the target sentence using word alignments."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-3",
"text": "The projected target dependency parses are not always fully connected to be useful for training traditional dependency parsers."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-4",
"text": "In this paper, we present a greedy non-directional parsing algorithm which doesn't need a fully connected parse and can learn from partial parses by utilizing available structural and syntactic information in them."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-5",
"text": "Our parser achieved statistically significant improvements over a baseline system that trains on only fully connected parses for Bulgarian, Spanish and Hindi."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-6",
"text": "It also gave a significant improvement over previously reported results for Bulgarian and set a benchmark for Hindi."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-9",
"text": "Parallel corpora have been used to transfer information from source to target languages for Part-Of-Speech (POS) tagging, word sense disambiguation (Yarowsky et al., 2001) , syntactic parsing (Hwa et al., 2005; Ganchev et al., 2009; Jiang and Liu, 2010) and machine translation (Koehn, 2005; Tiedemann, 2002) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-10",
"text": "Analysis on the source sentences was induced onto the target sentence via projections across word aligned parallel corpora."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-11",
"text": "Equipped with a source language parser and a word alignment tool, parallel data can be used to build an automatic treebank for a target language."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-12",
"text": "The parse trees given by the parser on the source sentences in the parallel data are projected onto the target sentence using the word alignments from the alignment tool."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-13",
"text": "Due to the usage of automatic source parses, automatic word alignments and differences in the annotation schemes of source and target languages, the projected parses are not always fully connected and can have edges missing (Hwa et al., 2005; Ganchev et al., 2009 )."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-14",
"text": "Nonliteral translations and divergences in the syntax of the two languages also lead to incomplete projected parse trees."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-72",
"text": "It is initialized in line 1 by considering each word as a separate partial parse with just one node."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-15",
"text": "Figure 1 shows an English-Hindi parallel sentence with correct source parse, alignments and target dependency parse."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-16",
"text": "For the same sentence, Figure 2 is a sample partial dependency parse projected using an automatic source parser on aligned text."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-17",
"text": "This parse is not fully connected with the words banaa, kottaige and dikhataa left without any parents."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-18",
"text": "To train the traditional dependency parsers (Yamada and Matsumoto, 2003; Eisner, 1996; Nivre, 2003) , the dependency parse has to satisfy four constraints: connectedness, single-headedness, acyclicity and projectivity (Kuhlmann and Nivre, 2006) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-19",
"text": "Projectivity can be relaxed in some parsers (McDonald et al., 2005; Nivre, 2009) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-20",
"text": "But these parsers can not directly be used to learn from partially connected parses (Hwa et al., 2005; Ganchev et al., 2009 )."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-21",
"text": "In the projected Hindi treebank (section 4) that was extracted from English-Hindi parallel text, only 5.9% of the sentences had full trees."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-22",
"text": "In Spanish and Bulgarian projected data extracted by Ganchev et al. (2009) In this paper, we present a dependency parsing algorithm which can train on partial projected parses and can take rich syntactic information as features for learning."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-23",
"text": "The parsing algorithm constructs the partial parses in a bottom-up manner by performing a greedy search over all possible relations and choosing the best one at each step without following either left-to-right or right-to-left traversal."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-24",
"text": "The algorithm is inspired by earlier nondirectional parsing works of Shen and Joshi (2008) and Goldberg and Elhadad (2010) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-25",
"text": "We also propose an extended partial parsing algorithm that can learn from partial parses whose yields are partially contiguous."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-26",
"text": "Apart from bitext projections, this work can be extended to other cases where learning from partial structures is required."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-27",
"text": "For example, while bootstrapping parsers high confidence parses are extracted and trained upon (Steedman et al., 2003; Reichart and Rappoport, 2007) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-28",
"text": "In cases where these parses are few, learning from partial parses might be beneficial."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-29",
"text": "We train our parser on projected Hindi, Bulgarian and Spanish treebanks and show statistically significant improvements in accuracies between training on fully connected trees and learning from partial parses."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-31",
"text": "**RELATED WORK**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-32",
"text": "Learning from partial parses has been dealt in different ways in the literature."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-33",
"text": "Hwa et al. (2005) used post-projection completion/transformation rules to get full parse trees from the projections and train Collin's parser (Collins, 1999) on them."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-34",
"text": "Ganchev et al. (2009) handle partial projected parses by avoiding committing to entire projected tree during training."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-35",
"text": "The posterior regularization based framework constrains the projected syntactic relations to hold approximately and only in expectation."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-36",
"text": "Jiang and Liu (2010) refer to alignment matrix and a dynamic programming search algorithm to obtain better projected dependency trees."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-37",
"text": "They deal with partial projections by breaking down the projected parse into a set of edges and training on the set of projected relations rather than on trees."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-38",
"text": "While Hwa et al. (2005) requires full projected parses to train their parser, Ganchev et al. (2009) and Jiang and Liu (2010) can learn from partially projected trees."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-39",
"text": "However, the discriminative training in (Ganchev et al., 2009 ) doesn't allow for richer syntactic context and it doesn't learn from all the relations in the partial dependency parse."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-40",
"text": "By treating each relation in the projected dependency data independently as a classification instance for parsing, Jiang and Liu (2010) sacrifice the context of the relations such as global structural context, neighboring relations that are crucial for dependency analysis."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-42",
"text": "The parser proposed in this work (section 3) learns from partial trees by using the available structural information in it and also in neighboring partial parses."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-43",
"text": "We evaluated our system (section 5) on Bulgarian and Spanish projected dependency data used in (Ganchev et al., 2009 ) for comparison."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-44",
"text": "The same could not be carried out for Chinese (which was the language (Jiang and Liu, 2010 ) worked on) due to the unavailability of projected data used in their work."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-45",
"text": "Comparison with the traditional dependency parsers (McDonald et al., 2005; Yamada and Matsumoto, 2003; Nivre, 2003; Goldberg and Elhadad, 2010) which train on complete dependency parsers is out of the scope of this work."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-47",
"text": "**PARTIAL PARSING**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-48",
"text": "A standard dependency graph satisfies four graph constraints: connectedness, single-headedness, acyclicity and projectivity (Kuhlmann and Nivre, 2006) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-49",
"text": "In our work, we assume the dependency graph for a sentence only satisfies the single- Given a sentence W =w 0 \u00b7 \u00b7 \u00b7 w n with a set of directed arcs A on the words in W , w i \u2192 w j denotes a dependency arc from w i to w j , (w i ,w j ) A. w i is the parent in the arc and w j is the child in the arc."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-50",
"text": "* \u2212 \u2192 denotes the reflexive and transitive closure of the arc."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-51",
"text": "w i * \u2212 \u2192 w j says that w i dominates w j , i.e. there is (possibly empty) path from w i to w j ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-52",
"text": "A node w i is unconnected if it does not have an incoming arc."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-53",
"text": "R is the set of all such unconnected nodes in the dependency graph."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-54",
"text": "For the example in Figure 2 , R={banaa, kottaige, dikhataa}. A partial parse rooted at node w i denoted by \u03c1(w i ) is the set of arcs that can be traversed from node w i ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-55",
"text": "The yield of a partial parse \u03c1(w i ) is the set of nodes dominated by it."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-56",
"text": "We use \u03c0(w i ) to refer to the yield of \u03c1(w i ) arranged in the linear order of their occurrence in the sentence."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-57",
"text": "The span of the partial tree is the first and last words in its yield."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-58",
"text": "The dependency graph D can now be represented in terms of partial parses by D = (W, R, (R)) where W ={w 0 \u00b7 \u00b7 \u00b7 w n } is the sentence, R={r 1 \u00b7 \u00b7 \u00b7 r m } is the set of unconnected nodes and (R)= {\u03c1(r 1 ) \u00b7 \u00b7 \u00b7 \u03c1(r m )} is the set of partial parses rooted at these unconnected nodes."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-59",
"text": "w 0 is a dummy word added at the beginning of W to behave as a root of a fully connected parse."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-60",
"text": "A fully connected dependency graph would have only one element w 0 in R and the dependency graph rooted at w 0 as the only (fully connected) parse in (R)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-61",
"text": "We assume the combined yield of (R) spans the entire sentence and each of the partial parses in (R) to be contiguous and non-overlapping with one another."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-62",
"text": "A partial parse is contiguous if its yield is contiguous i.e. if a node w j \u03c0(w i ), then all the words between w i and w j also belong to \u03c0(w i )."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-63",
"text": "A partial parse \u03c1(w i ) is non-overlapping if the intersection of its yield \u03c0(w i ) with yields of all other partial parses is empty."
},
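The contiguity and non-overlap constraints above lend themselves to a short executable check. This is a minimal sketch, not the paper's code, assuming a partial parse is represented simply by the set of word indices in its yield π:

```python
# Sketch (an assumption, not the paper's implementation): a yield is
# modeled as a set of word indices.

def is_contiguous(yield_set):
    """A yield is contiguous if it contains every index between its
    minimum and maximum element."""
    return len(yield_set) == max(yield_set) - min(yield_set) + 1

def are_non_overlapping(yields):
    """Partial parses are non-overlapping if no word index appears in
    two different yields."""
    seen = set()
    for y in yields:
        if seen & y:
            return False
        seen |= y
    return True

# Example: {1, 2, 3} is contiguous; {1, 3} is not (index 2 is missing).
assert is_contiguous({1, 2, 3})
assert not is_contiguous({1, 3})
assert are_non_overlapping([{0}, {1, 2, 3}, {4, 5}])
assert not are_non_overlapping([{1, 2}, {2, 3}])
```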
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-65",
"text": "**GREEDY NON-DIRECTIONAL PARTIAL PARSING**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-66",
"text": "Algorithm (GNPPA)"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-73",
"text": "candidateArcs stores all the arcs that are possible at each stage of the parsing process in a bottom up strategy."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-74",
"text": "It is initialized in line 2 using the method initCandidateArcs(w 0 \u00b7 \u00b7 \u00b7 w n )."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-75",
"text": "initCandidateArcs(w 0 \u00b7 \u00b7 \u00b7 w n ) adds two candidate arcs for each pair of consecutive words with each other as parent (see Figure 3b) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-76",
"text": "If an arc has one of the nodes in R as the child, it isn't included in candidateArcs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-78",
"text": "**ALGORITHM 1 PARTIAL PARSING ALGORITHM**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-79",
"text": "Input: sentence w0 \u00b7 \u00b7 \u00b7 wn and set of partial tree roots unConn={r1 \u00b7 \u00b7 \u00b7 rm} Output: set of partial parses whose roots are in unConn"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-80",
"text": "builtPPs.remove(bestArc.parent) 7:"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-81",
"text": "builtPPs.add(bestArc) 8:"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-82",
"text": "updateCandidateArcs(bestArc, candidateArcs, builtPPs, unConn) 9: end while 10: return builtPPs Once initialized, the candidate arc with the highest score (line 4) is chosen and accepted into builtPPs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-83",
"text": "This involves replacing the best arc's child partial parse \u03c1(arc.child) and parent partial parse \u03c1(arc.parent) over which the arc has been formed with the arc \u03c1(arc.parent) \u2192 \u03c1(arc.child) itself in builtPPs (lines 5-7)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-84",
"text": "In Figure 3f , to accept the best candidate arc \u03c1(banaa) \u2192 \u03c1(pahaada), the parser would remove the nodes \u03c1(banaa) and \u03c1(pahaada) in builtPPs and add \u03c1(banaa) \u2192 \u03c1(pahaada) to builtPPs (see Figure 3g) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-85",
"text": "After the best arc is accepted, the candidateArcs has to be updated (line 8) to remove the arcs that are no longer valid and add new arcs in the context of the updated builtPPs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-86",
"text": "Algorithm 2 shows the update procedure."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-87",
"text": "First, all the arcs that end on the child are removed (lines 3-7) along with the arc from child to parent."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-88",
"text": "Then, the immediately previous and next partial parses of the best arc in builtPPs are retrieved (lines 8-9) to add possible candidate arcs between them and the partial parse representing the best arc (lines 10-23)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-89",
"text": "In the example, between Figures 3b and 3c, the arcs \u03c1(kottaige) \u2192 \u03c1(bahuta) and \u03c1(bahuta) \u2192 \u03c1(sundara) are first removed and the arc \u03c1(kottaige) \u2192 \u03c1(sundara) is added to candidateArcs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-90",
"text": "Care is taken to avoid adding arcs that end on unconnected nodes listed in R."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-91",
"text": "The entire GNPPA parsing process for the example sentence in Figure 2 is shown in Figure 3 ."
},
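The greedy non-directional loop just described can be sketched in a few lines. This is an illustrative reimplementation under assumed data structures (builtPPs as an ordered list of partial-parse roots, a caller-supplied arc scorer); the paper's feature-based scoring and bookkeeping are richer:

```python
# Sketch of the GNPPA loop: candidate arcs exist only between adjacent
# partial parses, never end on a designated unconnected node, and the
# highest-scoring arc is greedily accepted at each step.

def gnppa(words, unconn, score):
    """words: tokens (index 0 is the dummy root); unconn: indices that
    must stay unconnected (partial-parse roots); score: function
    (parent_idx, child_idx) -> float. Returns head[i] per word."""
    built = list(range(len(words)))   # each word starts as its own partial parse
    head = {i: None for i in range(len(words))}

    def candidates():
        for a, b in zip(built, built[1:]):
            for p, c in ((a, b), (b, a)):
                if c not in unconn:   # arcs may not end on unconnected nodes
                    yield (p, c)

    while True:
        cands = list(candidates())
        if not cands:
            break
        parent, child = max(cands, key=lambda arc: score(*arc))
        head[child] = parent          # accept the best arc
        built.remove(child)           # child's parse is absorbed by the parent's
    return head

# Toy run with a stand-in scorer that prefers left-to-right arcs.
heads = gnppa(["ROOT", "a", "b", "c"], unconn={0}, score=lambda p, c: c - p)
```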
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-92",
"text": "Algorithm 2 updateCandidateArcs(bestArc, candidateArcs, builtPPs, unConn)"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-94",
"text": "**LEARNING**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-95",
"text": "The algorithm described in the previous section uses a weight vector \u2212 \u2192 w to compute the best arc from the list of candidate arcs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-96",
"text": "This weight vector is learned using a simple Perceptron like algorithm similar to the one used in (Shen and Joshi, 2008) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-97",
"text": "Algorithm 3 lists the learning framework for GNPPA."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-98",
"text": "For a training sample with sentence w 0 \u00b7 \u00b7 \u00b7 w n , projected partial parses projectedPPs={\u03c1(r i ) \u00b7 \u00b7 \u00b7 \u03c1(r m )}, unconnected words unConn and weight vector \u2212 \u2192 w , the builtPPs and candidateArcs are initiated as in algorithm 1. Then the arc with the highest score is selected."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-100",
"text": "The blue colored dotted arcs are the additional candidate arcs that are added to candidateArcs algorithm 1. If it doesn't, it is treated as a negative sample and a corresponding positive candidate arc which is present both projectedPPs and candidateArcs is selected (lines 11-12)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-101",
"text": "The weights of the positive candidate arc are increased while that of the negative sample (best arc) are decreased."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-102",
"text": "To reduce over fitting, we use averaged weights (Collins, 2002) in algorithm 1."
},
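The perceptron-style update above can be sketched as follows. The feature representation (a dict of counts per arc) and the learning rate are assumptions for illustration, not the paper's exact implementation:

```python
# Sketch of the perceptron update: when the best-scoring candidate arc
# is not in the projected partial parses, decrease its feature weights
# and increase those of a positive candidate (an arc present in both
# projectedPPs and candidateArcs).

def perceptron_update(weights, features, best_arc, projected, candidates, lr=1.0):
    """weights: dict feature -> float, updated in place.
    features: function arc -> dict of feature counts.
    projected: set of gold (projected) arcs; candidates: candidate arcs."""
    if best_arc in projected:
        return  # correct prediction: no update
    # negative sample: the wrongly chosen arc
    for f, v in features(best_arc).items():
        weights[f] = weights.get(f, 0.0) - lr * v
    # positive sample: any candidate arc that is also projected
    positive = next((a for a in candidates if a in projected), None)
    if positive is not None:
        for f, v in features(positive).items():
            weights[f] = weights.get(f, 0.0) + lr * v

# Toy usage with one indicator feature per arc.
feats = lambda arc: {f"arc:{arc}": 1.0}
w = {}
perceptron_update(w, feats, best_arc=(1, 2), projected={(2, 1)}, candidates=[(1, 2), (2, 1)])
```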
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-104",
"text": "**ALGORITHM 3 LEARNING FOR NON-DIRECTIONAL GREEDY PARTIAL PARSING ALGORITHM**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-105",
"text": "Input: sentence w0 \u00b7 \u00b7 \u00b7 wn, projected partial parses projectedPPs, unconnected words unConn, current \u2212 \u2192 w Output: updated \u2212 \u2192 w 1: builtPPs = {\u03c1(r1) \u00b7 \u00b7 \u00b7 \u03c1(rn)} \u2190 {w0 \u00b7 \u00b7 \u00b7 wn} 2: candidateArcs = initCandidateArcs(w0 \u00b7 \u00b7 \u00b7 wn)"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-107",
"text": "**EXTENDED GNPPA (E-GNPPA)**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-108",
"text": "The GNPPA described in section 3.1 assumes that the partial parses are contiguous."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-109",
"text": "The example in Figure 5 has a partial tree \u03c1(dikhataa) which isn't contiguous."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-110",
"text": "Its yield doesn't contain bahuta and sundara."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-111",
"text": "We call such noncontiguous partial parses whose yields encompass the yield of an other partial parse as partially contiguous."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-112",
"text": "Partially contiguous parses are common in the projected data and would not be parsable by the algorithm 1 (\u03c1(dikhataa) \u2192 \u03c1(kottaige) would not be identified)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-113",
"text": "In order to identify and learn from relations which are part of partially contiguous partial parses, we propose an extension to GNPPA."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-114",
"text": "The extended GNPAA (E-GNPPA) broadens its scope while searching for possible candidate arcs given R and builtPPs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-115",
"text": "If the immediate previous or the next partial parses over which arcs are to be formed are designated unconnected nodes, the parser looks further for a partial parse over which it can form arcs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-116",
"text": "For example, in Figure 4b , the arc \u03c1(para) \u2192 \u03c1(banaa) can not be added to the candidateArcs since banaa is a designated unconnected node in unConn."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-117",
"text": "The E-GNPPA looks over the unconnected node and adds the arc \u03c1(para) \u2192 \u03c1(huaa) to the candidate arcs list candidateArcs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-118",
"text": "E-GNPPA differs from algorithm 1 in lines 2 and 8."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-119",
"text": "The E-GNPPA uses an extended initialization method initCandidateArcsExtended(w 0 ) for Parent and Child par.pos, chd.pos, par.lex, chd.lex Sentence Context par-1.pos, par-2.pos, par+1.pos, par+2.pos, par-1.lex, par+1.lex chd-1.pos, chd-2.pos, chd+1.pos, chd+2.pos, chd-1.lex, chd+1.lex Table 1 : Information on which features are defined."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-120",
"text": "par denotes the parent in the relation and chd the child."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-121",
"text": ".pos and .lex is the POS and word-form of the corresponding node."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-122",
"text": "+/-i is the previous/next i th word in the sentence."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-123",
"text": "leftMostChild() and rightMostChild() denote the left most and right most children of a node."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-124",
"text": "leftSibling() and rightSibling() get the immediate left and right siblings of a node."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-125",
"text": "previousPP() and nextPP() return the immediate previous and next partial parses of the arc in builtPPs at the state."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-127",
"text": "**STRUCTURAL INFO LEFTMOSTCHILD(PAR).POS, RIGHTMOSTCHILD(PAR).POS, LEFTSIBLING(CHD).POS, RIGHTSIBLING(CHD).POS PARTIAL PARSE CONTEXT PREVIOUSPP().POS, PREVIOUSPP().LEX, NEXTPP().POS, NEXTPP().LEX**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-128",
"text": "candidateArcs in line 2 and an extended procedure updateCandidateArcsExtended to update the candidateArcs after each step in line 8."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-129",
"text": "Algorithm 4 shows the changes w.r.t algorithm 2."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-130",
"text": "Figure 4 presents the steps taken by the E-GNPPA parser for the example parse in Figure 5 ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-132",
"text": "**ALGORITHM 4 UPDATECANDIDATEARCSEXTENDED ( BESTARC, CANDIDATEARCS, BUILTPPS,UNCONN )**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-133",
"text": "\u00b7 \u00b7 \u00b7 lines 1 to 7 of Algorithm 2 \u00b7 \u00b7 \u00b7 prevPP = builtPPs.previousPP(bestArc) while prevPP \u2208 unConn do prevPP = builtPPs.previousPP(prevPP) end while nextPP = builtPPs.nextPP(bestArc) while nextPP \u2208 unConn do nextPP = builtPPs.nextPP(nextPP) end while \u00b7 \u00b7 \u00b7 lines 10 to 24 of Algorithm 2 \u00b7 \u00b7 \u00b7"
},
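The neighbor search in Algorithm 4, which skips over partial parses rooted at designated unconnected nodes, can be sketched as below, under the assumption that builtPPs is an ordered list of partial-parse roots:

```python
# Sketch of E-GNPPA's extended neighbor lookup: when choosing the
# previous/next partial parse to form candidate arcs with, keep looking
# past roots that are designated unconnected (members of unConn).

def previous_pp(built, root, unconn):
    """Nearest connectable partial parse before the one rooted at `root`."""
    i = built.index(root) - 1
    while i >= 0 and built[i] in unconn:
        i -= 1  # look further left, past unconnected roots
    return built[i] if i >= 0 else None

def next_pp(built, root, unconn):
    """Nearest connectable partial parse after the one rooted at `root`."""
    i = built.index(root) + 1
    while i < len(built) and built[i] in unconn:
        i += 1  # look further right, past unconnected roots
    return built[i] if i < len(built) else None

# With roots [0..4] and root 2 designated unconnected, the parser
# "looks over" root 2, analogous to skipping banaa to reach huaa.
built = [0, 1, 2, 3, 4]
assert previous_pp(built, 3, unconn={2}) == 1
assert next_pp(built, 1, unconn={2}) == 3
```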
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-135",
"text": "**FEATURES**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-136",
"text": "Features for a relation (candidate arc) are defined on the POS tags and lexical items of the nodes in the relation and those in its context."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-137",
"text": "Two kinds of context are used a) context from the input sentence (sentence context) b) context in builtPPs i.e. nearby partial parses (partial parse context)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-138",
"text": "Information from the partial parses (structural info) such as left and right most children of the parent node in the relation, left and right siblings of the child node in the relation are also used."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-139",
"text": "Table 1 lists the information on which features are defined in the various configurations of the three language parsers."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-140",
"text": "The actual features are combinations of the information present in the table."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-141",
"text": "The set varies depending on the language and whether its GNPPA or E-GNPPA approach."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-142",
"text": "While training, no features are defined on whether a node is unconnected (present in unConn) or not as this information isn't available during testing."
},
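For illustration, a feature extractor over a subset of the Table 1 information might look as follows. The exact feature combinations are language-dependent and not spelled out in the text, so the particular set below is an assumption:

```python
# Sketch of arc feature extraction: POS/lex of parent and child,
# sentence context, structural info from the partial parses, and one
# example combination feature. Field names follow Table 1.

def arc_features(par, chd, pos, lex, left_child, right_child):
    """par, chd: word indices; pos, lex: per-index lists;
    left_child/right_child: maps index -> leftmost/rightmost child
    index in the current partial parses (absent key = no child)."""
    n = len(pos)
    ctx = lambda seq, i: seq[i] if 0 <= i < n else "<NULL>"
    lmc = left_child.get(par)
    rmc = right_child.get(par)
    return {
        # parent and child
        "par.pos=" + pos[par], "chd.pos=" + pos[chd],
        "par.lex=" + lex[par], "chd.lex=" + lex[chd],
        # sentence context (a subset of Table 1)
        "par-1.pos=" + ctx(pos, par - 1), "par+1.pos=" + ctx(pos, par + 1),
        "chd-1.pos=" + ctx(pos, chd - 1), "chd+1.pos=" + ctx(pos, chd + 1),
        # structural info from the partial parses
        "lmc(par).pos=" + (pos[lmc] if lmc is not None else "<NULL>"),
        "rmc(par).pos=" + (pos[rmc] if rmc is not None else "<NULL>"),
        # combination feature (the text says features combine table entries)
        "par.pos+chd.pos=" + pos[par] + "_" + pos[chd],
    }

pos = ["ROOT", "NN", "PSP", "VM"]
lex = ["<root>", "pahaada", "para", "banaa"]
f = arc_features(1, 2, pos, lex, left_child={}, right_child={})
```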
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-144",
"text": "**HINDI PROJECTED DEPENDENCY TREEBANK**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-145",
"text": "We conducted experiments on English-Hindi parallel data by transferring syntactic information from English to Hindi to build a projected dependency treebank for Hindi."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-146",
"text": "The TIDES English-Hindi parallel data containing 45,000 sentences was used for this purpose 1 (Venkatapathy, 2008) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-147",
"text": "Word alignments for these sentences were obtained using the widely used GIZA++ toolkit in grow-diag-final-and mode (Och and Ney, 2003) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-148",
"text": "Since Hindi is a morphologically rich language, root words were used instead of the word forms."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-149",
"text": "A bidirectional English POS tagger (Shen et al., 2007) was used to POS tag the source sentences and the parses were obtained using the first order MST parser (McDonald et al., 2005) trained on dependencies extracted from Penn treebank using the head rules of Yamada and Matsumoto (2003) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-150",
"text": "A CRF based Hindi POS tagger (PVS. and Gali, 2007) was used to POS tag the target sentences."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-151",
"text": "English and Hindi being morphologically and syntactically divergent makes the word alignment and dependency projection a challenging task."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-152",
"text": "The source dependencies are projected using an approach similar to (Hwa et al., 2005) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-153",
"text": "While they use post-projection transformations on the projected parse to account for annotation differences, we use pre-projection transformations on the source parse."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-154",
"text": "The projection algorithm pro-duces acyclic parses which could be unconnected and non-projective."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-156",
"text": "**ANNOTATION DIFFERENCES IN HINDI AND ENGLISH**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-157",
"text": "Before projecting the source parses onto the target sentence, the parses are transformed to reflect the annotation scheme differences in English and Hindi."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-158",
"text": "While English dependency parses reflect the PTB annotation style (Marcus et al., 1994) , we project them to Hindi to reflect the annotation scheme described in (Begum et al., 2008) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-159",
"text": "The differences in the annotation schemes are with respect to three phenomena: a) head of a verb group containing auxiliary and main verbs, b) prepositions in a prepositional phrase (PP) and c) coordination structures."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-160",
"text": "In the English parses, the auxiliary verb is the head of the main verb while in Hindi, the main verb is the head of the auxiliary in the verb group."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-161",
"text": "For example, in the Hindi parse in Figure 1 , dikhataa is the head of the auxiliary verb hai."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-162",
"text": "The prepositions in English are realized as postpositions in Hindi."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-163",
"text": "While prepositions are the heads in a preposition phrase, post-positions are the modifiers of the preceding nouns in Hindi."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-164",
"text": "In pahaada para (on the hill), hill is the head of para."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-165",
"text": "In coordination structures, while English differentiates between how NP coordination and VP coordination structures behave, Hindi annotation scheme is consistent in its handling."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-166",
"text": "Leftmost verb is the head of a VP coordination structure in English whereas the rightmost noun is the head in case of NP coordination."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-167",
"text": "In Hindi, the conjunct is the head of the two verbs/nouns in the coordination structure."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-168",
"text": "These three cases are identified in the source tree and appropriate transformations are made to the source parse itself before projecting the relations using word alignments."
},
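The verb-group transformation described above (making the main verb head the auxiliary rather than vice versa) can be sketched as follows. The tag names and the head-array representation are illustrative assumptions, not the paper's implementation:

```python
# Sketch of a pre-projection head transformation: wherever a main verb
# is headed by an auxiliary (PTB style), flip the relation so the main
# verb heads the auxiliary (Hindi style). Tags here are illustrative.

def transform_aux(heads, tags):
    """heads[i] is the head index of word i (None for the root);
    tags[i] is its POS tag. Returns a transformed copy."""
    heads = list(heads)
    for i, h in enumerate(heads):
        if h is not None and tags[i].startswith("VB") and tags[h] == "AUX":
            heads[i] = heads[h]   # main verb takes over the auxiliary's head
            heads[h] = i          # auxiliary now depends on the main verb
    return heads

# "is looking": under PTB style the auxiliary (index 1) heads the main
# verb (index 2); after the transform the main verb is the head.
tags = ["ROOT", "AUX", "VBG"]
heads = [None, 0, 1]
assert transform_aux(heads, tags) == [None, 2, 0]
```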
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-170",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-171",
"text": "We carried out all our experiments on parallel corpora belonging to English-Hindi, EnglishBulgarian and English-Spanish language pairs."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-172",
"text": "While the Hindi projected treebank was obtained using the method described in section 4, Bulgarian and Spanish projected datasets were obtained using the approach in (Ganchev et al., 2009) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-173",
"text": "The datasets of Bulgarian and Spanish that contributed to the best accuracies for Ganchev et al. (2009) were used in our work (7 rules dataset for Bulgarian and 3 rules dataset for Spanish)."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-174",
"text": "The Hindi, Bulgarian and Spanish projected dependency treebanks have 44760, 39516 and 76958 sentences respectively."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-175",
"text": "Since we don't have confidence scores for the projections on the sentences, we picked 10,000 sentences randomly in each of the three datasets for training the parsers 2 ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-176",
"text": "Other methods of choosing the 10K sentences such as those with the max."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-177",
"text": "no. of relations, those with least no. of unconnected words, those with max."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-178",
"text": "no. of contiguous partial trees that can be learned by GNPPA parser etc. were tried out."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-179",
"text": "Among all these, random selection was consistent and yielded the best results."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-180",
"text": "The errors introduced in the projected parses by errors in word alignment, source parser and projection are not consistent enough to be exploited to select the better parses from the entire projected data."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-181",
"text": "Table 2 gives an account of the randomly chosen 10k sentences in terms of the number of words, words without parents etc."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-182",
"text": "Around 40% of the words spread over 88% of sentences in Bulgarian and 97% of sentences in Spanish have no parents."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-183",
"text": "Traditional dependency parsers which only train from fully connected trees would not be able to learn from these sentences."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-184",
"text": "P(GNPPA) is the percentage of relations in the data that are learned by the GNPPA parser satisfying the contiguous partial tree constraint and P(E-GNPPA) is the per- Table 3 : UAS for Hindi, Bulgarian and Spanish with the baseline, GNPPA and E-GNPPA parsers trained on 10k parses selected randomly."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-185",
"text": "Punct indicates evaluation with punctuation whereas NoPunct indicates without punctuation."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-186",
"text": "* next to an accuracy denotes statistically significant (McNemar's and p < 0.05) improvement over the baseline."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-187",
"text": "\u2020 denotes significance over GNPPA centage that satisfies the partially contiguous constraint."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-188",
"text": "E-GNPPA parser learns around 2-5% more no. of relations than GNPPA due to the relaxation in the constraints."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-189",
"text": "The Hindi test data that was released as part of the ICON-2010 Shared Task (Husain et al., 2010) was used for evaluation."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-190",
"text": "For Bulgarian and Spanish, we used the same test data that was used in the work of Ganchev et al. (2009) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-191",
"text": "These test datasets had sentences from the training section of the CoNLL Shared Task (Nivre et al., 2007) that had lengths less than or equal to 10."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-192",
"text": "All the test datasets have gold POS tags."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-193",
"text": "A baseline parser was built to compare learning from partial parses with learning from fully connected parses."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-194",
"text": "Full parses are constructed from partial parses in the projected data by randomly assigning parents to unconnected parents, similar to the work in (Hwa et al., 2005) ."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-195",
"text": "The unconnected words in the parse are selected randomly one by one and are assigned parents randomly to complete the parse."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-196",
"text": "This process is repeated for all the sentences in the three language datasets."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-197",
"text": "The parser is then trained with the GNPPA algorithm on these fully connected parses to be used as the baseline."
},
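The baseline's random completion of partial parses can be sketched as follows. The head-array representation is an assumption; as the discussion later notes, acyclicity and projectivity constraints are deliberately not enforced:

```python
import random

# Sketch of the baseline data construction: unconnected words (head is
# None; index 0 is the dummy root) are picked in random order and each
# is assigned a random parent to complete the parse.

def complete_randomly(head, rng=random):
    """Assign a random parent to every unconnected word except the
    dummy root at index 0. Returns a completed copy."""
    head = list(head)
    unconnected = [i for i in range(1, len(head)) if head[i] is None]
    rng.shuffle(unconnected)
    for i in unconnected:
        # any other word (including the root) may become the parent;
        # acyclicity/projectivity are not enforced
        head[i] = rng.choice([j for j in range(len(head)) if j != i])
    return head

completed = complete_randomly([None, None, 1, None])
assert all(h is not None for h in completed[1:])
```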
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-198",
"text": "Table 3 lists the accuracies of the baseline, GNPPA and E-GNPPA parsers."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-199",
"text": "The accuracies are unlabeled attachment scores (UAS): the percentage of words with the correct head."
},
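The UAS metric just defined is simple to state in code. This sketch assumes head arrays with a dummy root at index 0 that is excluded from scoring:

```python
# Sketch of unlabeled attachment score (UAS): the percentage of words
# whose predicted head matches the gold head.

def uas(pred_heads, gold_heads):
    """Both arguments are lists of head indices; index 0 (dummy root)
    is excluded from scoring."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads[1:], gold_heads[1:]))
    return 100.0 * correct / (len(gold_heads) - 1)

# 2 of 3 scored words have the correct head.
score = uas([None, 0, 1, 1], [None, 0, 1, 2])
```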
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-200",
"text": "Table 4 compares our accuracies with those reported in (Ganchev et al., 2009) for Bulgarian and Spanish."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-202",
"text": "**DISCUSSION**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-203",
"text": "The baseline reported in (Ganchev et al., 2009 ) significantly outperforms our baseline (see Table 4 ) due to the different baselines used in both the works."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-204",
"text": "In our work, while creating the data for the baseline by assigning random parents to unconnected words, acyclicity and projectivity con- Table 4 : Comparison of baseline, GNPPA and E-GNPPA with baseline and discriminative model from (Ganchev et al., 2009) for Bulgarian and Spanish."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-205",
"text": "Evaluation didn't include punctuation."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-206",
"text": "straints are not enforced."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-207",
"text": "Ganchev et al. (2009) 's baseline is similar to the first iteration of their discriminative model and hence performs better than ours."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-208",
"text": "Our Bulgarian E-GNPPA parser achieved a 1.8% gain over theirs while the Spanish results are lower."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-209",
"text": "Though their training data size is also 10K, the training data is different in both our works due to the difference in the method of choosing 10K sentences from the large projected treebanks."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-210",
"text": "The GNPPA accuracies (see table 3 ) for all the three languages are significant improvements over the baseline accuracies."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-211",
"text": "This shows that learning from partial parses is effective when compared to imposing the connected constraint on the partially projected dependency parse."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-212",
"text": "Even while projecting source dependencies during data creation, it is better to project high confidence relations than look to project more relations and thereby introduce noise."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-213",
"text": "The E-GNPPA which also learns from partially contiguous partial parses achieved statistically significant gains for all the three languages."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-214",
"text": "The gains across languages is due to the fact that in the 10K data that was used for training, E-GNPPA parser could learn 2 \u2212 5% more relations over GNPPA (see Table 2 )."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-215",
"text": "Figure 6 shows the accuracies of baseline and E- GNPPA parser for the three languages when training data size is varied."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-216",
"text": "The parsers peak early with less than 1000 sentences and make small gains with the addition of more data."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-218",
"text": "**CONCLUSION**"
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-219",
"text": "We presented a non-directional parsing algorithm that can learn from partial parses using syntactic and contextual information as features."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-220",
"text": "A Hindi projected dependency treebank was developed from English-Hindi bilingual data and experiments were conducted for three languages Hindi, Bulgarian and Spanish."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-221",
"text": "Statistically significant improvements were achieved by our partial parsers over the baseline system."
},
{
"sent_id": "0a7710557d020087035f4a94b5661c-C001-222",
"text": "The partial parsing algorithms presented in this paper are not specific to bitext projections and can be used for learning from partial parses in any setting."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"0a7710557d020087035f4a94b5661c-C001-9"
],
[
"0a7710557d020087035f4a94b5661c-C001-13"
],
[
"0a7710557d020087035f4a94b5661c-C001-19",
"0a7710557d020087035f4a94b5661c-C001-20"
],
[
"0a7710557d020087035f4a94b5661c-C001-22"
],
[
"0a7710557d020087035f4a94b5661c-C001-34",
"0a7710557d020087035f4a94b5661c-C001-35"
],
[
"0a7710557d020087035f4a94b5661c-C001-38",
"0a7710557d020087035f4a94b5661c-C001-39"
],
[
"0a7710557d020087035f4a94b5661c-C001-190",
"0a7710557d020087035f4a94b5661c-C001-191",
"0a7710557d020087035f4a94b5661c-C001-192",
"0a7710557d020087035f4a94b5661c-C001-193"
]
],
"cite_sentences": [
"0a7710557d020087035f4a94b5661c-C001-22",
"0a7710557d020087035f4a94b5661c-C001-38",
"0a7710557d020087035f4a94b5661c-C001-39",
"0a7710557d020087035f4a94b5661c-C001-190"
]
},
"@MOT@": {
"gold_contexts": [
[
"0a7710557d020087035f4a94b5661c-C001-13"
],
[
"0a7710557d020087035f4a94b5661c-C001-19",
"0a7710557d020087035f4a94b5661c-C001-20"
],
[
"0a7710557d020087035f4a94b5661c-C001-34",
"0a7710557d020087035f4a94b5661c-C001-35"
],
[
"0a7710557d020087035f4a94b5661c-C001-38",
"0a7710557d020087035f4a94b5661c-C001-39"
]
],
"cite_sentences": [
"0a7710557d020087035f4a94b5661c-C001-38",
"0a7710557d020087035f4a94b5661c-C001-39"
]
},
"@EXT@": {
"gold_contexts": [
[
"0a7710557d020087035f4a94b5661c-C001-43"
]
],
"cite_sentences": [
"0a7710557d020087035f4a94b5661c-C001-43"
]
},
"@USE@": {
"gold_contexts": [
[
"0a7710557d020087035f4a94b5661c-C001-172"
],
[
"0a7710557d020087035f4a94b5661c-C001-173"
],
[
"0a7710557d020087035f4a94b5661c-C001-190",
"0a7710557d020087035f4a94b5661c-C001-191",
"0a7710557d020087035f4a94b5661c-C001-192",
"0a7710557d020087035f4a94b5661c-C001-193"
]
],
"cite_sentences": [
"0a7710557d020087035f4a94b5661c-C001-172",
"0a7710557d020087035f4a94b5661c-C001-173",
"0a7710557d020087035f4a94b5661c-C001-190"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"0a7710557d020087035f4a94b5661c-C001-200"
],
[
"0a7710557d020087035f4a94b5661c-C001-204"
],
[
"0a7710557d020087035f4a94b5661c-C001-207"
]
],
"cite_sentences": [
"0a7710557d020087035f4a94b5661c-C001-200",
"0a7710557d020087035f4a94b5661c-C001-204"
]
},
"@DIF@": {
"gold_contexts": [
[
"0a7710557d020087035f4a94b5661c-C001-203"
]
],
"cite_sentences": [
"0a7710557d020087035f4a94b5661c-C001-203"
]
}
}
},
"ABC_5c13e64d468b8a1c403072f213c992_5": {
"x": [
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-2",
"text": "1 In this paper, we present a study for extracting and aligning paraphrases in the context of Sentence Compression."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-25",
"text": "Their approach has three main steps."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-3",
"text": "First, we justify the application of a new measure for the automatic extraction of paraphrase corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-4",
"text": "Second, we discuss the work done by (Barzilay & Lee, 2003) who use clustering of paraphrases to induce rewriting rules."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-5",
"text": "We will see, through classical visualization methodologies (Kruskal & Wish, 1977) and exhaustive experiments, that clustering may not be the best approach for automatic pattern identification."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-6",
"text": "Finally, we will provide some results of different biology based methodologies for pairwise paraphrase alignment."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-9",
"text": "Sentence Compression can be seen as the removal of redundant words or phrases from an input sentence by creating a new sentence in which the gist of the original meaning of the sentence remains unchanged."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-10",
"text": "Sentence Compression takes an important place for Natural Language Processing (NLP) tasks where specific constraints must be satisfied, such as length in summarization (Barzilay & Lee, 2002; Knight & Marcu, 2002; Shinyama et al., 2002; Barzilay & Lee, 2003; Le Nguyen & Ho, 2004; Unno et al., 2006) , style in text simplification (Marsi & Krahmer, 2005) or sentence simplification for subtitling (Daelemans et al., 2004) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-11",
"text": "Generally, Sentence Compression involves performing the following three steps: (1) Extraction of paraphrases from comparable corpora, (2) Alignment of paraphrases and (3) Induction of rewriting rules."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-12",
"text": "Obviously, each of these steps can be performed in many different ways going from totally unsupervised to totally supervised."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-13",
"text": "In this paper, we will focus on the first two steps."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-14",
"text": "In particular, we will first justify the application of a new measure for the automatic extraction of paraphrase corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-15",
"text": "Second, we will discuss the work done by (Barzilay & Lee, 2003) who use clustering of paraphrases to induce rewriting rules."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-16",
"text": "We will see, through classical visualization methodologies (Kruskal & Wish, 1977) and exhaustive experiments, that clustering may not be the best approach for automatic pattern identification."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-17",
"text": "Finally, we will provide some results of different biology based methodologies for pairwise paraphrase alignment."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-18",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-19",
"text": "**RELATED WORK**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-20",
"text": "Two different approaches have been proposed for Sentence Compression: purely statistical methodologies (Barzilay & Lee, 2003; Le Nguyen & Ho, 2004) and hybrid linguistic/statistic methodologies (Knight & Marcu, 2002; Shinyama et al., 2002; Daelemans et al., 2004; Marsi & Krahmer, 2005; Unno et al., 2006) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-21",
"text": "As our work is based on the first paradigm, we will focus on the works proposed by (Barzilay & Lee, 2003) and (Le Nguyen & Ho, 2004) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-22",
"text": "(Barzilay & Lee, 2003 ) present a knowledge-lean algorithm that uses multiple-sequence alignment to learn generate sentence-level paraphrases essentially from unannotated corpus data alone."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-23",
"text": "In contrast to (Barzilay & Lee, 2002) , they need neither parallel data nor explicit information about sentence semantics."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-24",
"text": "Rather, they use two comparable corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-26",
"text": "First, working on each of the comparable corpora separately, they compute lattices compact graph-based representations to find commonalities within groups of structurally similar sentences."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-27",
"text": "Next, they identify pairs of lattices from the two different corpora that are paraphrases of each other."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-28",
"text": "Finally, given an input sentence to be paraphrased, they match it to a lattice and use a paraphrase from the matched lattices mate to generate an output sentence."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-29",
"text": "(Le Nguyen & Ho, 2004) propose a new sentencereduction algorithm that do not use syntactic parsing for the input sentence."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-30",
"text": "The algorithm is an extension of the template-translation algorithm (one of example-based machine-translation methods) via innovative employment of the Hidden Markov model, which uses the set of template rules learned from examples."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-31",
"text": "In particular, (Le Nguyen & Ho, 2004) do not propose any methodology to automatically extract paraphrases."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-32",
"text": "Instead, they collect a corpus by performing the decomposition program using news and their summaries."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-33",
"text": "After correcting them manually, they obtain more than 1,500 pairs of long and reduced sentences."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-34",
"text": "Comparatively, (Barzilay & Lee, 2003) propose to use the N-gram Overlap metric to capture similarities between sentences and automatically create paraphrase corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-35",
"text": "However, this choice is arbitrary and mainly leads to the extraction of quasi-exact or exact matching pairs."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-36",
"text": "For that purpose, we introduce a new metric, the Sumo-Metric."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-37",
"text": "Unlike (Le Nguyen & Ho, 2004) , one interesting idea proposed by (Barzilay & Lee, 2003 ) is to cluster similar pairs of paraphrases to apply multiplesequence alignment."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-38",
"text": "However, once again, this choice is not justified and we will see by classical visualization methodologies (Kruskal & Wish, 1977) and exhaustive experiments by applying different clustering algorithms, that clustering may not be the best approach for automatic pattern identification."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-39",
"text": "As a consequence, we will study global and local biology based sequence alignments compared to multi-sequence alignment that may lead to better results for the induction of rewriting rules."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-41",
"text": "**PARAPHRASE CORPUS CONSTRUCTION**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-42",
"text": "Paraphrase corpora are golden resources for learning monolingual text-to-text rewritten patterns."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-43",
"text": "However, such corpora are expensive to construct manually and will always be an imperfect and biased representation of the language paraphrase phenomena."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-44",
"text": "Therefore, reliable automatic methodologies able to extract paraphrases from text and subsequently corpus construction are crucial, enabling better pattern identification."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-45",
"text": "In fact, text-to-text generation is a particularly promising research direction given that there are naturally occurring examples of comparable texts that convey the same information but are written in different styles."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-46",
"text": "Web news stories are an obvious example."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-47",
"text": "Thus, presented with such texts, one can pair sentences that convey the same information, thereby building a training set of rewriting examples i.e. a paraphrase corpus."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-49",
"text": "**PARAPHRASE IDENTIFICATION**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-50",
"text": "A few unsupervised metrics have been applied to automatic paraphrase identification and extraction (Barzilay & Lee, 2003; Dolan & Brockett, 2004) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-51",
"text": "However, these unsupervised methodologies show a major drawback by extracting quasi-exact 2 or even exact match pairs of sentences as they rely on classical string similarity measures such as the Edit Distance in the case of (Dolan & Brockett, 2004) and word N-gram overlap for (Barzilay & Lee, 2003) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-52",
"text": "Such pairs are clearly useless."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-53",
"text": "More recently, (Anonymous, 2007) proposed a new metric, the Sumo-Metric specially designed for asymmetrical entailed pairs identification, and proved better performance over previous established metrics, even in the specific case when tested with the Microsoft Paraphrase Research Corpus (Dolan & Brockett, 2004) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-54",
"text": "For a given sentence pair, having each sentence x and y words, and with \u03bb exclusive links between the sentences, the Sumo-Metric is defined in Equation 1 and 2."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-55",
"text": "S (x, y, \u03bb) if S(x, y, \u03bb) < 1.0"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-56",
"text": "where"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-57",
"text": "with \u03b1, \u03b2 \u2208 [0, 1] and \u03b1 + \u03b2 = 1. (Anonymous, 2007) show that the Sumo-Metric outperforms all state-of-the-art metrics over all tested corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-58",
"text": "In particular, it shows systematically better F-Measure and Accuracy measures over all other metrics showing an improvement of (1) at least 2.86% in terms of F-Measure and 3.96% in terms of Accuracy and (2) at most 6.61% in terms of FMeasure and 6.74% in terms of Accuracy compared to the second best metric which is also systematically the word N-gram overlap similarity measure used by (Barzilay & Lee, 2003) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-60",
"text": "**CLUSTERING**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-61",
"text": "Literature shows that there are two main reasons to apply clustering for paraphrase extraction."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-62",
"text": "On one hand, as (Barzilay & Lee, 2003) evidence, clusters of paraphrases can lead to better learning of text-totext rewriting rules compared to just pairs of paraphrases."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-63",
"text": "On the other hand, clustering algorithms may lead to better performance than stand-alone similarity measures as they may take advantage of the different structures of sentences in the cluster to detect a new similar sentence."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-64",
"text": "However, as (Barzilay & Lee, 2003) do not propose any evaluation of which clustering algorithm should be used, we experiment a set of clustering algorithms and present the comparative results."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-65",
"text": "Contrarily to what expected, we will see that clustering is not a worthy effort."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-66",
"text": "Instead of extracting only sentence pairs from corpora 3 , one may consider the extraction of paraphrase sentence clusters."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-67",
"text": "There are many well-known clustering algorithms, which may be applied to a corpus sentence set S = {s 1 , ..., s n }."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-68",
"text": "Clustering implies the definition of a similarity or (distance) matrix A n\u00d7n , where each each element a ij is the similarity (distance) between sentences s i and s j ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-69",
"text": "3 A pair may be seen as a cluster with only two elements."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-71",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-72",
"text": "We experimented four clustering algorithms on a corpus of web news stories and then three human judges manually cross-classified a random sample of the generated clusters."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-73",
"text": "They were asked to classify a cluster as a \"wrong cluster\" if it contained at least two sentences without any entailment relation between them."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-74",
"text": "Results are shown in the next table 1."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-75",
"text": "The \"BASE\" column is the baseline, where the Sumo-Metric was applied rather than clustering."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-76",
"text": "Columns \"S-HAC\" and \"C-HAC\" express the results for Single-link and Complete-link Hierarchical Agglomerative Clustering (Jain et al., 1999) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-77",
"text": "The \"QT\" column shows the Quality Threshold algorithm (Heyer et al., 1999) and the last column \"EM\" is the Expectation Maximization clustering algorithm (Hogg et al., 2005) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-78",
"text": "One main conclusion, from table 1 is that clustering tends to achieve worst results than simple paraphrase pair extraction."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-79",
"text": "Only the QT achieves better results, but if we take the average of the four clustering algorithms it is equal to 0.568, smaller than the 0.618 baseline."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-80",
"text": "Moreover, these results with the QT algorithm were applied with a very restrictive value for cluster attribution as it is shown in table 2 with an average of almost two sentences per cluster."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-81",
"text": "In fact, table 2 shows that most of the clusters have less than 6 sentences which leads to question the results presented by (Barzilay & Lee, 2003) who only keep the clusters that contain more than 10 sentences."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-82",
"text": "In fact, the first conclusion is that the number of experimented clusters is very low, and more important, all clusters with more than 10 sentences showed to be of very bad quality."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-83",
"text": "The next subsection will reinforce the sight that clustering is a worthless effort for automatic paraphrase corpora construction."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-85",
"text": "**VISUALIZATION**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-86",
"text": "In this subsection, we propose a visual analysis of the different similarity measures tested previously: the Edit Distance (Levenshtein, 1966) , the BLEU metric (Papineni et al., 2001) , the word Ngram overlap and the Sumo-Metric."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-87",
"text": "The goal of this study is mainly to give the reader a visual interpretation about the organization each measure induces on the data."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-88",
"text": "To perform this study, we use a Multidimensional Scaling (MDS) process which is a traditional data analysis technique."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-89",
"text": "MDS (Kruskal & Wish, 1977) allows to display the structure of distance-like data into an Euclidean space."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-90",
"text": "Since the only available information is a similarity in our case, we transform similarity values into distance values as in Equation 3."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-91",
"text": "This transformation enables to obtain a (pseudo) distance measure satisfying properties like minimality, identity and symmetry."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-92",
"text": "On a theoretical point of view, the measure we obtain is a pseudo-distance only, since triangular inequality is not necessary satisfied."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-93",
"text": "In practice, the projection space we build with the MDS from such a pseudo-distance is sufficient to have an idea about whether data are organized into classes."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-94",
"text": "We perform the MDS process on 500 sentences 4 randomly selected from the Microsoft Research Paraphrase Corpus."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-95",
"text": "In particular, the projection over the three first eigenvectors (or proper vectors) provides the best visualization where data are clearly organized into several classes (at least two classes)."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-96",
"text": "The obtained visualizations (Figure 1) show distinctly that no particular data organization can be drawn from the used similarity measures."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-97",
"text": "Indeed, we observe only one central class with some \"satellite\" data randomly placed around the class."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-98",
"text": "The last observation allows us to anticipate on the results we could obtain with a clustering step."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-99",
"text": "First, clustering seems not to be a natural way to manage 4 The limitation to 500 data is due to computation costs since MDS requires the diagonalization of the square similarity or distance matrix."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-100",
"text": "such data."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-101",
"text": "Then, according to the clustering method used, several types of clusters can be expected: very small clusters which contain \"satellite\" data (pretty relevant) or large clusters with part of the main central class (pretty irrelevant)."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-102",
"text": "These results confirm the observed figures in the previous subsection and reinforce the sight that clustering is a worthless effort for automatic paraphrase corpora construction, contrarily to what (Barzilay & Lee, 2003) suggest."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-104",
"text": "**BIOLOGY BASED ALIGNMENTS**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-105",
"text": "Sequence alignments have been extensively explored in bioinformatics since the beginning of the Human Genome Project."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-106",
"text": "In general, one wants to align two sequences of symbols (genes in Biology) to find structural similarities, differences or transformations between them."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-107",
"text": "In NLP, alignment is relevant in sub-domains like Text Generation (Barzilay & Lee, 2002) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-108",
"text": "In our work, we employ alignment methods for aligning words between two sentences, which are paraphrases."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-109",
"text": "The words are the base blocks of our sequences (sentences)."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-110",
"text": "There are two main classes of pairwise alignments: the global and local classes."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-111",
"text": "In the first one, the algorithms try to fully align both sequences, admitting gap insertions at a certain cost, while in the local methods the goal is to find pairwise subalignments."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-112",
"text": "How suitable each algorithm may be applied to a certain problem is discussed in the next two subsections."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-114",
"text": "**GLOBAL ALIGNMENT**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-115",
"text": "The well established and widely used NeedlemanWunsch algorithm for pairwise global sequence alignment, uses dynamic programming to find the best possible alignment between two sequences."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-116",
"text": "It is an optimal algorithm."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-117",
"text": "However, it reveals space and time inefficiency as sequence length increases, since an m * n matrix must be maintained and processed during computations."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-118",
"text": "This is the case with DNA sequence alignments, composed by many thousands of nucleotides."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-119",
"text": "Therefore, a huge optimization effort were engaged and new algorithms appeared like ktuple, not guaranteeing to find optimal alignments but able to tackle the complexity problem."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-120",
"text": "In our alignment tasks, we do not have these com-plexity obstacles, because in our corpora the mean length of a sentence is equal to 20.9 words, which is considerably smaller than in a DNA sequence."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-121",
"text": "Therefore an implementation of the NeedlemanWunsch algorithm has been used to generate optimal global alignments."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-122",
"text": "The figure 2 exemplifies a global word alignment on a paraphrase pair."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-124",
"text": "**LOCAL ALIGNMENT**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-125",
"text": "The Smith-Waterman (SW) algorithm is similar to the Needleman Wunsch (NW) one, since dynamic programming is also followed hence denoting the similar complexity issues, to which our alignment task is immune."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-126",
"text": "The main difference is that SW seeks optimal sub-alignments instead of a global alignment and, as described in the literature, it is well tailored for pairs with considerable differences 5 , in length and type."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-127",
"text": "In Remark that in the second pair, only the maximal local sub-alignment is shown."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-128",
"text": "However, there exists another sub-alignment: (DRQ, D-Q)."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-129",
"text": "This means that local alignment may be tuned to generate not only the maximum sub-alignment but a set of subalignments that satisfy some criterium, like having alignment value greater than some minimum threshold."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-130",
"text": "In fact, this is useful in our word alignment problem and were experimented by adapting the Smith Waterman algorithm."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-132",
"text": "**DYNAMIC ALIGNMENT**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-133",
"text": "According to the previous two subsections, where two alignment strategies were presented, a natural question rises: which alignment algorithm to use for our problem of inter-sentence word alignment?"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-134",
"text": "Initially, we thought to use only the global 5 With sufficient similar sequences there is no difference between NW and SW."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-135",
"text": "6 As in DNA subsequences and is same for word sequences."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-136",
"text": "If a global alignment is applied for such a pair, then weird alignments will be generated, like the one that is shown in the next representation (we use character sequences for space convenience and try to preserve the word first letter, from the previous example):"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-137",
"text": "Here it would be more adequate to apply local alignment and extract all relevant sub-alignments."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-138",
"text": "In this case, two sub-alignments would be generated:"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-139",
"text": "|D H M S| |T P R P I R| |D H _ S| |T P _ P I R| Therefore, for inter-paraphrase word alignments, we propose a dynamic algorithm which chooses the most appropriate alignment strategy to perform: global or local."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-140",
"text": "To compute this pre-scan, we use the notion of link-crossing between sequences, as illustrated in Figure 3 , where the 4 crossings are marked with small squares."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-141",
"text": "It is easily verifiable that the maximum number of crossings between two sequences with n exclusive links is \u03b8 = n * (n \u2212 1) / 2."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-142",
"text": "We suggest that if the number of observed crossings exceeds some fraction of this maximum, for example 0.4 * \u03b8 or 0.5 * \u03b8, then a local alignment should be used."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-143",
"text": "Remark that the closer this fraction is to 1.0, the less likely a global alignment is to be adequate."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-144",
"text": "Crossings may be calculated by taking index pairs (x i , y i ) to represent links between sequences, where x i and y i are respectively the first and second sequence indexes; for instance, in Figure 3 the \"U\" link has pair (5, 1)."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-145",
"text": "It is easily verifiable that two links (x i , y i ) and (x j , y j ) have a crossing point if:"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-147",
"text": "**ALIGNMENT WITH SIMILARITY MATRIX**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-148",
"text": "In bioinformatics, DNA sequence alignment algorithms are usually guided by a scoring function, grounded in the field of expertise, that defines the mutation probability between nucleotides."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-149",
"text": "These scoring functions are defined by PAM 7 or BLOSUM 8 matrices and encode evolutionary approximations regarding the rates and probabilities of amino acid mutations."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-150",
"text": "Different matrices might produce different alignments."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-151",
"text": "Subsequently, this motivated the idea of modeling word mutation."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-152",
"text": "It seems intuitive to allow such a word mutation, considering the possible relationships that exist between words: lexical, syntactic or semantic."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-153",
"text": "For example, it seems evident that between spirit and spiritual there exists a stronger relation (higher mutation probability) than between spiritual and hamburger."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-154",
"text": "A natural choice for a word mutation function is the Edit-distance (Levenshtein, 1966) (edist(., .)), used as a negative reward for word alignment."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-155",
"text": "For a given word pair (w i , w j ), the greater the Edit-distance value, the less likely the word w i is to be aligned with word w j ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-156",
"text": "However, some early experiments with this function revealed that it leads to problems by enabling alignments between very different words, like (total, israel), (fire, made) or (troops, members), despite many good alignments also being achieved."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-157",
"text": "This happens because the Edit-distance returns relatively small values, unable to penalize different words, like the ones listed before, strongly enough to inhibit the alignment."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-158",
"text": "In bioinformatics language, it means that even for such pairs the mutation probability is still high."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-159",
"text": "Another problem of the Edit-distance is that it does not distinguish between long and short words: for instance, the pairs (in, by) and (governor, governed) both have an Edit-distance equal to 2."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-160",
"text": "As a consequence, we propose a new function (Equation 4) for word mutation penalization, which better addresses the problems mentioned above."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-161",
"text": "The idea is to divide the Edit-distance value by the normalized 9 length of the maximum common subsequence maxseq(., .) between the two words."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-162",
"text": "For example, the longest common subsequence for the pair (w 1 , w 2 ) = (reinterpretation, interpreted) is \"interpret\", 7 Point Accepted Mutation."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-163",
"text": "8 Blocks Substitution Matrices."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-164",
"text": "9 The length of the longest common subsequence divided by the length of the longest word."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-165",
"text": "with length equal to 9 and maxseq(w 1 , w 2 ) = 9 / max{16, 11} = 0.5625"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-166",
"text": "where \u03b5 is a small value 10 that acts like a \"safety hook\" against division by zero, when maxseq(w i , w j ) = 0."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-167",
"text": "Table 4 : Word mutation functions comparison."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-168",
"text": "Remark that with the costAlign(., .) scoring function the problems with pairs like (in, by) simply vanish."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-169",
"text": "The smaller the words, the more constrained the mutation will be."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-171",
"text": "**EXPERIMENTS AND RESULTS**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-173",
"text": "**CORPUS OF PARAPHRASES**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-174",
"text": "To test our alignment method, we used two types of corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-175",
"text": "The first is the \"DUC 2002\" corpus (DUC2002) and the second was automatically extracted from related web news stories (WNS)."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-176",
"text": "For both original corpora, paraphrase extraction was performed using the Sumo-Metric, and two corpora of paraphrases were obtained."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-177",
"text": "Afterwards, the alignment algorithm was applied to both corpora."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-179",
"text": "**QUALITY OF DYNAMIC ALIGNMENT**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-180",
"text": "We tested the proposed alignment methods by giving a sample of 201 aligned paraphrase sentence pairs to a human judge, who was asked to classify each pair as correct, acorrect 11 , error 12 , or merror 13 ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-181",
"text": "We also asked the judge to classify the local alignment choice 14"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-183",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-184",
"text": "A set of important steps toward the automatic construction of aligned paraphrase corpora is presented, and the relevant inherent issues, like clustering and alignment, are discussed."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-185",
"text": "Experiments using 4 algorithms and visualization techniques revealed that clustering is a worthless effort for paraphrase corpora construction, contrary to the claims in the literature (Barzilay & Lee, 2003) ."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-186",
"text": "Therefore, simple paraphrase pair extraction is suggested instead, using a recent and more reliable metric (Sumo-Metric) (Anonymous, 2007) designed for asymmetrical entailed pairs."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-187",
"text": "We also propose a dynamic choice of the alignment algorithm, as well as a word scoring function for the alignment algorithms."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-188",
"text": "In the future we intend to clean the automatically constructed corpus by introducing syntactic constraints to filter out wrong alignments."
},
{
"sent_id": "5c13e64d468b8a1c403072f213c992-C001-189",
"text": "Our next step will be to employ Machine Learning techniques for rewriting rule induction, by using this automatically constructed aligned paraphrase corpus."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5c13e64d468b8a1c403072f213c992-C001-4",
"5c13e64d468b8a1c403072f213c992-C001-5"
],
[
"5c13e64d468b8a1c403072f213c992-C001-10"
],
[
"5c13e64d468b8a1c403072f213c992-C001-15",
"5c13e64d468b8a1c403072f213c992-C001-16"
],
[
"5c13e64d468b8a1c403072f213c992-C001-20",
"5c13e64d468b8a1c403072f213c992-C001-21",
"5c13e64d468b8a1c403072f213c992-C001-22",
"5c13e64d468b8a1c403072f213c992-C001-23",
"5c13e64d468b8a1c403072f213c992-C001-24",
"5c13e64d468b8a1c403072f213c992-C001-25",
"5c13e64d468b8a1c403072f213c992-C001-26",
"5c13e64d468b8a1c403072f213c992-C001-27",
"5c13e64d468b8a1c403072f213c992-C001-28"
],
[
"5c13e64d468b8a1c403072f213c992-C001-31",
"5c13e64d468b8a1c403072f213c992-C001-32",
"5c13e64d468b8a1c403072f213c992-C001-34",
"5c13e64d468b8a1c403072f213c992-C001-35"
],
[
"5c13e64d468b8a1c403072f213c992-C001-36",
"5c13e64d468b8a1c403072f213c992-C001-37",
"5c13e64d468b8a1c403072f213c992-C001-38",
"5c13e64d468b8a1c403072f213c992-C001-39"
],
[
"5c13e64d468b8a1c403072f213c992-C001-50",
"5c13e64d468b8a1c403072f213c992-C001-51",
"5c13e64d468b8a1c403072f213c992-C001-52"
],
[
"5c13e64d468b8a1c403072f213c992-C001-53",
"5c13e64d468b8a1c403072f213c992-C001-58"
],
[
"5c13e64d468b8a1c403072f213c992-C001-61",
"5c13e64d468b8a1c403072f213c992-C001-62",
"5c13e64d468b8a1c403072f213c992-C001-63"
]
],
"cite_sentences": [
"5c13e64d468b8a1c403072f213c992-C001-4",
"5c13e64d468b8a1c403072f213c992-C001-10",
"5c13e64d468b8a1c403072f213c992-C001-15",
"5c13e64d468b8a1c403072f213c992-C001-20",
"5c13e64d468b8a1c403072f213c992-C001-21",
"5c13e64d468b8a1c403072f213c992-C001-34",
"5c13e64d468b8a1c403072f213c992-C001-37",
"5c13e64d468b8a1c403072f213c992-C001-50",
"5c13e64d468b8a1c403072f213c992-C001-51",
"5c13e64d468b8a1c403072f213c992-C001-58",
"5c13e64d468b8a1c403072f213c992-C001-62"
]
},
"@MOT@": {
"gold_contexts": [
[
"5c13e64d468b8a1c403072f213c992-C001-4",
"5c13e64d468b8a1c403072f213c992-C001-5"
],
[
"5c13e64d468b8a1c403072f213c992-C001-15",
"5c13e64d468b8a1c403072f213c992-C001-16"
],
[
"5c13e64d468b8a1c403072f213c992-C001-31",
"5c13e64d468b8a1c403072f213c992-C001-32",
"5c13e64d468b8a1c403072f213c992-C001-34",
"5c13e64d468b8a1c403072f213c992-C001-35"
],
[
"5c13e64d468b8a1c403072f213c992-C001-36",
"5c13e64d468b8a1c403072f213c992-C001-37",
"5c13e64d468b8a1c403072f213c992-C001-38",
"5c13e64d468b8a1c403072f213c992-C001-39"
],
[
"5c13e64d468b8a1c403072f213c992-C001-50",
"5c13e64d468b8a1c403072f213c992-C001-51",
"5c13e64d468b8a1c403072f213c992-C001-52"
],
[
"5c13e64d468b8a1c403072f213c992-C001-64",
"5c13e64d468b8a1c403072f213c992-C001-65",
"5c13e64d468b8a1c403072f213c992-C001-66"
]
],
"cite_sentences": [
"5c13e64d468b8a1c403072f213c992-C001-4",
"5c13e64d468b8a1c403072f213c992-C001-15",
"5c13e64d468b8a1c403072f213c992-C001-34",
"5c13e64d468b8a1c403072f213c992-C001-37",
"5c13e64d468b8a1c403072f213c992-C001-50",
"5c13e64d468b8a1c403072f213c992-C001-51",
"5c13e64d468b8a1c403072f213c992-C001-64"
]
},
"@DIF@": {
"gold_contexts": [
[
"5c13e64d468b8a1c403072f213c992-C001-36",
"5c13e64d468b8a1c403072f213c992-C001-37",
"5c13e64d468b8a1c403072f213c992-C001-38",
"5c13e64d468b8a1c403072f213c992-C001-39"
],
[
"5c13e64d468b8a1c403072f213c992-C001-64",
"5c13e64d468b8a1c403072f213c992-C001-65",
"5c13e64d468b8a1c403072f213c992-C001-66"
],
[
"5c13e64d468b8a1c403072f213c992-C001-100",
"5c13e64d468b8a1c403072f213c992-C001-101",
"5c13e64d468b8a1c403072f213c992-C001-102",
"5c13e64d468b8a1c403072f213c992-C001-99"
],
[
"5c13e64d468b8a1c403072f213c992-C001-185",
"5c13e64d468b8a1c403072f213c992-C001-186"
]
],
"cite_sentences": [
"5c13e64d468b8a1c403072f213c992-C001-37",
"5c13e64d468b8a1c403072f213c992-C001-64",
"5c13e64d468b8a1c403072f213c992-C001-102",
"5c13e64d468b8a1c403072f213c992-C001-185"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"5c13e64d468b8a1c403072f213c992-C001-78",
"5c13e64d468b8a1c403072f213c992-C001-81"
]
],
"cite_sentences": [
"5c13e64d468b8a1c403072f213c992-C001-81"
]
}
}
},
"ABC_518d8a8395e38d9971bd51344cf1b8_5": {
"x": [
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-2",
"text": "We present a generalized discriminative model for spelling error correction which targets character-level transformations."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-3",
"text": "While operating at the character level, the model makes use of wordlevel and contextual information."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-4",
"text": "In contrast to previous work, the proposed approach learns to correct a variety of error types without guidance of manually selected constraints or language-specific features."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-5",
"text": "We apply the model to correct errors in Egyptian Arabic dialect text, achieving 65% reduction in word error rate over the input baseline, and improving over the earlier state-of-the-art system."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-8",
"text": "Spelling error correction is a longstanding Natural Language Processing (NLP) problem, and it has recently become especially relevant because of the many potential applications to the large amount of informal and unedited text generated online, including web forums, tweets, blogs, and email."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-9",
"text": "Misspellings in such text can lead to increased sparsity and errors, posing a challenge for many NLP applications such as text summarization, sentiment analysis and machine translation."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-10",
"text": "In this work, we present GSEC, a Generalized character-level Spelling Error Correction model, which uses supervised learning to map input characters into output characters in context."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-11",
"text": "The approach has the following characteristics:"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-12",
"text": "Character-level Corrections are learned at the character level 1 using a supervised sequence labeling approach."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-13",
"text": "Generalized The input space consists of all characters, and a single classifier is used to learn common error patterns over all the training data, without guidance of specific rules."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-14",
"text": "Context-sensitive The model looks beyond the context of the current word, when making a decision at the character-level."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-15",
"text": "Discriminative The model provides the freedom of adding a number of different features, which may or may not be language-specific."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-16",
"text": "Language-Independent In this work, we integrate only language-independent features, and therefore do not consider morphological or linguistic features."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-17",
"text": "However, we apply the model to correct errors in Egyptian Arabic dialect text, following a conventional orthography standard, CODA (Habash et al., 2012) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-18",
"text": "Using the described approach, we demonstrate a word-error-rate (WER) reduction of 65% over a do-nothing input baseline, and we improve over a state-of-the-art system (Eskander et al., 2013) which relies heavily on language-specific and manually-selected constraints."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-19",
"text": "We present a detailed analysis of mistakes and demonstrate that the proposed model indeed learns to correct a wider variety of errors."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-21",
"text": "**RELATED WORK**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-22",
"text": "Most earlier work on automatic error correction addressed spelling errors in English and built models of correct usage on native English data (Kukich, 1992; Golding and Roth, 1999; Carlson and Fette, 2007; Banko and Brill, 2001 )."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-23",
"text": "Arabic spelling correction has also received considerable interest (Ben Othmane Zribi and Ben Ahmed, 2003; Haddad and Yaseen, 2007; Hassan et al., 2008; Shaalan et al., 2010; Alkanhal et al., 2012; Eskander et al., 2013; Zaghouani et al., 2014) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-24",
"text": "Supervised spelling correction approaches trained on paired examples of errors and their corrections have recently been applied for non-native English correction (van Delden et al., 2004; Gamon, 2010; Dahlmeier and Ng, 2012; Rozovskaya and Roth, 2011) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-25",
"text": "Discriminative models have been proposed at the word-level for error correction and for error detection (Habash and Roth, 2011) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-26",
"text": "In addition, there has been growing work on lexical normalization of social media data, a somewhat related problem to that considered in this paper (Han and Baldwin, 2011; Han et al., 2013; Subramaniam et al., 2009; Ling et al., 2013) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-27",
"text": "The work of Eskander et al. (2013) is the most relevant to the present study: it presents a character-edit classification model (CEC) using the same dataset we use in this paper."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-28",
"text": "2 Eskander et al. (2013) analyzed the data to identify the seven most common types of errors."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-29",
"text": "They developed seven classifiers and applied them to the data in succession."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-30",
"text": "This makes the approach tailored to the specific data set in use and limited to a specific set of errors."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-31",
"text": "In this work, a single model is considered for all types of errors."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-32",
"text": "The model considers every character in the input text for a possible spelling error, as opposed to looking only at certain input characters and contexts in which they appear."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-33",
"text": "Moreover, in contrast to Eskander et al. (2013) , it looks beyond the boundary of the current word."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-34",
"text": "**THE GSEC APPROACH**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-36",
"text": "**MODELING SPELLING CORRECTION AT THE CHARACTER LEVEL**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-37",
"text": "We recast the problem of spelling correction into a sequence labeling problem, where for each input character, we predict an action label describing how to transform it to obtain the correct character."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-38",
"text": "The proposed model therefore transforms a given input sentence e = e 1 , . . . , e n of n characters that possibly include errors, to a corrected sentence c of m characters, where corrected characters are produced by one of the following four actions applied to each input character e i :"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-39",
"text": "\u2022 ok: e i is passed without transformation."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-40",
"text": "\u2022 substitute-with(c): e i is substituted with a character c, where c could be any character encountered in the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-41",
"text": "\u2022 delete: e i is deleted."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-42",
"text": "\u2022 insert(c): A character c is inserted before e i ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-43",
"text": "To address errors occurring at the end [...] 2 Eskander et al. (2013) also considered a slower, more expensive, and more language-specific method using a morphological tagger that outperformed the CEC model; however, we do not compare to it in this paper."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-44",
"text": "We use a multi-class SVM classifier to predict the action labels for each input character e i \u2208 e."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-45",
"text": "A decoding process is then applied to transform the input characters accordingly to produce the corrected sentence."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-46",
"text": "Note that we consider the space character as a character like any other, which gives us the ability to correct word merge errors with space character insertion actions and word split errors with space character deletion actions."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-47",
"text": "Table 1 shows an example of the spelling correction process."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-48",
"text": "In this paper, we only model single-edit actions and ignore cases where a character requires multiple edits (henceforth, complex actions), such as multiple insertions or a combination of insertions and substitutions."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-49",
"text": "This choice was motivated by the need to reduce the number of output labels, as many infrequent labels are generated by complex actions."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-50",
"text": "An error analysis of the training data, described in detail in section 3.2, showed that complex errors are relatively infrequent (4% of data)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-51",
"text": "We plan to address these errors in future work."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-52",
"text": "Finally, in order to generate the training data in the described form, we require a parallel corpus of erroneous and corrected reference text (described below), which we align at the character level."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-53",
"text": "We use the alignment tool Sclite (Fiscus, 1998) , which is part of the SCTK Toolkit."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-55",
"text": "**DESCRIPTION OF DATA**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-56",
"text": "We apply our model to correcting Egyptian Arabic dialect text."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-57",
"text": "Table 3 : Character-level distribution of correction labels."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-58",
"text": "We model all types of transformations except complex actions, and rare Insert labels with counts below a tuned threshold."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-59",
"text": "The Delete label is a single label that comprises all deletion actions."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-60",
"text": "Labels modeled by Eskander et al. (2013) are marked with E , and EP for cases modeled partially, for example, the Insert{A} would only be applied at certain positions such as the end of the word."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-61",
"text": "Since there is no standard dialect orthography adopted by native speakers of Arabic dialects, it is common to encounter multiple spellings of the same word."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-62",
"text": "The CODA orthography was proposed by Habash et al. (2012) in an attempt to standardize dialectal writing, and we use it as a reference of correct text for spelling correction following the previous work by Eskander et al. (2013) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-63",
"text": "We use the same corpus (labeled \"ARZ\") and experimental setup splits used by them."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-64",
"text": "The ARZ corpus was developed by the Linguistic Data Consortium (Maamouri et al., 2012a-e) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-65",
"text": "See Table 2 for corpus statistics."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-66",
"text": "Table 3 presents the distribution of correction action labels that correspond to spelling errors in the training data together with examples of these errors."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-67",
"text": "3 We group the actions into: Substitute, Insert, Delete, and Complex, and also list common transformations within each group."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-68",
"text": "We further distinguish between the phenomena modeled by our system and by Eskander et al. (2013) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-69",
"text": "At least 10% of all generated action labels are not handled by Eskander et al. (2013) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-71",
"text": "**ERROR DISTRIBUTION**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-73",
"text": "**FEATURES**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-74",
"text": "Each input character is represented by a feature vector."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-75",
"text": "We include a set of basic features inspired by the CEC system of Eskander et al. (2013) , and additional features for further improvement."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-76",
"text": "Basic features We use a set of nine basic features: the given character, the preceding and following two characters, and the first two and last two characters in the word."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-77",
"text": "These are the same features used by CEC, except that CEC does not include characters beyond the word boundary, while we consider space characters as well as characters from the previous and next words."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-79",
"text": "**NGRAM FEATURES**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-80",
"text": "We extract sequences of characters corresponding to the current character and the following and previous two, three, or four characters."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-81",
"text": "We refer to these sequences as bigrams, trigrams, or 4-grams, respectively."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-82",
"text": "These are an extension of the basic features and allow the model to look beyond the context of the current word."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-84",
"text": "**MAXIMUM LIKELIHOOD ESTIMATE (MLE)**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-85",
"text": "We implemented another approach for error correction based on a word-level maximum likelihood model."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-86",
"text": "The MLE method uses a unigram model which replaces each input word with its most likely correct word based on counts from the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-87",
"text": "The intuition behind MLE is that it can easily correct frequent errors; however, it is quite dependent on the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-89",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-91",
"text": "**MODEL EVALUATION**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-92",
"text": "Setup The training data was extracted to generate the form described in Section 3.1, using the Sclite tool (Fiscus, 1998) to align the input and reference sentences."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-93",
"text": "A speech effect handling step was applied as a preprocessing step to all models."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-94",
"text": "This step removes redundant repetitions of characters in sequence, e.g., ktyyyyyr 'veeeeery'."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-95",
"text": "The same speech effect handling was applied by Eskander et al. (2013) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-96",
"text": "For classification, we used the SVM implementation in YamCha (Kudo and Matsumoto, 2001) , and trained with different variations of the features described above."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-97",
"text": "Default parameters were selected for training (c=1, quadratic kernel, and context window of +/-2)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-98",
"text": "In all results listed below, the baseline corresponds to the do-nothing baseline of the input text."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-99",
"text": "Metrics Three evaluation metrics are used."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-100",
"text": "The word-error-rate WER metric is computed by summing the total number of word-level substitution errors, insertion errors, and deletion errors in the output, and dividing by the number of words in the reference."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-101",
"text": "The correct-rate Corr metric is computed by dividing the number of correct output words by the total number of words in the reference."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-102",
"text": "These two metrics are produced by Sclite (Fiscus, 1998) , using automatic alignment."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-103",
"text": "Finally, the accuracy Acc metric, used by Eskander et al. (2013) , is a simple string matching metric which enforces a word alignment that pairs words in the reference to those of the output."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-104",
"text": "It is calculated by dividing the number of correct output words by the number of words in the input."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-105",
"text": "This metric assumes no split errors in the data (a word incorrectly split into two words), which is the case in the data we are working with."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-107",
"text": "**CHARACTER-LEVEL MODEL EVALUATION**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-108",
"text": "The performance of the generalized spelling correction model (GSEC) on the dev data is presented in the first half of Table 4 ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-109",
"text": "The results of the Eskander et al. (2013) CEC system are also presented for the purpose of comparison."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-110",
"text": "We can see that using a single classifier, the generalized model is able to outperform CEC, which relies on a cascade of classifiers (p = 0.03 for the basic model and p < 0.0001 for the best model, GSEC+4grams)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-111",
"text": "4 Model Combination Evaluation Here we present results on combining GSEC with the MLE component (GSEC+MLE)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-112",
"text": "We combine the two models in cascade: the MLE component is applied to the output of GSEC."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-113",
"text": "To train the MLE model, we use the word pairs obtained from the original training data, rather than from the output of GSEC."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-114",
"text": "We found that this configuration allows 4 Significance results are obtained using McNemar's test."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-115",
"text": "us to include a larger sample of word pair errors for learning, because our model corrects many errors, leaving fewer example pairs to train an MLE post-processor."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-116",
"text": "The results are shown in the second half of Table 4 ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-117",
"text": "We first observe that MLE improves the performance of both CEC and GSEC."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-118",
"text": "In fact, CEC+MLE and GSEC+MLE perform similarly (p = 0.36, not statistically significant)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-119",
"text": "When adding features that go beyond the word boundary, we achieve an improvement over MLE, GSEC+MLE, and CEC+MLE, all of which are mostly restricted within the boundary of the word."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-120",
"text": "The best GSEC model outperforms CEC+MLE (p < 0.0001), achieving a WER of 8.3%, corresponding to 65% reduction compared to the baseline."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-121",
"text": "It is worth noting that adding the MLE component allows Eskander's CEC to recover various types of errors that were not modeled previously."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-122",
"text": "However, the contribution of MLE is limited to words that are in the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-123",
"text": "On the other hand, because GSEC is trained on character transformations, it is likely to generalize better to words unseen in the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-124",
"text": "Table 5 presents the results of our best model (GSEC+4grams), and best model+MLE."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-125",
"text": "The latter achieves a 92.1% Acc score."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-126",
"text": "The Acc score reported by Eskander et al. (2013) for CEC+MLE is 91.3% ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-127",
"text": "The two results are statistically significant (p < 0.0001) with respect to CEC and CEC+MLE respectively."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-129",
"text": "**RESULTS ON TEST DATA**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-131",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-132",
"text": "To gain a better understanding of the performance of the models on different types of errors and their interaction with the MLE component, we separate the words in the dev data into: (1) words seen in the training data, or in-vocabulary words (IV), and (2) out-of-vocabulary (OOV) words not seen in the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-133",
"text": "Because the MLE model maps every input word to its most likely gold word seen in the training data, we expect the MLE component to recover a large portion of errors in the IV category (but not all, since an input word can have multiple correct readings depending on the context)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-134",
"text": "On the other hand, the recovery of errors in OOV words indicates how well the character-level model is doing independently of the MLE component."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-135",
"text": "Table 6 presents the performance, using the Acc metric, on each of these types of words."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-136",
"text": "Here our best model (GSEC+4grams) is considered."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-137",
"text": "When considering words seen in the training data, CEC and GSEC have the same performance."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-138",
"text": "However, when considering OOV words, GSEC performs significantly better (p < 0.0001), verifying our hypothesis that a generalized model reduces dependency on training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-139",
"text": "The data is heavily skewed towards IV words (83%), which explains the generally high performance of MLE."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-140",
"text": "We performed a manual error analysis on a sample of 50 word errors from the IV set and found that all of the errors came from gold annotation errors and inconsistencies, either in the dev or train."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-141",
"text": "We then divided the character transformations in the OOV words into four groups: (1) characters that were unchanged by the gold (X-X transformations), (2) character transformations modeled by CEC (X-Y CEC), (3) character transformations not modeled by CEC, and which include all phenomena that were only partially modeled by CEC (X-Y not CEC), and (4) complex errors."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-142",
"text": "The characterlevel accuracy on each of these groups is shown in Table 7 ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-143",
"text": "Both CEC and GSEC do much better on the second group of character transformations (that is, X-Y CEC) than on the third group (X-Y not CEC)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-144",
"text": "This is not surprising because the former transformations correspond to phenomena that are most common in the training data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-145",
"text": "For GSEC, they are learned automatically, while for CEC they are selected and modeled explicitly."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-146",
"text": "Despite this fact, GSEC generalizes better to OOV words."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-147",
"text": "As for the third group, both CEC and GSEC perform more poorly, but GSEC corrects more errors (43.48% vs. 31.68% accuracy)."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-148",
"text": "Finally, CEC is better at recognizing complex errors, which, although are not modeled explicitly by CEC, can sometimes be corrected as a result of applying multiple classifiers in cascade."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-149",
"text": "Dealing with complex errors, though there are few of them in this dataset, is an important direction for future work, and for generalizing to other datasets, e.g., (Zaghouani et al., 2014) ."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-151",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-152",
"text": "We showed that a generalized character-level spelling error correction model can improve spelling error correction on Egyptian Arabic data."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-153",
"text": "This model learns common spelling error patterns automatically, without guidance of manually selected or language-specific constraints."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-154",
"text": "We also demonstrate that the model outperforms existing methods, especially on out-of-vocabulary words."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-155",
"text": "In the future, we plan to extend the model to use word-level language models to select between top character predictions in the output."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-156",
"text": "We also plan to apply the model to different datasets and different languages."
},
{
"sent_id": "518d8a8395e38d9971bd51344cf1b8-C001-157",
"text": "Finally, we plan to experiment with more features that can also be tailored to specific languages by using morphological and linguistic information, which was not explored in this paper."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"518d8a8395e38d9971bd51344cf1b8-C001-18",
"518d8a8395e38d9971bd51344cf1b8-C001-19"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-27"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-43"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-75",
"518d8a8395e38d9971bd51344cf1b8-C001-76",
"518d8a8395e38d9971bd51344cf1b8-C001-77"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-103",
"518d8a8395e38d9971bd51344cf1b8-C001-104",
"518d8a8395e38d9971bd51344cf1b8-C001-105"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-126"
]
],
"cite_sentences": [
"518d8a8395e38d9971bd51344cf1b8-C001-18",
"518d8a8395e38d9971bd51344cf1b8-C001-27",
"518d8a8395e38d9971bd51344cf1b8-C001-43",
"518d8a8395e38d9971bd51344cf1b8-C001-75",
"518d8a8395e38d9971bd51344cf1b8-C001-103",
"518d8a8395e38d9971bd51344cf1b8-C001-126"
]
},
"@MOT@": {
"gold_contexts": [
[
"518d8a8395e38d9971bd51344cf1b8-C001-18",
"518d8a8395e38d9971bd51344cf1b8-C001-19"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-75",
"518d8a8395e38d9971bd51344cf1b8-C001-76",
"518d8a8395e38d9971bd51344cf1b8-C001-77"
]
],
"cite_sentences": [
"518d8a8395e38d9971bd51344cf1b8-C001-18",
"518d8a8395e38d9971bd51344cf1b8-C001-75"
]
},
"@DIF@": {
"gold_contexts": [
[
"518d8a8395e38d9971bd51344cf1b8-C001-18",
"518d8a8395e38d9971bd51344cf1b8-C001-19"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-31",
"518d8a8395e38d9971bd51344cf1b8-C001-32",
"518d8a8395e38d9971bd51344cf1b8-C001-33"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-43"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-68",
"518d8a8395e38d9971bd51344cf1b8-C001-69"
]
],
"cite_sentences": [
"518d8a8395e38d9971bd51344cf1b8-C001-18",
"518d8a8395e38d9971bd51344cf1b8-C001-33",
"518d8a8395e38d9971bd51344cf1b8-C001-43",
"518d8a8395e38d9971bd51344cf1b8-C001-68",
"518d8a8395e38d9971bd51344cf1b8-C001-69"
]
},
"@SIM@": {
"gold_contexts": [
[
"518d8a8395e38d9971bd51344cf1b8-C001-27"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-28",
"518d8a8395e38d9971bd51344cf1b8-C001-29",
"518d8a8395e38d9971bd51344cf1b8-C001-30"
]
],
"cite_sentences": [
"518d8a8395e38d9971bd51344cf1b8-C001-27",
"518d8a8395e38d9971bd51344cf1b8-C001-28"
]
},
"@USE@": {
"gold_contexts": [
[
"518d8a8395e38d9971bd51344cf1b8-C001-28",
"518d8a8395e38d9971bd51344cf1b8-C001-29",
"518d8a8395e38d9971bd51344cf1b8-C001-30"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-60"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-62",
"518d8a8395e38d9971bd51344cf1b8-C001-63"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-75",
"518d8a8395e38d9971bd51344cf1b8-C001-76",
"518d8a8395e38d9971bd51344cf1b8-C001-77"
],
[
"518d8a8395e38d9971bd51344cf1b8-C001-93",
"518d8a8395e38d9971bd51344cf1b8-C001-94",
"518d8a8395e38d9971bd51344cf1b8-C001-95"
]
],
"cite_sentences": [
"518d8a8395e38d9971bd51344cf1b8-C001-28",
"518d8a8395e38d9971bd51344cf1b8-C001-60",
"518d8a8395e38d9971bd51344cf1b8-C001-62",
"518d8a8395e38d9971bd51344cf1b8-C001-75",
"518d8a8395e38d9971bd51344cf1b8-C001-95"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"518d8a8395e38d9971bd51344cf1b8-C001-109",
"518d8a8395e38d9971bd51344cf1b8-C001-110"
]
],
"cite_sentences": [
"518d8a8395e38d9971bd51344cf1b8-C001-109"
]
}
}
},
"ABC_f2ff155003d139b3677f746baf3807_5": {
"x": [
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-2",
"text": "Automated ICD coding, which assigns the International Classification of Disease codes to patient visits, has attracted much research attention since it can save time and labor for billing."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-3",
"text": "The previous state-of-the-art model utilized one convolutional layer to build document representations for predicting ICD codes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-4",
"text": "However, the lengths and grammar of text fragments, which are closely related to ICD coding, vary a lot in different documents."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-5",
"text": "Therefore, a flat and fixed-length convolutional architecture may not be capable of learning good document representations."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-6",
"text": "In this paper, we proposed a Multi-Filter Residual Convolutional Neural Network (Mul-tiResCNN) for ICD coding."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-7",
"text": "The innovations of our model are two-folds: it utilizes a multi-filter convolutional layer to capture various text patterns with different lengths and a residual convolutional layer to enlarge the receptive field."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-8",
"text": "We evaluated the effectiveness of our model on the widely-used MIMIC dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-9",
"text": "On the full code set of MIMIC-III, our model outperformed the state-of-the-art model in 4 out of 6 evaluation metrics."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-10",
"text": "On the top-50 code set of MIMIC-III and the full code set of MIMIC-II, our model outperformed all the existing and state-of-the-art models in all evaluation metrics."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-11",
"text": "The code is available at https://github.com/foxlf823/Multi-Filter-Residual-Convolutional-Neural-Network."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-12",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-13",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-14",
"text": "The International Classification of Diseases (ICD), which is organized by the World Health Organization, is a common coding method used in various healthcare systems such as hospitals."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-15",
"text": "It includes many pre-defined ICD codes which can be assigned to patients' files such as electronic health records (EHRs)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-16",
"text": "These codes represent diagnostic and procedural information during patient visits."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-17",
"text": "Healthcare providers and insurance companies need these information to diagnose patients and bill for services (Bottle and Aylin 2008) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-18",
"text": "However, manual ICD coding has been demonstrated to be labor-consuming and costly (O'malley et al. 2005) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-19",
"text": "The research community has investigated a number of approaches for automated ICD coding, including the models based on both traditional machine learning (Perotte et al. 2013; Kavuluru, Rios, and Lu 2015) and deep learning (Shi Copyright c 2020 , Association for the Advancement of Artificial Intelligence (www.aaai.org)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-20",
"text": "All rights reserved."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-21",
"text": "Xie and Xing 2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-22",
"text": "In terms of data, prior work utilized different domains of data such as radiology reports (Pestian et al. 2007 ) and death certificates (Koopman et al. 2015) , and different modal data such as structured (Perotte et al. 2013) and unstructured text (Scheurwegs et al. 2017) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-23",
"text": "Moreover, some previous work adopted full ICD codes to perform this task (Baumel et al. 2018 ) while other work adopted partial codes (Xu et al. 2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-24",
"text": "Due to such situation, it is difficult to directly compare different work."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-25",
"text": "In this paper, we followed the line of predicting ICD codes from unstructured text of the MIMIC dataset (Johnson et al. 2016 ), because it is widely studied and publicly available."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-26",
"text": "The state-of-the-art model for this line of work is the combination of the convolutional neural network (CNN) and the attention mechanism (Mullenbach et al. 2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-27",
"text": "However, this model only contains one convolutional layer to build document representations for subsequent layers to predict ICD codes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-28",
"text": "As shown in Table 1 , ICD-related text spans and patterns vary in different examples."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-29",
"text": "Therefore, it may not be sufficient to learn decent document representations from a flat and fixed-length convolutional architecture."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-30",
"text": "In this paper, we proposed a Multi-Filter Residual Convolutional Neural Network (MultiResCNN) for ICD coding using clinical discharge summaries."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-31",
"text": "Our Mul-arXiv:1912.00862v1 [cs.CL] 25 Nov 2019 tiResCNN model is composed of five layers: the input layer leverages word embeddings pre-trained by word2vec (Mikolov et al. 2013) ; the multi-filter convolutional layer consists of multiple convolutional filters (Kim 2014); the residual convolutional layer contains multiple residual blocks (He et al. 2016) ; the attention layer keeps the interpretability for the model following (Mullenbach et al. 2018) ; the output layer utilizes the sigmoid function to predict the probability of each ICD code."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-32",
"text": "Our main contribution is that we proposed a novel CNN architecture that combines the multi-filter CNN (Kim 2014) and residual CNN (He et al. 2016) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-33",
"text": "The advantages are two-folds: MultiResCNN not only captures various text patterns with different lengths via the multi-filter CNN, but also enlarges the receptive field 1 (Garcia and Delakis 2004) via the residual CNN."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-34",
"text": "Thus, our model can benefit from rich patterns, the large receptive field and deep architecture."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-35",
"text": "Such method has achieved great success in natural language processing (Vaswani et al. 2017 ) and computer vision (Krizhevsky, Sutskever, and Hinton 2012) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-36",
"text": "To evaluate our model, we employed the MIMIC dataset (Johnson et al. 2016 ) which has been widely used for automated ICD coding."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-37",
"text": "Compared with 5 existing and stateof-the-art models (Perotte et al. 2013; Prakash et al. 2017; Shi et al. 2017; Baumel et al. 2018; Mullenbach et al. 2018) , our model outperformed them in nearly all the evaluation metrics (i.e., macro-and micro-AUC, macro-and micro-F1, precision at K)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-38",
"text": "Concretely, in the MIMIC-III experiment using full codes, our model outperformed these models in macro-AUC, micro-F1 and precision at 8 and 15."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-39",
"text": "In the MIMIC-III experiment using top-50 codes and the MIMIC-II experiment using full codes, our model outperformed these models in all evaluation metrics."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-40",
"text": "Moreover, hyper-parameter tuning experiments show that the multifilter and residual convolutional layers help our model to improve its performance significantly."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-42",
"text": "**RELATED WORK**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-43",
"text": "To the best of our knowledge, the earliest work of automated ICD coding was proposed by Larkey and Croft (1996) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-44",
"text": "They combined three classifiers, K-nearest-neighbor, relevance feedback and Bayesian independence, to assign ICD9 codes to inpatient discharge summaries."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-45",
"text": "However, their method only assigns one code to each discharge summary."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-46",
"text": "Pestian et al. (2007) organized a shared task of assigning ICD-9 codes to radiology reports and their task requires models to assign a large set of codes to each report."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-47",
"text": "Early work usually used supervised machine learning approaches for ICD coding."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-48",
"text": "Perotte et al. (2013) leveraged \"flat\" and \"hierarchical\" Support Vector Machines (SVMs) for automatically assigning ICD9 codes to the discharge summaries of the MIMIC-II repository (Johnson et al. 2016) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-49",
"text": "Their results show that the hierarchical SVM performs better than the flat one."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-50",
"text": "Kavuluru et al. (2015) used the unstructured text in 71,463 EMRs, which come from the University of Kentucky Medical Center, to evaluate supervised learning approaches such as multi-label clas-sification and learning to rank for the ICD9 code assignment."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-51",
"text": "Koopman et al. (2015) employed the SVM to identify cancer-related causes of death from 447,336 death certificates."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-52",
"text": "Their model is cascaded: the first one identified the presence of cancer and the second identified the type of cancer according to the ICD-10 classification system."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-53",
"text": "Scheurwegs et al. (2017) evaluated coverage-based feature selection methods and Random Forests on seven medical specialties for ICD9 code prediction and two for ICD10, incorporating structured and unstructured text."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-54",
"text": "With the development of deep learning, researchers also explored neural networks for this task."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-55",
"text": "Shi et al. (2017) utilized the long short-term memory (LSTM) and attention mechanism for automated ICD coding from diagnosis descriptions."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-56",
"text": "Xie and Xing (2018) also adopted the LSTM but they introduced the tree structure and adversarial learning to utilize code descriptions."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-57",
"text": "Prakash et al. (2017) exploited condensed memory neural networks and evaluated it on the free-text medical notes of the MIMIC-III dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-58",
"text": "Baumel et al. (2018) proposed a hierarchical gated recurrent unit (GRU) network, which encodes sentences and documents with two stacked layers, to assign multiple ICD codes to discharge summaries of the MIMIC II and III datasets."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-59",
"text": "Mullenbach et al. (2018) incorporated the convolutional neural network (CNN) with per-label attention mechanism."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-60",
"text": "Their model achieved the state-of-the-art performance among the work using only unstructured text of the MIMIC dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-61",
"text": "Xu et al. (2018) built a hybrid system that includes the CNN, LSTM and decision tree to predict ICD codes from unstructured, semi-structured and structured tabular data."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-62",
"text": "In addition, Lipton et al. (2015) utilized LSTMs to predict diagnostic codes from time series of clinical measurements, while our work focuses on text data."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-64",
"text": "**METHOD**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-65",
"text": "In this section, we will introduce our Multi-filter Residual Convolutional Neural Network (MultiResCNN), whose architecture is shown in Figure 1 ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-66",
"text": "Throughout this paper, we employed the following notation rules: matrices are written as italic uppercase letters (e.g., X); vectors and scalars are written as italic lowercase letters (e.g., x)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-68",
"text": "**INPUT LAYER**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-69",
"text": "Our model leverages a word sequence w = {w 1 , w 2 , ..., w n } as input, where n denotes the sequence length."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-70",
"text": "Assuming that\u1ebc denotes the word embedding matrix, which is pretrained via word2vec (Mikolov et al. 2013 ) from the raw text of the dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-71",
"text": "A word w n will correspond to a vector e n by looking up\u1ebc. Therefore, the input will be a matrix E = {e 1 , e 2 , ..., e n } \u2208 R n\u00d7d e ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-73",
"text": "**MULTI-FILTER CONVOLUTIONAL LAYER**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-74",
"text": "To capture the patterns with different lengths, we leveraged the multi-filter convolutional neural network (Kim 2014), where each filter has a different kernel size (i.e., word window size)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-75",
"text": "Assuming we have m filters f 1 , f 2 , ..., f m and their kernel sizes denote as k 1 , k 2 , ..., k m ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-76",
"text": "Therefore, m 1- Figure 1 : The architecture of our MultiResCNN model. \"Conv1d\" represents the 1-dimensional convolution, \"Res-Block\" represents the residual block, \"\u2295\" represents the concatenation operation and \"\u2297\" represents the matrix multiplication."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-77",
"text": "Here we use orange and green for U and W to denote they are learnable parameters, and to distinguish with other matrices (e.g., H) which are not parameters."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-78",
"text": "dimensional convolutions can be applied to the input matrix E. The convolutional procedure can be formalized as:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-79",
"text": "where n j=1 indicates the convolutional operations from left to right."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-80",
"text": "Here we forced the row number n of the output H 1 or H m \u2208 R n\u00d7d f to be the same as that of the input E, because we aimed to keep the sequence length unchanged after convolution."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-81",
"text": "It is simple to implement such goal, e.g., setting the kernel size, padding and stride as k, f loor(k/2) and 1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-82",
"text": "d f indicates the out-channel size of a filter and every filter has the same output size."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-83",
"text": "Moreover, E j:j+k1\u22121 \u2208 R k1\u00d7d e and E j:j+km\u22121 \u2208 R km\u00d7d e indicate the sub-matrices of E, starting from the j-th row and ending at the j + k 1 \u2212 1 or j + k m \u2212 1 row."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-84",
"text": "and W m \u2208 R (km\u00d7d e )\u00d7d f indicate the weight matrices of corresponding filters."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-85",
"text": "Throughout this paper, the biases of all layers are ignored for conciseness."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-86",
"text": "The overview of a 1-dimensional convolution filter f m is shown in Figure 2 ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-87",
"text": "Figure 2: The architecture of a 1-dimensional convolution filter f m ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-88",
"text": "\"\u2295\" represents the concatenation operation and \"\u2297\" represents the matrix multiplication."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-90",
"text": "**RESIDUAL CONVOLUTIONAL LAYER**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-91",
"text": "On top of each filter in the multi-filter convolutional layer, there is a residual convolutional layer which consists of p residual blocks (He et al. 2016) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-92",
"text": "Take the m-th filter as an example, the computational procedure of its corresponding residual blocks r m1 , r m2 , ..., r mp can be formalized as:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-93",
"text": "For the residual block r mi (Figure 3 ), it consists of three convolutional filters, namely r mi1 , r mi2 and r mi3 ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-94",
"text": "The computational procedure can be denoted as:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-95",
"text": "where n j=1"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-96",
"text": "indicates the convolutional operations."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-97",
"text": "X denotes the input matrix of this residual block and X j:j+km\u22121 \u2208 R km\u00d7d i\u22121 indicate the sub-matrices of X, starting from the j-th row and ending at the j + k m \u2212 1 row."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-98",
"text": "H mi \u2208 R n\u00d7d i denotes the output matrix of the residual block."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-99",
"text": "d i\u22121 and d i denote the in-channel and out-channel sizes of this residual block."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-100",
"text": "Therefore, the in-channel size of the first residual block r m1 should be d f and the out-channel size of the last residual block r mp is defined as d p ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-101",
"text": "Similar with the multifilter convolutional layer, we let the row numbers of H mi as well as X 1 , X 2 and X 3 \u2208 R n\u00d7d i be n, which is identical to that of the input X."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-102",
"text": "Moreover"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-103",
"text": "denote the weight matrices of the three convolutional filters, r mi1 , r mi2 and r mi3 ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-104",
"text": "Thereinto, r mi1 and r mi2 have the same kernel size k m with the corresponding filter f m in the multi-filter convolutional layer, but they have different in-channel sizes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-105",
"text": "r mi3 is a special convolutional filter whose kernel size is 1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-106",
"text": "Because the m-th filter f m in the multi-filter convolutional layer corresponds to p residual blocks r m1 , r m2 , ..., r mp in the residual convolutional layer, we employed the output H mp \u2208 R n\u00d7d p of the p-th residual block r mp as the output of these residual blocks."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-107",
"text": "Since there are totally m filters in the multi-filter convolutional layer, the final output of the residual convolutional layer is a concatenation of the output of m residual blocks, namely"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-109",
"text": "**ATTENTION LAYER**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-110",
"text": "Following Mullenbach et al. (2018) , we employed the perlabel attention mechanism to make each ICD code attend to different parts of the document representation H. The attention layer is formalized as:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-111",
"text": "where U \u2208 R (m\u00d7d p )\u00d7l represents the parameter matrix of the attention layer, A \u2208 R n\u00d7l represents the attention weights for each pair of an ICD code and a word, V \u2208 R l\u00d7(m\u00d7d p ) represents the output of the attention layer."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-112",
"text": "Here l denotes the number of ICD codes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-114",
"text": "**OUTPUT LAYER**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-115",
"text": "In the output layer, V is first fed into a linear layer followed by the sum-pooling operation to obtain the score vector\u0177 for all ICD codes, and then the probability vector\u1ef9 is calculated from\u0177 by the sigmoid function."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-116",
"text": "This process can be formalized as:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-117",
"text": "where W \u2208 R (m\u00d7d p )\u00d7l is the weight matrix of the output layer."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-118",
"text": "For training, we treated the ICD coding task as a multi-label classification problem following previous work (McCallum 1999; Mullenbach et al. 2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-119",
"text": "The training objective is to minimize the binary cross entropy loss between the prediction\u1ef9 and the target y:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-120",
"text": "where w denotes the input word sequence and \u03b8 denotes all the parameters."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-121",
"text": "We utilized the back-propagation algorithm and Adam optimizer (Kingma and Ba 2014) to train our model."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-123",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-124",
"text": "Datasets MIMIC-III In this paper, we employed the third version of Medical Information Mart for Intensive Care (MIMIC-III) (Johnson et al. 2016 ) as the first dataset to evaluate our models."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-125",
"text": "Following Mullenbach et al. (2018) , we used discharge summaries, split them by patient IDs, and conducted experiments using the full codes as well as the top-50 most frequent codes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-126",
"text": "Finally, the MIMIC-III dataset using 8,921 ICD-9 codes consists of 47,719, 1,631 and 3,372 discharge summaries for training, development and testing respectively."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-127",
"text": "The dataset using top-50 codes has 8,067 discharge summaries for training, 1,574 for development, and 1,730 for testing."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-129",
"text": "**MIMIC-II**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-130",
"text": "Besides the MIMIC-III dataset, we also leveraged the MIMIC-II dataset to compare our models with the ones in previous work (Perotte et al. 2013; Mullenbach et al. 2018; Baumel et al. 2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-131",
"text": "Follow their experimental setting, there are 20,533 and 2,282 clinical notes for training and testing, and 5,031 unique ICD-9 codes in the dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-132",
"text": "Preprocessing Following previous work (Mullenbach et al. 2018) , the text was tokenized, and each token were transformed into its lowercase."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-133",
"text": "The tokens that contain no alphabetic characters were removed such as numbers and punctuations."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-134",
"text": "The maximum length of a token sequence is 2,500 and the one that exceeds this length will be truncated."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-135",
"text": "We"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-137",
"text": "**EVALUATION METRICS**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-138",
"text": "To compare with previous work, we utilized different evaluation metrics in different experiments."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-139",
"text": "In the MIMIC-III experiment using full ICD codes, we utilized macro-averaged and micro-averaged AUC (area under the ROC, i.e., receiver operating characteristic curve), macro-averaged and micro-averaged F1, precision at 8 (P@8) and precision at 15 (P@15)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-140",
"text": "When computing macro-averaged AUC or F1, we first computed the performance for each label and then averaged them."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-141",
"text": "When computing micro-averaged AUC or F1, we considered every pair of a clinical note and a code as an independent prediction."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-142",
"text": "The precision at K (P@K) indicates the proportion of the correctly-predicted labels in the top-K predicted labels."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-143",
"text": "In the MIMIC-III experiment using the top-50 ICD codes, we employed the P@5 besides macro-averaged and microaveraged AUC, macro-averaged and micro-averaged F1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-144",
"text": "In the MIMIC-II experiment using full codes, we employed the same evaluation metrics except that P@5 was changed to P@8."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-146",
"text": "**HYPER-PARAMETER TUNING**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-147",
"text": "Since our model has a number of hyper-parameters, it is infeasible to search optimal values for all hyper-parameters."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-148",
"text": "Therefore, some hyper-parameter values were chosen empirically or following prior work (Mullenbach et al. 2018 )."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-149",
"text": "The word embedding size d e is 100, the out-channel size d f of a filter in the multi-filter convolutional layer is 100, the learning rate is 0.0001, the batch size is 16 and the dropout rate is 0.2."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-150",
"text": "To explore a better configuration for the filter number m and the kernel sizes k 1 , k 2 , ..., k m in the multi-filter convolutional layer, and the residual block number p in the residual convolutional layer, we conducted the following experiments."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-151",
"text": "First, we developed three variations:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-152",
"text": "\u2022 CNN, which only has one convolutional filter and is equivalent to the CAML model (Mullenbach et al. 2018 )."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-153",
"text": "2 https://github.com/jamesmullenbach/caml-mimic"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-154",
"text": "\u2022 MultiCNN, which only has the multi-filter convolutional layer."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-155",
"text": "\u2022 ResCNN, which only has the residual convolutional layer."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-156",
"text": "Then we tried several configurations for these models on the development set of MIMIC-III using the full and top-50 code settings."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-157",
"text": "The experimental results are shown in Table 2 ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-158",
"text": "For each configuration, we tried three runs by initializing the model parameters randomly."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-159",
"text": "The results shown in the table are the means of three runs."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-160",
"text": "We selected such kernel sizes since they do not only capture various text patterns from different granularities, but also keeps the sequence length unchanged after convolution (e.g., setting the padding and stride sizes as floor(k/2) and 1)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-161",
"text": "In addition, we pre-defined the in-channel and out-channel sizes of residual blocks empirically:"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-162",
"text": "As shown in Table 2 , MultiCNN performs better than CNN."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-163",
"text": "As the kernel number increases, the performance increases consistently in both full and top-50 code settings."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-164",
"text": "The performance reaches a peak when the kernel sizes are 3, 5, 9, 15, 19, 25 ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-165",
"text": "Moreover, ResCNN also performs better than CNN, but the difference is that the performances deteriorate as the residual block number increases."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-166",
"text": "ResCNN achieves the best performance when the residual block number is 1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-167",
"text": "Therefore, we applied the best configuration of Mul-tiCNN and ResCNN to MultiResCNN."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-168",
"text": "The results show that the performance of MultiResCNN was further improved after combining MultiCNN and ResCNN."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-169",
"text": "Therefore, we kept such configuration in other experiments."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-171",
"text": "**BASELINES**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-172",
"text": "CAML & DR-CAML The Convolutional Attention network for Multi-Label classification (CAML) was proposed by Mullenbach et al. (2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-173",
"text": "It has achieved the state-of-theart results on the MIMIC-III and MIMIC-II datasets among the models using unstructured text."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-174",
"text": "It consists of one convolutional layer and one attention layer to generate label-aware features for multi-label classification (McCallum 1999)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-175",
"text": "The Description Regularized CAML (DR-CAML) is an extension of CAML and incorporates the text description of each code to regularize the model."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-177",
"text": "**C-MEMNN**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-178",
"text": "The Condensed Memory Neural Network was proposed by Prakash et al. (2017) , which equips the neural network with iterative condensed memory representations."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-179",
"text": "The model achieved competitive results to predict the top-50 ICD codes for the medical notes in the MIMIC-III dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-180",
"text": "Shi et al. (2017) proposed a Characteraware LSTM-based Attention model to assign ICD codes to clinical notes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-181",
"text": "They employed LSTM-based language models to generate representations of clinical notes and ICD codes, and proposed an attention method to address the mismatch between notes and codes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-182",
"text": "They also focused on predicting the top-50 ICD codes for the medical notes in the MIMIC-III dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-184",
"text": "**C-LSTM-ATT**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-185",
"text": "SVM Perotte et al. (2013) experimented two approaches: one treats each ICD9 code independently (flat SVM) and the other uses the hierarchical nature of ICD9 codes (hierarchy SVM)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-186",
"text": "Their results show that the hierarchy SVM performs better than the flat one, yielding 29.3% f1-measure in the MIMIC-II dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-187",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-188",
"text": "**HA-GRU BAUMEL ET AL. (2018) PRESENTED A MODEL NAMED**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-189",
"text": "Hierarchical Attention Gated Recurrent Unit (HA-GRU) for automatic ICD coding of clinical documents."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-190",
"text": "HA-GRU includes two main layers: the first one encodes sentences and the second one encodes documents."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-191",
"text": "They reported their results in the MIMIC-II dataset, following the data split from Perotte et al. (2013) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-193",
"text": "**RESULTS**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-194",
"text": "In this section, we compared our model with existing work for automated ICD coding."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-195",
"text": "We ran our model three times for each experiment and each time we used different random seeds for parameter initialization."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-196",
"text": "The final results are the means and standard deviations of three runs."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-197",
"text": "Following prior work (Mullenbach et al. 2018) , we compared our model with existing work using the MIMIC-III and MIMIC-II dataset."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-198",
"text": "For the MIMIC-III dataset, we also performed the comparisons with two experimental settings, namely using the full codes and top-50 codes."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-199",
"text": "For the MIMIC-II dataset, only the full codes were employed."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-200",
"text": "Table 3 , we can see that our model obtained better results in the macro-AUC, micro-F1, precision@8 and precision@15, compared with the state-of-the-art models, CAML and DR-CAML."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-201",
"text": "Our model improved the macro-AUC by 0.013, the micro-F1 by 0.013, the precision@8 by 0.025, the precision@15 by 0.023."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-202",
"text": "In addition, our model achieved comparable performance on the micro-AUC and a slightly worse macro-F1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-203",
"text": "More importantly, we observed that our model is able to attain stable good results from the standard deviations."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-204",
"text": "Table 4 , we observed that our model outperformed all the baselines, namely C-MemNN (Prakash et al. 2017 ), C-LSTM-Att (Shi et al. 2017) , CAML and DR-CAML (Mullenbach et al. 2018) , in all evaluation metrics."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-205",
"text": "Our model improves the macro-AUC, micro-AUC, macro-F1, micro-F1 and preci-sion@5 by 0.015, 0.012, 0.030, 0.037 and 0.023, respectively."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-206",
"text": "Our model outperformed the C-MemNN by 0.221 and 0.066 in precision@5 and macro-AUC."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-207",
"text": "It also outperformed the C-LSTM-Att by 0.138 and 0.028 in micro-F1 and micro-AUC."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-208",
"text": "Its precision@5 is 0.032 and 0.023 higher than those of CAML and DR-CAML."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-209",
"text": "Table 5 shows the results on the full code set of MIMIC-II."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-210",
"text": "Perotte et al. (2013) used the SVM to predict ICD codes from clinical text and their method obtained 0.293 micro-F1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-211",
"text": "By contrast, our model outperformed their method by 0.171 in micro-F1."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-212",
"text": "Baumel et al. (2018) utilized the attention mechanism and GRU"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-213",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-214",
"text": "**MIMIC-III RESULTS (FULL CODES) AS SHOWN IN**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-215",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-216",
"text": "**MIMIC-III RESULTS (TOP-50 CODES) FROM**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-218",
"text": "**MIMIC-II RESULTS (FULL CODES)**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-219",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-220",
"text": "**DISCUSSION**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-222",
"text": "**COMPUTATIONAL COST ANALYSIS**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-223",
"text": "In this section, we analyzed the computational cost between the state-of-the-art model, CAML and our model, Mul-tiResCNN."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-224",
"text": "The analysis was conducted from four aspects, namely the parameter amount, training time, training epoch, inference speed."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-225",
"text": "Our experimental settings are as follows."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-226",
"text": "For CAML, we used the optimal hyper-parameter setting reported in their paper (Mullenbach et al. 2018) ."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-227",
"text": "For Mul-tiResCNN, we used six filters and 1 residual block, which obtained the best result in our hyper-parameter tuning experiments."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-228",
"text": "The batch size, learning rate and dropout rate are identical in every experiment."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-229",
"text": "We used the training set and development set of MIMIC-III (full codes) as experimental data."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-230",
"text": "The experiments were conducted on NVIDIA Tesla P40 GPUs."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-231",
"text": "Training will terminate if the performance on the development set does not increase for 10 times."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-232",
"text": "As shown in Table 6 , the parameter of MultiResCNN is approximately 1.9 times as many as that of CAML."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-233",
"text": "The training time of MultiResCNN is about 2.3 times more than that of CAML."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-234",
"text": "It is reasonable since MultiResCNN has more filters and layers."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-235",
"text": "Interestingly, MultiResCNN needs much less epochs to converge."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-236",
"text": "Considering the inference speed, CAML is approximately 1.5 times faster than MultiResCNN."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-237",
"text": "Overall, the computational cost of Mul-tiResCNN is larger than that of CAML, but we hold the opinion that the increased cost is still acceptable."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-238",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-239",
"text": "**EFFECT OF TRUNCATING DATA**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-240",
"text": "During preprocessing, we truncated the discharge summaries that are longer than 2,500 tokens."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-241",
"text": "To investigate the effect of the length limitation, we further conducted the experiments using 3,500, 4,500, 5,500 and 6,500."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-242",
"text": "We selected these values because the maximum length of the discharge summaries in the development set is approximately 6,300."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-243",
"text": "Results show that the performance differences between different settings are not significant."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-244",
"text": "P@8 ranges between 0.736 and 0.741, and micro-F1 ranges between 0.557 and 0.566."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-245",
"text": "2,500 seems to be a decent selection considering the tradeoff between performance and cost."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-246",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-247",
"text": "**LIMITATIONS**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-248",
"text": "In this study, the performance improvement mostly comes from deep and diversified representations of text."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-249",
"text": "In the future, we will explore how to incorporate BERT (Devlin et al. 2019 ) into this task effectively and efficiently."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-250",
"text": "In our preliminary experiments, BERT did not perform well due to the limitations of hardware and its fixed-length context."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-251",
"text": "Therefore, potential solutions include recurrent Transformer (Dai et al. 2019 ) and hierarchical BERT (Zhang, Wei, and Zhou 2019)."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-252",
"text": "Moreover, we chose the kernel sizes of the multi-filter layer and channel sizes of the residual layer empirically, which should be further studied and optimized in the future."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-253",
"text": "----------------------------------"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-254",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-255",
"text": "In this paper, we proposed a multi-filter residual convolutional neural network for ICD coding."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-256",
"text": "We conducted three experiments on the widely-used MIMIC-III and MIMIC-II datasets."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-257",
"text": "Results show that our model achieved the stateof-the-art performance compared with several competitive baselines."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-258",
"text": "We found that both multi-filter convolution and residual convolution helped the performance improvement with acceptable computational cost."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-259",
"text": "This shows deep and diversified text representations could benefit the ICD coding from clinical text."
},
{
"sent_id": "f2ff155003d139b3677f746baf3807-C001-260",
"text": "Our model can be a strong baseline for not only ICD coding, but also other text classification tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f2ff155003d139b3677f746baf3807-C001-25",
"f2ff155003d139b3677f746baf3807-C001-26",
"f2ff155003d139b3677f746baf3807-C001-27"
],
[
"f2ff155003d139b3677f746baf3807-C001-31"
],
[
"f2ff155003d139b3677f746baf3807-C001-36",
"f2ff155003d139b3677f746baf3807-C001-37"
],
[
"f2ff155003d139b3677f746baf3807-C001-59",
"f2ff155003d139b3677f746baf3807-C001-60"
],
[
"f2ff155003d139b3677f746baf3807-C001-147",
"f2ff155003d139b3677f746baf3807-C001-148"
],
[
"f2ff155003d139b3677f746baf3807-C001-172",
"f2ff155003d139b3677f746baf3807-C001-173",
"f2ff155003d139b3677f746baf3807-C001-174",
"f2ff155003d139b3677f746baf3807-C001-175"
]
],
"cite_sentences": [
"f2ff155003d139b3677f746baf3807-C001-26",
"f2ff155003d139b3677f746baf3807-C001-31",
"f2ff155003d139b3677f746baf3807-C001-37",
"f2ff155003d139b3677f746baf3807-C001-59",
"f2ff155003d139b3677f746baf3807-C001-148",
"f2ff155003d139b3677f746baf3807-C001-172"
]
},
"@MOT@": {
"gold_contexts": [
[
"f2ff155003d139b3677f746baf3807-C001-25",
"f2ff155003d139b3677f746baf3807-C001-26",
"f2ff155003d139b3677f746baf3807-C001-27"
]
],
"cite_sentences": [
"f2ff155003d139b3677f746baf3807-C001-26"
]
},
"@SIM@": {
"gold_contexts": [
[
"f2ff155003d139b3677f746baf3807-C001-31"
]
],
"cite_sentences": [
"f2ff155003d139b3677f746baf3807-C001-31"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"f2ff155003d139b3677f746baf3807-C001-36",
"f2ff155003d139b3677f746baf3807-C001-37"
],
[
"f2ff155003d139b3677f746baf3807-C001-130"
],
[
"f2ff155003d139b3677f746baf3807-C001-197",
"f2ff155003d139b3677f746baf3807-C001-198",
"f2ff155003d139b3677f746baf3807-C001-199",
"f2ff155003d139b3677f746baf3807-C001-200",
"f2ff155003d139b3677f746baf3807-C001-201",
"f2ff155003d139b3677f746baf3807-C001-202",
"f2ff155003d139b3677f746baf3807-C001-203",
"f2ff155003d139b3677f746baf3807-C001-204"
]
],
"cite_sentences": [
"f2ff155003d139b3677f746baf3807-C001-37",
"f2ff155003d139b3677f746baf3807-C001-130",
"f2ff155003d139b3677f746baf3807-C001-197",
"f2ff155003d139b3677f746baf3807-C001-204"
]
},
"@USE@": {
"gold_contexts": [
[
"f2ff155003d139b3677f746baf3807-C001-110",
"f2ff155003d139b3677f746baf3807-C001-111"
],
[
"f2ff155003d139b3677f746baf3807-C001-118",
"f2ff155003d139b3677f746baf3807-C001-119",
"f2ff155003d139b3677f746baf3807-C001-120"
],
[
"f2ff155003d139b3677f746baf3807-C001-125"
],
[
"f2ff155003d139b3677f746baf3807-C001-132",
"f2ff155003d139b3677f746baf3807-C001-133",
"f2ff155003d139b3677f746baf3807-C001-134"
],
[
"f2ff155003d139b3677f746baf3807-C001-147",
"f2ff155003d139b3677f746baf3807-C001-148"
],
[
"f2ff155003d139b3677f746baf3807-C001-152"
],
[
"f2ff155003d139b3677f746baf3807-C001-226"
]
],
"cite_sentences": [
"f2ff155003d139b3677f746baf3807-C001-110",
"f2ff155003d139b3677f746baf3807-C001-118",
"f2ff155003d139b3677f746baf3807-C001-125",
"f2ff155003d139b3677f746baf3807-C001-132",
"f2ff155003d139b3677f746baf3807-C001-148",
"f2ff155003d139b3677f746baf3807-C001-152",
"f2ff155003d139b3677f746baf3807-C001-226"
]
}
}
},
"ABC_f2925513a7cce2e80ade1f948164d0_5": {
"x": [
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-8",
"text": "**I. INTRODUCTION**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-28",
"text": "**A. WORD EMBEDDING**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-2",
"text": "Abstract-Many modern Artificial Intelligence (AI) systems make use of data embeddings, particularly in the domain of Natural Language Processing (NLP)."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-3",
"text": "These embeddings are learnt from data that has been gathered \"from the wild\" and have been found to contain unwanted biases."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-4",
"text": "In this paper we make three contributions towards measuring, understanding and removing this problem."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-5",
"text": "We present a rigorous way to measure some of these biases, based on the use of word lists created for social psychology applications; we observe how gender bias in occupations reflects actual gender bias in the same occupations in the real world; and finally we demonstrate how a simple projection can significantly reduce the effects of embedding bias."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-6",
"text": "All this is part of an ongoing effort to understand how trust can be built into AI systems."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-29",
"text": "A word embedding is a mapping of words into an ndimensional vector space."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-30",
"text": "Given a corpus of text, a word embedding can be created that will translate that corpus into a set of semantic vectors representing each word."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-31",
"text": "Each word that appears in the corpus will be represented by an n-dimensional vector to indicate its position within the embedding."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-32",
"text": "This embedding has a set of features that can be used in natural language processing methods."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-33",
"text": "The nearest neighbours of a word will be other words that have similar linguistic or semantic meaning, when comparing words using a measurement such as cosine similarity."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-34",
"text": "There are also linear substructures within the word embeddings that can explain how multiple words are related to each other, making it a useful preprocessing step for natural language processing applications."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-35",
"text": "A word vector for a given word will now be defined as w. Word vectors are normalized to unit length for measurement:"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-36",
"text": "All future analysis will be done using normalised word vectors, if vectors in the future are edited they will again be normalised to unit length."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-38",
"text": "**B. COMPARISON OF EMBEDDED WORDS**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-39",
"text": "Two words vectors w 1 and w 2 within a vector space can be compared by taking the dot product of their words:"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-40",
"text": "As both word vectors are normalized, this is equivalent to the cosine similarity between the two word vectors."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-41",
"text": "A cosine similarity closer to 1 means that the vectors are similar to each other, while a cosine similarity of 0 means that the vectors are orthogonal to each other."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-42",
"text": "In addition to comparisons between individual word vectors, we can compare an individual word vector to a set of word vectors."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-43",
"text": "This is done by finding the mean of the set, normalizing the resulting vector and calculating the dot product with the individual word vectors as follows:"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-44",
"text": "The resulting calculation gives us how closely an individual word is associated with a larger set of words."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-45",
"text": "This association can be used to assess how closely related a given word is to different topics or concepts within the embedding space."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-47",
"text": "**C. REMOVING BIAS**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-48",
"text": "To remove bias, first two vectors have to be identified that contain contrasting directions of the bias."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-49",
"text": "These two vectors (w 1 and w 2 ) must be considered \"opposite\" of each other semantically, in terms of the bias that is required to be removed."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-50",
"text": "The following method of debiasing is the same as presented in [2] :"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-51",
"text": "where the vector w b will have the direction of bias in the embedding (for example, he and she are different genders and could potentially be used to capture a gender direction)."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-52",
"text": "Using this bias direction, all word vectors can now have that component removed by projecting them into a space that is orthogonal to the bias vector:"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-53",
"text": "where w \u22a5 is the original word vector with the biased component removed."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-54",
"text": "This resulting vector will now have the number of effective dimensions reduced to n \u2212 1, indicating that it is orthogonal to the bias vector."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-55",
"text": "These orthogonal word vectors are required to be again be normalised for further analysis."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-57",
"text": "**III. EXPERIMENTS**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-58",
"text": "In this paper, we conduct three experiments on semantic word embeddings."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-59",
"text": "We first propose a new version of the Word Embedding Association Tests studied in [3] by using the LIWC lexica to systematically detect and measure the biases within the embedding, keeping the tests comparable with the same set of target words."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-60",
"text": "We further extend this work using additional sets of target words, and compare sentiment across male and female names."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-61",
"text": "Furthermore, we investigate gender bias in words that represent different occupations, comparing these associations with UK national employment statistics."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-62",
"text": "In the last experiment, we use orthogonal projections [2] to debias our word embeddings, and measure the reduction in the biases demonstrated in the previous two experiments."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-64",
"text": "**A. DATA DESCRIPTION AND EMBEDDING**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-65",
"text": "In all of our experiments, the first step is to obtain semantic vectors from a word embedding that we wish to analyse."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-66",
"text": "We use GloVe embeddings [18] , pre-trained using a window size of 10 words on a combination of Wikipedia from 2014, and the English Gigaword corpus [16] , where each of the 400,000 words in the vocabulary for this embedding are represented by a 300-dimensional vector."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-67",
"text": "These vectors capture, in a quantitative way, the nuanced semantics between words necessary to perform meaningful analysis of words, reflecting the semantics found in the underlying corpora used to build them."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-68",
"text": "The Wikipedia data includes the page content from all English Wikipedia pages as they appeared in 2014 when a snapshot was taken."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-69",
"text": "The English Gigaword corpus is an archive of newswire text data from seven distinct international sources of English newswire covering several years up until the end of 2010 [16] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-71",
"text": "**B. EXPERIMENT 1: LIWC WORD EMBEDDING ASSOCIATION TEST (LIWC-WEAT)**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-72",
"text": "In this experiment, we introduce the LIWC Word Embedding Association Test (LIWC-WEAT), where we measure the association between sets of target words with larger sets of words known to relate to sentiment and gender coming from the LIWC lexica [17] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-73",
"text": "We begin by using the target words from [3] which were originally used in [8] , allowing us to directly compare our findings with the original WEAT."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-74",
"text": "Our approach differs from that of [3] in that while we use the same set of target words in each test, we use an expanded set of attribute words, allowing us to perform a more rigorous, systematic study of the associations found within the word embeddings."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-75",
"text": "For this, we use attribute words sourced from the LIWC lexica [17] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-76",
"text": "The categories specified in the LIWC lexica are based on many factors, including emotions, thinking styles, and social concerns."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-77",
"text": "For each of the original word categories used in [3] , we matched them with their closest equivalent within the LIWC categories, for example matching the word lists for 'career' and 'family' with the 'work' and 'family' LIWC categories."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-78",
"text": "We tested the association between each target word and the set of attribute words using the method described in Sec. II-B, focussing on the differences in association between sentimental terms and European-and African-American names, subject disciplines to each of the genders, career and family terms with gendered names, as well as looking at the association between gender and sentiment."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-80",
"text": "**1) ASSOCIATION OF EUROPEAN AND AFRICAN-AMERICAN NAMES WITH SENTIMENT :**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-81",
"text": "Taking the list of target European-American and African-American names used in [3] , we tested each of them for their associated with the positive and negative emotion concepts found in [17] by using the methodology described by Eq. 3 in Sec. II-B, replacing the short list of words used to originally represent pleasant and unpleasant attribute sets."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-82",
"text": "Our test found that while both European-American names and African-American names are more associated with positive emotions than negative emotions, the test showed that European-American names are more associated with positive emotions than their African-American counterparts, as shown in Fig. 1a ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-83",
"text": "This finding supports the association test in [3] , where they also found that European-American names were more pleasant than African-American names."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-85",
"text": "**2) ASSOCIATION OF SUBJECT DISCIPLINES WITH GENDER :**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-86",
"text": "A further test was conducted to find the association between words related to different subject disciplines (e.g. arts, maths, science) with each of the genders using the 'he' and 'she' categories from LIWC [17] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-87",
"text": "The results of our test again support the findings of [3] , with Maths and Science terms being more closely associated with males, while Arts terms are more closely associated with females, as shown in Fig. 1b."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-88",
"text": "3) Association of Gender with Career and Family : Taking the list of target gendered names used in [3] , we tested each of them for their associated with the career and family concepts using the categories of 'work' and 'family' found in LIWC [17] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-89",
"text": "As shown in Fig. 1c , we found that the set of male names was more associated with the concept of work, while the female names were more associated with family, mirroring the results found in [3] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-90",
"text": "Extending this test, we generated a much larger set of male and female target names from an online list of baby names 1 ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-91",
"text": "Repeating the same test on this larger set of names, we found that male and female names were much less separated than suggested by previous results, with only minor differences between the two, as shown in Fig. 1d ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-93",
"text": "**4) ASSOCIATION OF GENDER WITH SENTIMENT :**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-94",
"text": "Extending the number of tests performed in the original WEAT study, we additionally tested the set of target male and female names and computed their association with the positive and negative emotions."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-95",
"text": "We found that both sets of names are considered to be positive, similarly to the European-American and AfricanAmerican names used in the previous test, but with male names appearing to be slightly more positive, as shown in Fig. 1e ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-96",
"text": "We further tested these associations using our extended list of gendered baby names, as in Sec. III-B3, finding that there is no clear difference between the positive and negative sentiment attached to names of different gender in the word embedding."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-97",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-98",
"text": "**C. EXPERIMENT 2: ASSOCIATIONS BETWEEN OCCUPATIONS AND GENDER**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-99",
"text": "In this experiment, we test the association between different occupations and gender categories coming from LIWC [17] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-100",
"text": "The association between each of the occupations is further contrasted against official employment statistics for the United Kingdom detailing the actual number of people working in each job role."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-102",
"text": "**1) ASSOCIATION OF OCCUPATION WITH GENDER:**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-103",
"text": "We first generated a list of 62 occupations from data published by the Office of National Statistics [15] , filtering the list to only include those occupations for which there is reliable employment statistics and can be summarised by a single word in the embedding, e.g. doctor, engineer, secretary."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-104",
"text": "For each of these occupations, we tested their association with each of the genders, as shown in Fig. 2a , with the top ten occupations associated with each gender shown in Table I ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-105",
"text": "We found there was a 70% (p-value < 10 \u221210 ) correlation in the closeness of association between occupations and each of the gender attribute sets."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-107",
"text": "**2) OCCUPATION STATISTICS VERSUS OCCUPATION ASSOCIATION :**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-108",
"text": "Using the list of occupations from the previous section, we compared their association with each of the genders with the ratio of the actual number of men and women working in those roles, as recorded in the official statistics [15] , where 1 indicates only men work in this role, and 0 only women."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-109",
"text": "We found that there is a strong, significant correlation (\u03c1 = 0.57, p-value < 10 \u22126 ) between the word embedding association between gender and occupation and the number of people of each gender in the United Kingdom working in those roles."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-110",
"text": "This supports a similar finding for U.S. employment statistics using an independent set of occupations found in [3] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-112",
"text": "**D. EXPERIMENT 3: MINIMISING ASSOCIATIONS VIA ORTHOGONAL PROJECTION**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-113",
"text": "In this experiment, we deploy a method for removing bias from word embeddings, first published in [2] , and repeat all previous association tests related to gender reported in this paper, empirically showing the effect of bias removal on the word associations."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-115",
"text": "**1) FINDING AN ORTHOGONAL PROJECTION FOR GENDER:**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-116",
"text": "To remove gender from the embedding, we first need to find a projection within the space that best encapsulates the gender differences between words."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-117",
"text": "To find the best projection, we began from a list of 5 gendered pronouns in LIWC [17] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-118",
"text": "For each of the pronouns, we paired them with their gender-opposite, for example pairing \"he\" and \"she\", \"himself\" and \"herself\" and so on."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-119",
"text": "Taking the word vector from the embedding for each pronoun, we computed their difference, as described in Sec. II-C, giving us a set of 5 potential gender projections."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-120",
"text": "Each gender projection was tested against an independent set of paired gender words sourced from WordNet [13] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-121",
"text": "After applying the gender projection to the test word-pairs, following the procedure of [2] , we measured the average difference between the word-pairs."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-122",
"text": "The gender projection that led to the word-pairs that are closest together (smallest difference) was then selected as our gender projection, corresponding to the difference between the vectors for \"himself\" and \"herself\"."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-123",
"text": "2) Revised Association Tests: Using the orthogonal gender projection found in the previous section, we repeated the tests from the LIWC-WEAT in Sec. III-B that were related to gender."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-124",
"text": "This included the association of science, mathematics and the arts with gender, the association of male and females names with sentiment, work and family, and the ranking of occupations by their gender association."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-125",
"text": "In Experiment 1, we previously found that the disciplines of science and maths were more associated with male terms in the embedding, while the arts were closer to female terms."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-126",
"text": "The association of each of these subject disciplines with gender after orthogonal projection was found to be more balanced, with closer to equal association for both male and female terms, shown in Fig. 3a ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-127",
"text": "Male and Females names tested in [3] showed a clear distinction in their association with work and family respectively, with our replication of the test in Sec. III-B3 finding the same results."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-128",
"text": "Performing the same tests again after applying the gender projection to both name lists, we wished to quantify the change in associations."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-129",
"text": "We calculated the change in the distance between the centroids of each set of names before and after applying the orthogonal gender projection, finding that the association with work for males and family for females reduced, closing the gap between male and female names by 37.5% for the target names found in the original WEAT and 66% for the extended list of names respectively."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-130",
"text": "In our experiment looking at the association of positive and negative emotions with male and female names, we found that male and female names were both positive, with male names being slightly more associated with positive emotions than female names."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-131",
"text": "The same finding were also true when using a larger set of names and making the same comparison."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-132",
"text": "Applying the orthogonal gender projection to the word vectors, we again looked at how much the difference between the two sets was reduced."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-133",
"text": "We found that for the target names found in the original WEAT, the distance between the two sets of names was reduced by 27%, while for the extended list the difference was reduced by 40%."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-134",
"text": "In Experiment 2, we found that there was a significant correlation of 70% between the male and female association of each occupation, while comparing the associations with official statistics of the number of men and women in each role showed a correlation of 53%."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-135",
"text": "Again, applying the orthogonal gender projection and repeating these tests, we found that, on average, occupations moved closer to having an equal association with each of the genders (Fig. 3f) and that their association with gender was not significantly correlated (\u03c1 = 0.178, p-value = 0.167) with the number of men and women working in each role."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-137",
"text": "**IV. DISCUSSION**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-138",
"text": "In our experiments, we have shown the effect of one debiasing procedure for reducing the association a given word has in a word embedding generated from natural language corpora with concepts related to gender."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-139",
"text": "Being able to do so relies on a set of gendered terms from which we can obtain pairings with opposite meaning, allowing us to find an orthogonal projection within the space."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-140",
"text": "This will not always be possible for every type of bias that we may wish to remove (or at least reduce) in an embedding because there will not always be a suitable word vector pair that can be used to represent a given bias."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-141",
"text": "Other biases which are present may also be impossible to detect with our LIWC-WEAT method, as a pre-defined and validated list of words from LIWC were required to perform the tests."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-142",
"text": "Other potentially undesired biases such as race or age are not currently able to be captured using the LIWC lexica, and thus different, carefully considered sets of words would need to be curated."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-143",
"text": "Indeed, general solutions to this problem are probably impossible, for philosophical reasons, but we believe that biases can at least be mitigated or compensated for, by removing specific subtypes of bias, given we have ways to measure and detect them in the first place."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-144",
"text": "However, in this process, care should also be taken as we may introduce or compound other existing biases in the embeddings."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-146",
"text": "**V. CONCLUSIONS**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-147",
"text": "If we want AI to take a central position in society, we need to be able to detect and remove any source of possible discrimination, to ensure fairness and transparency, and ultimately trust in these learning systems."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-148",
"text": "Principled methods to measure biases will certainly need to play a central role in this, as will an understanding of the origins of biases, and new developments in methods that can be used to remove biases once detected."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-149",
"text": "In this paper, we have introduced the LIWC-WEAT, a set of objective tests extending the association tests in [3] by using the LIWC lexica to measure bias within word embeddings."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-150",
"text": "We found bias in both the associations of gender and race, as first described in [3] , while additionally finding that male names have a slightly higher positive association than female names."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-151",
"text": "Biases found in the embedding were also shown to reflect biases in the real world and the media, where we found a correlation between the number of men and women in an occupation and its association with each set of male and female names."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-152",
"text": "Finally, using a projection algorithm [2] , we were able to reduce the gender bias shown in the embeddings, resulting in a decrease in the difference between associations for all tests based upon gender."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-153",
"text": "Further work in this direction will include removing bias in n-gram embeddings, embeddings that include multiple languages and new procedures for both generating better projections to remove a given bias, using debiased embeddings as an input to an upstream system and testing performance, and learning word embeddings which can be generated without chosen directions by construction."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-9",
"text": "With the latest wave of learning models taking advantage of advances in deep learning [21] , [22] , [23] , Artificial Intelligence (AI) systems are gaining widespread publicity, coupled with a drive from industry to incorporate intelligence into all manner of processes that handle our private and personal data, giving them a central position in our modern-day society."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-10",
"text": "This development has lead to demand for fairer AI, where we wish to establish trust in the automated intelligent systems by ensuring that systems represent us fairly and transparently."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-11",
"text": "However, there has been growing concern about potential biases in learning systems [1] , [6] which can be difficult to analyse or query for explanations of their predictions, leading to an increasing number of studies investigating the way blackbox systems represent knowledge and make decisions [7] , [9] , [11] , [19] , [20] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-12",
"text": "Indeed, principled methods are now required that allow us to measure, understand and remove biases in our data in order for these systems to be truly accepted as a prominent part of our lives."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-13",
"text": "In the domain of text, many modern approaches often begin by embedding the input text data into an embedding space that is used as the first layer in a subsequent deep network [4] , [14] ."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-14",
"text": "These word embeddings have been shown to contain the same biases [3] , due to the source data from which they are trained."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-15",
"text": "In effect, biases from the source data, such as in the differences in representation for men and women, that have been found in many different large-scale studies [5] , [10] , [12] , carry through to the semantic relations in the word embeddings, which become baked into the learning systems that are built on top of them."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-16",
"text": "In this paper, we make three contributions towards addressing these concerns."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-17",
"text": "First we propose a new version of the Word Embedding Association Tests (WEATs) studied in [3] , designed to demonstrate and quantify bias in word embeddings, which puts them on a firm foundation by using the Linguistic Inquiry and Word Count (LIWC) lexica [17] to systematically detect and measure embedding biases."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-18",
"text": "With this improved experimental setting, we find that European-American names are viewed more positively than African-American names, male names are more associated with work while female names are more associated with family, and that the academic disciplines of science and maths are more associated with male terms than the arts, which are more associated with female terms."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-19",
"text": "Using this new methodology, we then find that there is a gender bias in the way different occupations are represented by the embedding."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-20",
"text": "Furthermore, we use the latest official employment statistics in the UK, and find that there is a correlation between the ratio of men and women working in different occupation roles and how those roles are associated with gender in the word embeddings."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-21",
"text": "This suggests that biases in the embeddings reflect biases in the world."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-22",
"text": "Finally, we look at methods of removing gender bias from the word embeddings."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-23",
"text": "Having established that there is a direction in the embedding space that correlates with gender, we use a simple orthogonal projection to remove that dimension from the embedding."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-24",
"text": "After projecting the embeddings, we investigate the effect on bias in the embeddings by considering the changes in associations between the words, demonstrating that the associations in the modified embeddings now correlate less to UK employment statistics among other things."
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-26",
"text": "**II. METHODOLOGY**"
},
{
"sent_id": "f2925513a7cce2e80ade1f948164d0-C001-27",
"text": "----------------------------------"
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-13",
"f2925513a7cce2e80ade1f948164d0-C001-14",
"f2925513a7cce2e80ade1f948164d0-C001-15"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-17",
"f2925513a7cce2e80ade1f948164d0-C001-18",
"f2925513a7cce2e80ade1f948164d0-C001-19"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-59",
"f2925513a7cce2e80ade1f948164d0-C001-60",
"f2925513a7cce2e80ade1f948164d0-C001-61"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-109",
"f2925513a7cce2e80ade1f948164d0-C001-110"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-149"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-150",
"f2925513a7cce2e80ade1f948164d0-C001-151"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-14",
"f2925513a7cce2e80ade1f948164d0-C001-17",
"f2925513a7cce2e80ade1f948164d0-C001-59",
"f2925513a7cce2e80ade1f948164d0-C001-110",
"f2925513a7cce2e80ade1f948164d0-C001-149",
"f2925513a7cce2e80ade1f948164d0-C001-150"
]
},
"@MOT@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-13",
"f2925513a7cce2e80ade1f948164d0-C001-14",
"f2925513a7cce2e80ade1f948164d0-C001-15"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-17",
"f2925513a7cce2e80ade1f948164d0-C001-18",
"f2925513a7cce2e80ade1f948164d0-C001-19"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-149"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-14",
"f2925513a7cce2e80ade1f948164d0-C001-17",
"f2925513a7cce2e80ade1f948164d0-C001-149"
]
},
"@DIF@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-17",
"f2925513a7cce2e80ade1f948164d0-C001-18",
"f2925513a7cce2e80ade1f948164d0-C001-19"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-150",
"f2925513a7cce2e80ade1f948164d0-C001-151"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-17",
"f2925513a7cce2e80ade1f948164d0-C001-150"
]
},
"@EXT@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-59",
"f2925513a7cce2e80ade1f948164d0-C001-60",
"f2925513a7cce2e80ade1f948164d0-C001-61"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-73",
"f2925513a7cce2e80ade1f948164d0-C001-74",
"f2925513a7cce2e80ade1f948164d0-C001-75"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-77",
"f2925513a7cce2e80ade1f948164d0-C001-78"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-89",
"f2925513a7cce2e80ade1f948164d0-C001-90",
"f2925513a7cce2e80ade1f948164d0-C001-91"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-59",
"f2925513a7cce2e80ade1f948164d0-C001-73",
"f2925513a7cce2e80ade1f948164d0-C001-74",
"f2925513a7cce2e80ade1f948164d0-C001-77",
"f2925513a7cce2e80ade1f948164d0-C001-89"
]
},
"@USE@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-81"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-88"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-81",
"f2925513a7cce2e80ade1f948164d0-C001-88"
]
},
"@SIM@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-82",
"f2925513a7cce2e80ade1f948164d0-C001-83"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-87"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-109",
"f2925513a7cce2e80ade1f948164d0-C001-110"
],
[
"f2925513a7cce2e80ade1f948164d0-C001-127"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-83",
"f2925513a7cce2e80ade1f948164d0-C001-87",
"f2925513a7cce2e80ade1f948164d0-C001-110",
"f2925513a7cce2e80ade1f948164d0-C001-127"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"f2925513a7cce2e80ade1f948164d0-C001-89",
"f2925513a7cce2e80ade1f948164d0-C001-90",
"f2925513a7cce2e80ade1f948164d0-C001-91"
]
],
"cite_sentences": [
"f2925513a7cce2e80ade1f948164d0-C001-89"
]
}
}
},
"ABC_4cb16f436d910d82c3661052c1fa30_5": {
"x": [
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-76",
"text": "C.E is the set of edges crossing the cut, and G is the current graph before the cut."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-124",
"text": "The methods disagree about the placement of mention 1."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-2",
"text": "In this paper we describe a coreference resolution method that employs a classification and a clusterization phase."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-3",
"text": "In a novel way, the clusterization is produced as a graph cutting algorithm, in which nodes of the graph correspond to the mentions of the text, whereas the edges of the graph constitute the confidences derived from the coreference classification."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-4",
"text": "In experiments, the graph cutting algorithm for coreference resolution, called BESTCUT, achieves state-of-the-art performance."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-7",
"text": "Recent coreference resolution algorithms tackle the problem of identifying coreferent mentions of the same entity in text as a two step procedure: (1) a classification phase that decides whether pairs of noun phrases corefer or not; and (2) a clusterization phase that groups together all mentions that refer to the same entity."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-8",
"text": "An entity is an object or a set of objects in the real world, while a mention is a textual reference to an entity 1 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-9",
"text": "Most of the previous coreference resolution methods have similar classification phases, implemented either as decision trees (Soon et al., 2001) or as maximum entropy classifiers (Luo et al., 2004) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-10",
"text": "Moreover, these methods employ similar feature sets."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-11",
"text": "The clusterization phase is different across current approaches."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-12",
"text": "For example, there are several linking decisions for clusterization."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-13",
"text": "(Soon et al., 2001 ) advocate the link-first decision, which links a mention to its closest candidate referent, while (Ng and Cardie, 2002) consider instead the link-best decision, which links a mention to its most confident candidate referent."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-14",
"text": "Both these clustering decisions are locally optimized."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-15",
"text": "In contrast, globally optimized clustering decisions were reported in (Luo et al., 2004) and (DaumeIII and Marcu, 2005a) , where all clustering possibilities are considered by searching on a Bell tree representation or by using the Learning as Search Optimization (LaSO) framework (DaumeIII and Marcu, 2005b) respectively, but the first search is partial and driven by heuristics and the second one only looks back in text."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-16",
"text": "We argue that a more adequate clusterization phase for coreference resolution can be obtained by using a graph representation."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-17",
"text": "In this paper we describe a novel representation of the coreference space as an undirected edge-weighted graph in which the nodes represent all the mentions from a text, whereas the edges between nodes constitute the confidence values derived from the coreference classification phase."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-18",
"text": "In order to detect the entities referred in the text, we need to partition the graph such that all nodes in each subgraph refer to the same entity."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-19",
"text": "We have devised a graph partitioning method for coreference resolution, called BESTCUT, which is inspired from the well-known graph-partitioning algorithm Min-Cut (Stoer and Wagner, 1994) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-20",
"text": "BESTCUT has a different way of computing the cut weight than Min-Cut and a different way of stopping the cut 2 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-21",
"text": "Moreover, we have slightly modified the Min-Cut procedures."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-22",
"text": "BESTCUT replaces the bottom-up search in a tree representation (as it was performed in (Luo et al., 2004) ) with the top-down problem of obtaining the best partitioning of a graph."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-23",
"text": "We start by assuming that all mentions refer to a single entity; the graph cut splits the mentions into subgraphs and the split-ting continues until each subgraph corresponds to one of the entities."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-24",
"text": "The cut stopping decision has been implemented as an SVM-based classification (Cortes and Vapnik, 1995) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-25",
"text": "The classification and clusterization phases assume that all mentions are detected."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-26",
"text": "In order to evaluate our coreference resolution method, we have (1) implemented a mention detection procedure that has the novelty of employing information derived from the word senses of common nouns as well as selected lexico-syntactic information; and (2) used a maximum entropy model for coreference classification."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-50",
"text": "**FEATURE REPRESENTATION**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-27",
"text": "The experiments conducted on MUC and ACE data indicate state-of-the-art results when compared with the methods reported in (Ng and Cardie, 2002) and (Luo et al., 2004) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-28",
"text": "The remainder of the paper is organized as follows."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-29",
"text": "In Section 2 we describe the coreference resolution method that uses the BESTCUT clusterization; Section 3 describes the approach we have implemented for detecting mentions in texts; Section 4 reports on the experimental results; Section 5 discusses related work; finally, Section 6 summarizes the conclusions."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-31",
"text": "**BESTCUT COREFERENCE RESOLUTION**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-32",
"text": "For each entity type (PERSON, ORGANIZATION, LOCATION, FACILITY or GPE 3 ) we create a graph in which the nodes represent all the mentions of that type in the text, the edges correspond to all pairwise coreference relations, and the edge weights are the confidences of the coreference relations."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-33",
"text": "We will divide this graph repeatedly by cutting the links between subgraphs until a stop model previously learned tells us that we should stop the cutting."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-34",
"text": "The end result will be a partition that approximates the correct division of the text into entities."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-35",
"text": "We consider this graph approach to clustering a more accurate representation of the relations between mentions than a tree-based approach that treats only anaphora resolution, trying to connect mentions with candidate referents that appear in text before them."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-36",
"text": "We believe that a correct resolution has to tackle cataphora resolution as well, by taking into account referents that appear in the text after the anaphors."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-37",
"text": "Furthermore, we believe that a graph representation of mentions in a text is more adequate than a tree representation because the coreference relation is symmetrical in addi-3 Entity types as defined by (NIST, 2003) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-38",
"text": "tion to being transitive."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-39",
"text": "A greedy bottom-up approach does not make full use of this property."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-40",
"text": "A graph-based clusterization starts with a complete overall view of all the connections between mentions, therefore local errors are much less probable to influence the correctness of the outcome."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-41",
"text": "If two mentions are strongly connected, and one of them is strongly connected with the third, all three of them will most probably be clustered together even if the third edge is not strong enough, and that works for any order in which the mentions might appear in the text."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-43",
"text": "**LEARNING ALGORITHM**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-44",
"text": "The coreference confidence values that become the weights in the starting graphs are provided by a maximum entropy model, trained on the training datasets of the corpora used in our experiments."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-45",
"text": "For maximum entropy classification we used a maxent 4 tool."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-46",
"text": "Based on the data seen, a maximum entropy model (Berger et al., 1996) offers an expression (1) for the probability that there exists coreference C between a mention m i and a mention m j ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-47",
"text": "where g k (m i , m j , C) is a feature and \u03bb k is its weight; Z(m i , m j ) is a normalizing factor."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-48",
"text": "We created the training examples in the same way as (Luo et al., 2004) , by pairing all mentions of the same type, obtaining their feature vectors and taking the outcome (coreferent/noncoreferent) from the key files."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-51",
"text": "We duplicated the statistical model used by (Luo et al., 2004) , with three differences."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-52",
"text": "First, no feature combination was used, to prevent long running times on the large amount of ACE data."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-53",
"text": "Second, through an analysis of the validation data, we implemented seven new features, presented in Table 1."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-54",
"text": "Third, as opposed to (Luo et al., 2004) , who represented all numerical features quantized, we translated each numerical feature into a set of binary features that express whether the value is in certain intervals."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-55",
"text": "This transformation was necessary because our maximum entropy tool performs better on binary features."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-56",
"text": "(Luo et al., 2004) 's features were not reproduced here from lack of space; please refer to the relevant paper for details."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-57",
"text": "Each of these initial graphs will be cut repeatedly until the resulting partition is satisfactory."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-58",
"text": "In each cut, we eliminate from the graph the edges between subgraphs that have a very weak connection, and whose mentions are most likely not part of the same entity."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-59",
"text": "Formally, the graph model can be defined as follows."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-60",
"text": "Let M = {m i : 1..n} be n mentions in the document and E = {e j : 1..m} be m entities."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-61",
"text": "Let g : M \u2192 E be the map from a mention m i \u2208 M to an entity e j \u2208 E. Let c : M xM \u2192 [0, 1] be the confidence the learning algorithm attaches to the coreference between two mentions m i , m j \u2208 M ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-62",
"text": "Let T = {t k : 1..p} be the set of entity types or classes."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-63",
"text": "Then we attach to each entity class t k an undirected, edge-weighted graph"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-64",
"text": "The partitioning of the graph is based at each step on the cut weight."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-65",
"text": "As a starting point, we used the Min-Cut algorithm, presented and proved correct in (Stoer and Wagner, 1994) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-66",
"text": "In this simple and efficient method, the weight of the cut of a graph into two subgraphs is the sum of the weights of the edges crossing the cut."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-67",
"text": "The partition that minimizes the cut weight is the one chosen."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-68",
"text": "The main procedure of the algorithm computes cutsof-the-phase repeatedly and selects the one with the minimum cut value (cut weight)."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-69",
"text": "We adapted this algorithm to our coreference situation."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-70",
"text": "To decide the minimum cut (from here on called the BESTCUT), we use as cut weight the number of mentions that are correctly placed in their set."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-71",
"text": "The method for calculating the correctness score is presented in Figure 1 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-72",
"text": "The BESTCUT at one stage is the cut-of-the-phase with the highest correctness score."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-73",
"text": "then corrects-max++ 9 return (corrects-avg + corrects-max) / 2 An additional learning model was trained to decide if cutting a set of mentions is better or worse than keeping the mentions together."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-74",
"text": "The model was optimized to maximize the ECM-F score 5 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-75",
"text": "We will denote by S the larger part of the cut and T the smaller one."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-77",
"text": "S.V and T.V are the set of vertexes in S and in T , respectively."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-78",
"text": "S.E is the set of edges from S, while T.E is the set of edges from T ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-79",
"text": "The features for stopping the cut are presented in Table 2 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-80",
"text": "The model was trained using 10-fold cross-validation on the training set."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-81",
"text": "In order to learn when to stop the cut, we generated a list of positive and negative examples from the training files."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-82",
"text": "Each training example is associated with a certain cut (S, T )."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-83",
"text": "Since we want to learn a stop function, the positive examples must be examples that describe when the cut must not be done, and the negative examples are examples that present situations when the cut must be performed."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-84",
"text": "Let us consider that the list of entities from a text is E = {e j : 1..m} with e j = {m i 1 , m i 2 , ...m i k } the list of mentions that refer to e j ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-85",
"text": "We generated a negative example for each pair (S = {e i }, T = {e j }) with i = jeach entity must be separated from any other en-"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-87",
"text": "**FEATURE NAME FEATURE DESCRIPTION ST-RATIO**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-88",
"text": "|S.V |/|T.V | -the ratio between the cut parts ce-ratio |C.E|/|G.E| -the proportion of the cut from the entire graph c-min min(C.E) -the smallest edge crossing the cut c-max max(C.E) -the largest edge crossing the cut c-avg avg(C.E) -the average of the edges crossing the cut c-hmean hmean(C.E) -the harmonic mean of the edges crossing the cut c-hmeax hmeax(C.E) -a variant of the harmonic mean."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-89",
"text": "hmeax(C.E) = 1 \u2212 hmean(C.E \u2032 ) where each edge from E \u2032 has the weight equal to 1 minus the corresponding edge from E lt-c-avg-ratio how many edges from the cut are less than the average of the cut (as a ratio) lt-c-hmeanratio how many edges from the cut are less than the harmonic mean of the cut (as a ratio) st-avg avg(S.E + T.E) -the average of the edges from the graph when the edges from the cut are not considered g-avg avg(G.E) -the average of the edges from the graph st-wrong-avgratio how many vertexes are in the wrong part of the cut using the average measure for the 'wrong' (as a ratio) st-wrongmax-ratio how many vertexes are in the wrong part of the cut using the max measure for the 'wrong' (as a ratio) lt-c-avg-ratio < st-lt-c-avgratio 1 if r1 < r2, 0 otherwise; r1 is the ratio of the edges from C.E that are smaller than the average of the cut; r2 is the ratio of the edges from S.E + T.E that are smaller than the average of the cut g-avg > stavg 1 if the avg(G.E) > avg(S.E + T.E), and 0 otherwise tity."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-90",
"text": "We also generated negative examples for all pairs (S = {e i }, T = E \\ S) -each entity must be separated from all the other entities considered together."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-91",
"text": "To generate positive examples, we simulated the cut on a graph corresponding to a single entity e j ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-92",
"text": "Every partial cut of the mentions of e j was considered as a positive example for our stop model."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-93",
"text": "We chose not to include pronouns in the BEST-CUT initial graphs, because, since most features are oriented towards Named Entities and common nouns, the learning algorithm (maxent) links pronouns with very high probability to many possible antecedents, of which not all are in the same chain."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-94",
"text": "Thus, in the clusterization phase the pronouns would act as a bridge between different entities that should not be linked."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-95",
"text": "To prevent this, we solved the pronouns separately (at the end of"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-96",
"text": "entities.add(NewEntity(G)) 9 else 10 queue.push back(S) 11"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-97",
"text": "queue.push back(T ) 12 return entities the BESTCUT algorithm) by linking them to their antecedent with the best coreference confidence."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-98",
"text": "Figure 2 details the main procedure of the BESTCUT algorithm."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-99",
"text": "The algorithm receives as input a weighted graph having a vertex for each mention considered and outputs the list of entities created."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-100",
"text": "In each stage, a cut is proposed for all subgraphs in the queue."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-101",
"text": "In case StopTheCut decides that the cut must be performed on the subgraph, the two sides of the cut are added to the queue (lines 10-11); if the graph is well connected and breaking the graph in two parts would be a bad thing, the current graph will be used to create a single entity (line 8)."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-102",
"text": "The algorithm ends when the queue becomes empty."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-103",
"text": "ProposeCut (FigProposeCut ure 3) returns a cut of the graph obtained with an algorithm similar to the Min-Cut algorithm's procedure called MinimumCut."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-104",
"text": "The differences between our algorithm and the Min-Cut procedure are that the most tightly connected vertex in each step of the ProposeCutPhase procedure, z, is found using expression 2:"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-105",
"text": "where w a (A, y) = 1 |A| x\u2208A w(x, y), and the islighter test function uses the correctness score presented before: the partial cut with the larger correctness score is better."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-106",
"text": "The ProposeCutPhase function is presented in Figure 4 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-107",
"text": "last \u2190 the most tightly connected vertex 4 add last to A 5 store the cut-of-the-phase and shrink G by merging the two vertexes added last 6 return (G.V \\ {last}, last) Figure 4 : The algorithm for ProposeCutPhase."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-109",
"text": "**AN EXAMPLE**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-110",
"text": "Let us consider an example of how the BESTCUT algorithm works on two simple sentences (Figure 5) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-111",
"text": "The entities present in this example are: {Mary 1 , the girl 5 } and {a brother 2 , John 3 , The boy 4 }."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-112",
"text": "Since they are all PERSONs, the algorithm The initial graph is illustrated in Figure 6 , with the coreference relation marked through a different coloring of the nodes."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-113",
"text": "Each node number corresponds to the mention with the same index in Figure 5 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-114",
"text": "The strongest confidence score is between a brother 2 and John 3 , because they are connected through an apposition relation."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-115",
"text": "The graph was simplified by eliminating the edges that have an insignificant weight, e.g. the edges between John 3 and the girl 5 or between Mary 1 and a brother 2 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-116",
"text": "Function BESTCUT starts with the whole graph."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-117",
"text": "The first cut of the phase, obtained by function ProposeCutPhase, is the one in Figure 7 .a."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-118",
"text": "This In calculating the score of the cut (using the algorithm from Figure 1) , we obtain an average number of three correctly placed mentions."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-119",
"text": "This can be verified intuitively on the drawing: mentions 1, 2 and 5 are correctly placed, while 3 and 4 are not."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-120",
"text": "The score of this cut is therefore 3."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-121",
"text": "The second, the third and the fourth cuts of the phase, in Figures 7.b, 7.c and 7.d, have the scores 4, 5 and 3.5 respectively."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-122",
"text": "An interesting thing to note at the fourth cut is that the score is no longer an integer."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-123",
"text": "This happens because it is calculated as an average between corrects-avg = 4 and correctsmax = 3."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-125",
"text": "The average of the outgoing weights of mention 1 is 0.225, less than 0.5 (the default weight assigned to a single mention) therefore the first method declares it is correctly placed."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-126",
"text": "The second considers only the maximum; 0.6 is greater than 0.5, so the mention appears to be more strongly connected with the outside than the inside."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-127",
"text": "As we can see, the contradiction is because of the uneven distribution of the weights of the outgoing edges."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-128",
"text": "The first proposed cut is the cut with the great-"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-130",
"text": "**FRANKENSTEIN#2 OIL_TYCOON#1 WORKER#1**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-131",
"text": "Figure 8: Part of the hierarchy containing 42 WordNet equivalent concepts for the five entity types, with all their synonyms and hyponyms."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-132",
"text": "The hierarchy has 31,512 word-sense pairs in total est score, which is Cut 3 (Figure 7 .c)."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-133",
"text": "Because this is also the correct cut, all cuts proposed after this one will be ignored-the machine learning algorithm that was trained when to stop a cut will always declare against further cuts."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-134",
"text": "In the end, the cut returned by function BESTCUT is the correct one: it divides mentions Mary 1 and the girl 5 from mentions a brother 2 , John 3 and The boy 4 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-136",
"text": "**MENTION DETECTION**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-137",
"text": "Because our BESTCUT algorithm relies heavily on knowing entity types, we developed a method for recognizing entity types for nominal mentions."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-138",
"text": "Our statistical approach uses maximum entropy classification with a few simple lexical and syntactic features, making extensive use of WordNet (Fellbaum, 1998) hierarchy information."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-139",
"text": "We used the ACE corpus, which is annotated with mention and entity information, as data in a supervised machine learning method to detect nominal mentions and their entity types."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-140",
"text": "We assigned six entity types: PERSON, ORGANIZATION, LOCA-TION, FACILITY, GPE and UNK (for those who are in neither of the former categories) and two genericity outcomes: GENERIC and SPECIFIC."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-141",
"text": "We only considered the intended value of the mentions from the corpus."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-142",
"text": "This was motivated by the fact that we need to classify mentions according to the context in which they appear, and not in a general way."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-143",
"text": "Only contextual information is useful further in coreference resolution."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-144",
"text": "We have experimentally discovered that the use of word sense disambiguation improves the performance tremendously (a boost in score of 10%), therefore all the features use the word senses from a previously-applied word sense disambiguation program, taken from (Mihalcea and Csomai, 2005) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-145",
"text": "For creating training instances, we associated an outcome to each markable (NP) detected in the training files: the markables that were present in the key files took their outcome from the key file annotation, while all the other markables were associated with outcome UNK."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-146",
"text": "We then created a training example for each of the markables, with the feature vector described below and as target function the outcome."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-147",
"text": "The aforementioned outcome can be of three different types."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-170",
"text": "We experimented on the ACE Phase 2 (NIST, 2003) and MUC6 (MUC-6, 1995) corpora."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-148",
"text": "The first type of outcome that we tried was the entity type (one member of the set PERSON, ORGANIZATION, LO-CATION, FACILITY, GPE and UNK); the second type was the genericity information (GENERIC or SPECIFIC), whereas the third type was a combination between the two (pairwise combinations of the entity types set and the genericity set, e.g. PERSON SPECIFIC)."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-149",
"text": "The feature set consists of WordNet features, lexical features, syntactic features and intelligent context features, briefly described in Table 3 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-150",
"text": "With the WordNet features we introduce the WordNet equivalent concept."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-151",
"text": "A WordNet equivalent concept for an entity type is a word-sense pair from WordNet whose gloss is compatible with the definition of that entity type."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-152",
"text": "Figure 8 enumerates a few WordNet equivalent concepts for entity class PERSON (e.g. CHARACTER#1), with their hierarchy of hyponyms (e.g. Frankenstein#2)."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-153",
"text": "The lexical feature is useful because some words are almost always of a certain type (e.g. \"company\")."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-154",
"text": "The intelligent context set of features are an improvement on basic context features that use the stems of the words that are within a window of a certain size around the word."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-155",
"text": "In addition to this set of features, we created more features by combining them into pairs."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-156",
"text": "Each pair contains two features from two different classes."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-157",
"text": "For instance, we will have features like: is-a- true if the mention is a modifier to a TYPE mention in-apposition-with TYPE of the mention our mention is in apposition with intelligent context all-mods the nominal, adjectival and pronominal modifiers in the mention's parse tree preps the prepositions right before and after the mention's parse tree Table 3 : The features for the mention detection system."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-159",
"text": "**PERSON\u223cIN-APPOSITION-WITH(PERSON).**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-160",
"text": "All these features apply to the \"true head\" of a noun phrase, i.e. if the noun phrase is a partitive construction (\"five students\", \"a lot of companies\", \"a part of the country\"), we extract the \"true head\", the whole entity that the part was taken out of (\"students\", \"companies\", \"country\") , and apply the features to that \"true head\" instead of the partitive head."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-161",
"text": "For combining the mention detection module with the BESTCUT coreference resolver, we also generated classifications for Named Entities and pronouns by using the same set of features minus the WordNet ones (which only apply to nominal mentions)."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-162",
"text": "For the Named Entity classifier, we added the feature Named-Entity-type as obtained by the Named Entity Recognizer."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-163",
"text": "We generated a list of all the markable mentions and their entity types and presented it as input to the BEST-CUT resolver instead of the list of perfect mentions."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-164",
"text": "Note that this mention detection does not contain complete anaphoricity information."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-165",
"text": "Only the mentions that are a part of the five considered classes are treated as anaphoric and clustered, while the UNK mentions are ignored, even if an outside anaphoricity classifier might categorize some of them as anaphoric."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-166",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-167",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-168",
"text": "The clusterization algorithms that we implemented to evaluate in comparison with our method are (Luo et al., 2004) 's Belltree and Link-Best (best-first clusterization) from (Ng and Cardie, 2002) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-169",
"text": "The features used were described in section 2.2."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-171",
"text": "Since we aimed to measure the performance of coreference, the metrics used for evaluation are the ECM-F (Luo et al., 2004) and the MUC P, R and F scores (Vilain et al., 1995) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-172",
"text": "In our first experiment, we tested the three coreference clusterization algorithms on the development-test set of the ACE Phase 2 corpus, first on true mentions (i.e. the mentions annotated in the key files), then on detected mentions (i.e. the mentions output by our mention detection system presented in section 3) and finally without any prior knowledge of the mention types."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-173",
"text": "The results obtained are tabulated in Table 4 ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-174",
"text": "As can be observed, when it has prior knowledge of the mention types BESTCUT performs significantly better than the other two systems in the ECM-F score and slightly better in the MUC metrics."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-175",
"text": "The more knowledge it has about the mentions, the better it performs."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-176",
"text": "This is consistent with the fact that the first stage of the algorithm divides the graph into subgraphs corresponding to the five entity types."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-177",
"text": "If BESTCUT has no information about the mentions, its performance ranks significantly under the LinkBest and Belltree algorithms in ECM-F and MUC R. Surprisingly enough, the Belltree algorithm, a globally optimized algorithm, performs similarly to Link-Best in most of the scores."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-178",
"text": "Despite not being as dramatically affected as BESTCUT, the other two algorithms also decrease in performance with the decrease of the mention information available, which empirically proves that mention detection is a very important module for coreference resolution."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-179",
"text": "Even with an F-score of 77.2% for detecting entity types, our mention detection system boosts the scores of all three algorithms when compared to the case where no information is available."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-180",
"text": "It is apparent that the MUC score does not vary significantly between systems."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-181",
"text": "This only shows that none of them is particularly poor, but it is not a relevant way of comparing methods-the MUC metric has been found too indulgent by researchers ( (Luo et al., 2004) , (Baldwin et al., 1998) Table 4 : Comparison of results between three clusterization algorithms on ACE Phase 2."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-182",
"text": "The learning algorithms are maxent for coreference and SVM for stopping the cut in BESTCUT."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-183",
"text": "In turn, we obtain the mentions from the key files, detect them with our mention detection algorithm or do not use any information about them."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-184",
"text": "annotation keys and the system output, while the ECM-F metric aligns the detected entities with the key entities so that the number of common mentions is maximized."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-185",
"text": "The ECM-F scorer overcomes two shortcomings of the MUC scorer: not considering single mentions and treating every error as equally important (Baldwin et al., 1998) , which makes the ECM-F a more adequate measure of coreference."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-186",
"text": "Our second experiment evaluates the impact that the different categories of our added features have on the performance of the BESTCUT system."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-187",
"text": "The experiment was performed with a maxent classifier on the MUC6 corpus, which was priorly converted into ACE format, and employed mention information from the key annotations."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-188",
"text": "Table 5 : Impact of feature categories on BEST-CUT on MUC6."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-189",
"text": "Baseline system has the (Luo et al., 2004) features."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-190",
"text": "The system was tested on key mentions."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-192",
"text": "**MUC SCORE MODEL**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-193",
"text": "From Table 5 we can observe that the lexical features (head-match, type-pair, name-alias) have the most influence on the ECM-F and MUC scores, succeeded by the syntactic features (samegoverning-category, path, coll-comm) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-194",
"text": "Despite what intuition suggests, the improvement the grammatical feature gn-agree brings to the system is very small."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-196",
"text": "**RELATED WORK**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-197",
"text": "It is of interest to discuss why our implementation of the Belltree system (Luo et al., 2004 ) is comparable in performance to Link-Best (Ng and Cardie, 2002) ."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-198",
"text": "(Luo et al., 2004) do the clusterization through a beam-search in the Bell tree using either a mention-pair or an entity-mention model, the first one performing better in their experiments."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-199",
"text": "Despite the fact that the Bell tree is a complete representation of the search space, the search in it is optimized for size and time, while potentially losing optimal solutions-similarly to a Greedy search."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-200",
"text": "Moreover, the fact that the two implementations are comparable is not inconceivable once we consider that (Luo et al., 2004 ) never compared their system to another coreference resolver and reported their competitive results on true mentions only."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-201",
"text": "(Ng, 2005) treats coreference resolution as a problem of ranking candidate partitions generated by a set of coreference systems."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-202",
"text": "The overall performance of the system is limited by the performance of its best component."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-203",
"text": "The main difference between this approach and ours is that (Ng, 2005) 's approach takes coreference resolution one step further, by comparing the results of multiple systems, while our system is a single resolver; furthermore, he emphasizes the global optimization of ranking clusters obtained locally, whereas our focus is on globally optimizing the clusterization method inside the resolver."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-204",
"text": "( DaumeIII and Marcu, 2005a) use the Learning as Search Optimization framework to take into account the non-locality behavior of the coreference features."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-205",
"text": "In addition, the researchers treat mention detection and coreference resolution as a joint problem, rather than a pipeline approach like we do."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-206",
"text": "By doing so, it may be easier to detect the entity type of a mention once we have additional clues (expressed in terms of coreference features) about its possible antecedents."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-207",
"text": "For example, labeling Washington as a PERSON is more probable after encountering George Washington previously in the text."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-208",
"text": "However, the coreference problem does not immediately benefit from the joining."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-210",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-211",
"text": "We have proposed a novel coreference clusterization method that takes advantage of the efficiency and simplicity of graph algorithms."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-212",
"text": "The approach is top-down and globally optimized, and takes into account cataphora resolution in addition to anaphora resolution."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-213",
"text": "Our system compares favorably to two other implemented coreference systems and achieves state-of-the-art performance on the ACE Phase 2 corpus on true and detected mentions."
},
{
"sent_id": "4cb16f436d910d82c3661052c1fa30-C001-214",
"text": "We have also briefly described our mention detection system whose output we used in conjunction with the BESTCUT coreference system to achieve better results than when no mention information was available."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"4cb16f436d910d82c3661052c1fa30-C001-9"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-15",
"4cb16f436d910d82c3661052c1fa30-C001-16"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-20",
"4cb16f436d910d82c3661052c1fa30-C001-21",
"4cb16f436d910d82c3661052c1fa30-C001-22"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-180",
"4cb16f436d910d82c3661052c1fa30-C001-181"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-198",
"4cb16f436d910d82c3661052c1fa30-C001-199"
]
],
"cite_sentences": [
"4cb16f436d910d82c3661052c1fa30-C001-9",
"4cb16f436d910d82c3661052c1fa30-C001-15",
"4cb16f436d910d82c3661052c1fa30-C001-22",
"4cb16f436d910d82c3661052c1fa30-C001-181"
]
},
"@MOT@": {
"gold_contexts": [
[
"4cb16f436d910d82c3661052c1fa30-C001-15",
"4cb16f436d910d82c3661052c1fa30-C001-16"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-198",
"4cb16f436d910d82c3661052c1fa30-C001-199"
]
],
"cite_sentences": [
"4cb16f436d910d82c3661052c1fa30-C001-15"
]
},
"@DIF@": {
"gold_contexts": [
[
"4cb16f436d910d82c3661052c1fa30-C001-20",
"4cb16f436d910d82c3661052c1fa30-C001-21",
"4cb16f436d910d82c3661052c1fa30-C001-22"
]
],
"cite_sentences": [
"4cb16f436d910d82c3661052c1fa30-C001-22"
]
},
"@EXT@": {
"gold_contexts": [
[
"4cb16f436d910d82c3661052c1fa30-C001-20",
"4cb16f436d910d82c3661052c1fa30-C001-21",
"4cb16f436d910d82c3661052c1fa30-C001-22"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-51",
"4cb16f436d910d82c3661052c1fa30-C001-52",
"4cb16f436d910d82c3661052c1fa30-C001-53",
"4cb16f436d910d82c3661052c1fa30-C001-54"
]
],
"cite_sentences": [
"4cb16f436d910d82c3661052c1fa30-C001-22",
"4cb16f436d910d82c3661052c1fa30-C001-51",
"4cb16f436d910d82c3661052c1fa30-C001-54"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"4cb16f436d910d82c3661052c1fa30-C001-27"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-56"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-197"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-200"
]
],
"cite_sentences": [
"4cb16f436d910d82c3661052c1fa30-C001-27",
"4cb16f436d910d82c3661052c1fa30-C001-197",
"4cb16f436d910d82c3661052c1fa30-C001-200"
]
},
"@USE@": {
"gold_contexts": [
[
"4cb16f436d910d82c3661052c1fa30-C001-48"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-168"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-171"
],
[
"4cb16f436d910d82c3661052c1fa30-C001-189",
"4cb16f436d910d82c3661052c1fa30-C001-190"
]
],
"cite_sentences": [
"4cb16f436d910d82c3661052c1fa30-C001-48",
"4cb16f436d910d82c3661052c1fa30-C001-168",
"4cb16f436d910d82c3661052c1fa30-C001-171",
"4cb16f436d910d82c3661052c1fa30-C001-189"
]
}
}
},
"ABC_715cba53c376e50b76a0966ff16a6a_5": {
"x": [
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-96",
"text": "**SEMI-SUPERVISED LEARNING**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-44",
"text": "**SENTIMENT CLASSIFICATION**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-194",
"text": "For the other four data sets, the ADN structure is 50-50-200-2."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-94",
"text": "After training, we can determine y by the trained ADN while a new sample x is fed."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-2",
"text": "This paper presents a novel semisupervised learning algorithm called Active Deep Networks (ADN), to address the semi-supervised sentiment classification problem with active learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-3",
"text": "First, we propose the semi-supervised learning method of ADN."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-4",
"text": "ADN is constructed by Restricted Boltzmann Machines (RBM) with unsupervised learning using labeled data and abundant of unlabeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-5",
"text": "Then the constructed structure is finetuned by gradient-descent based supervised learning with an exponential loss function."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-6",
"text": "Second, we apply active learning in the semi-supervised learning framework to identify reviews that should be labeled as training data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-7",
"text": "Then ADN architecture is trained by the selected labeled data and all unlabeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-8",
"text": "Experiments on five sentiment classification datasets show that ADN outperforms the semi-supervised learning algorithm and deep learning techniques applied for sentiment classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-11",
"text": "In recent years, sentiment analysis has received considerable attentions in Natural Language Processing (NLP) community (Blitzer et al., 2007; Dasgupta and Ng, 2009; Pang et al., 2002) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-12",
"text": "Polarity classification, which determine whether the sentiment expressed in a document is positive or negative, is one of the most popular tasks of sentiment analysis (Dasgupta and Ng, 2009 )."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-13",
"text": "Sentiment classification is a special type of text categorization, where the criterion of classification is the attitude expressed in the text, rather than the subject or topic."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-14",
"text": "Labeling the reviews with their sentiment would provide succinct summaries to readers, which makes it possible to focus the text mining on areas in need of improvement or on areas of success (Gamon, 2004) and is helpful in business intelligence applications, recommender systems, and message filtering (Pang, et al., 2002) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-15",
"text": "While topics are often identifiable by keywords alone, sentiment classification appears to be a more challenge task (Pang, et al., 2002) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-16",
"text": "First, sentiment is often conveyed with subtle linguistic mechanisms such as the use of sarcasm and highly domain-specific contextual cues (Li et al., 2009 )."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-17",
"text": "For example, although the sentence \"The thief tries to protect his excellent reputation\" contains the word \"excellent\", it tells us nothing about the author's opinion and in fact could be well embedded in a negative review."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-18",
"text": "Second, sentiment classification systems are typically domain-specific, which makes the expensive process of annotating a large amount of data for each domain and is a bottleneck in building high quality systems (Dasgupta and Ng, 2009 )."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-19",
"text": "This motivates the task of learning robust sentiment models from minimal supervision (Li, et al., 2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-20",
"text": "Recently, semi-supervised learning, which uses large amount of unlabeled data together with labeled data to build better learners (Raina et al., 2007; Zhu, 2007) , has drawn more attention in sentiment analysis (Dasgupta and Ng, 2009; Li, et al., 2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-21",
"text": "As argued by several researchers (Bengio, 2007; Salakhutdinov and Hinton, 2007) , deep architecture, composed of multiple levels of non-linear operations (Hinton et al., 2006) , is expected to perform well in semi-supervised learning because of its capability of modeling hard artificial intelligent tasks."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-22",
"text": "Deep Belief Networks (DBN) is a representative deep learning algorithm achieving notable success for semi-supervised learning (Hinton, et al., 2006) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-23",
"text": "Ranzato and Szummer (2008) propose an algorithm to learn text document representations based on semi-supervised auto-encoders that are combined to form a deep network."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-24",
"text": "Active learning is another way that can minimize the number of required labeled data while getting competitive result."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-25",
"text": "Usually, the training set is chosen randomly."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-26",
"text": "However, active learning choose the training data actively, which reduce the needs of labeled data (Tong and Koller, 2002) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-27",
"text": "Recently, active learning had been applied in sentiment classification (Dasgupta and Ng, 2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-28",
"text": "Inspired by the study of semi-supervised learning, active learning and deep architecture, this paper proposes a novel semi-supervised polarity classification algorithm called Active Deep Networks (ADN) that is based on a representative deep learning algorithm Deep Belief Networks (DBN) (Hinton, et al., 2006) and active learning (Tong and Koller, 2002) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-29",
"text": "First, we propose the ADN architecture, which utilizes a new deep architecture for classification, and an exponential loss function aiming to maximize the separability of the classifier."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-30",
"text": "Second, we propose the ADN algorithm."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-31",
"text": "It firstly identifies a small number of manually labeled reviews by an active learner, and then trains the ADN classifier with the identified labeled data and all of the unlabeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-32",
"text": "Our paper makes several important contributions: First, this paper proposes a novel ADN architecture that integrates the abstraction ability of deep belief nets and the classification ability of backpropagation strategy."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-33",
"text": "It improves the generalization capability by using abundant unlabeled data, and directly optimizes the classification results in training dataset using back propagation strategy, which makes it possible to achieve attractive classification performance with few labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-34",
"text": "Second, this paper proposes an effective active learning method that integrates the labeled data selection ability of active learning and classification ability of ADN architecture."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-35",
"text": "Moreover, the active learning is also based on the ADN architecture, so the labeled data selector and the classifier are based on the same architecture, which provides an unified framework for semi-supervised classification task."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-36",
"text": "Third, this paper applies semisupervised learning and active learning to sentiment classification successfully and gets competitive performance."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-37",
"text": "Our experimental results on five sentiment classification datasets show that ADN outperforms previous sentiment classification methods and deep learning methods."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-38",
"text": "The rest of the paper is organized as follows."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-39",
"text": "Section 2 gives an overview of sentiment classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-40",
"text": "The proposed semi-supervised learning method ADN is described in Section 3."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-41",
"text": "Section 4 shows the empirical validation of ADN by comparing its classification performance with previous sentiment classifiers and deep learning methods on sentiment datasets."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-42",
"text": "The paper is closed with conclusion."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-45",
"text": "Sentiment classification can be performed on words, sentences or documents, and is generally categorized into lexicon-based and corpus-based classification method (Wan, 2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-46",
"text": "The detailed survey about techniques and approaches of sentiment classification can be seen in the book (Pang and Lee, 2008) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-47",
"text": "In this paper we focus on corpus-based classification method."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-48",
"text": "Corpus-based methods use a labeled corpus to train a sentiment classifier (Wan, 2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-49",
"text": "Pang et al. (2002) apply machine learning approach to corpus-based sentiment classification firstly."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-50",
"text": "They found that standard machine learning techniques outperform human-produced baselines."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-51",
"text": "Pang and Lee (2004) apply text-categorization techniques to the subjective portions of the sentiment document."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-52",
"text": "These portions are extracted by efficient techniques for finding minimum cuts in graphs."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-53",
"text": "Gamon (2004) demonstrate that using large feature vectors in combination with feature reduction, high accuracy can be achieved in the very noisy domain of customer feedback data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-54",
"text": "Xia et al. (2008) propose the sentiment vector space model to represent song lyric document, assign the sentiment labels such as light-hearted and heavy-hearted."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-55",
"text": "Supervised sentiment classification systems are domain-specific and annotating a large scale corpus for each domain is very expensive (Dasgupta and Ng, 2009 )."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-56",
"text": "There are several solutions for this corpus annotation bottleneck."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-57",
"text": "The first type of solution is using old domain labeled examples to new domain sentiment clas-sification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-58",
"text": "Blitzer et al. (2007) investigate domain adaptation for sentiment classifiers, which could be used to select a small set of domains to annotate and their trained classifiers would transfer well to many other domains."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-59",
"text": "Li and Zong (2008) study multi-domain sentiment classification, which aims to improve performance through fusing training data from multiple domains."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-60",
"text": "The second type of solution is semisupervised sentiment classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-61",
"text": "Sindhwani and Melville (2008) propose a semi-supervised sentiment classification algorithm that utilizes lexical prior knowledge in conjunction with unlabeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-62",
"text": "Dasgupta and Ng (2009) firstly mine the unambiguous reviews using spectral techniques, and then exploit them to classify the ambiguous reviews via a novel combination of active learning, transductive learning, and ensemble learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-63",
"text": "The third type of solution is unsupervised sentiment classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-64",
"text": "Zagibalov and Carroll (2008) describe an automatic seed word selection for unsupervised sentiment classification of product reviews in Chinese."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-65",
"text": "However, unsupervised learning of sentiment is difficult, partially because of the prevalence of sentimentally ambiguous reviews (Dasgupta and Ng, 2009 )."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-66",
"text": "Using a multi-domain sentiment corpus for sentiment classification is also hard to apply, because each domain has a very limited amount of training data, since annotating a large corpus is difficult and time-consuming (Li and Zong, 2008)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-67",
"text": "So in this paper we focus on a semi-supervised approach to sentiment classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-69",
"text": "**ACTIVE DEEP NETWORKS**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-70",
"text": "In this part, we propose a semi-supervised learning algorithm, Active Deep Networks (ADN), to address the sentiment classification problem with active learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-71",
"text": "Section 3.1 formulates the ADN problem."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-72",
"text": "Section 3.2 proposes the semi-supervised learning of ADN without active learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-73",
"text": "Section 3.3 proposes the active learning method of ADN."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-74",
"text": "Section 3.4 gives the ADN procedure."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-76",
"text": "**PROBLEM FORMULATION**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-77",
"text": "There are many review documents in the dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-78",
"text": "We preprocess the reviews to be classified in a manner similar to Dasgupta and Ng (2009)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-79",
"text": "Each review is represented as a vector of unigrams, using a binary weight equal to 1 for terms present in the review."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-80",
"text": "Moreover, the punctuations, numbers, and words of length one are removed from the vector."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-81",
"text": "Finally, we sort the vocabulary by document frequency and remove the top 1.5%."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-82",
"text": "This is because many of these high-document-frequency words are stopwords or domain-specific general-purpose words."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-83",
"text": "After preprocessing, every review can be represented by a vector."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-84",
"text": "Then the dataset can be represented as a matrix:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-85",
"text": "where R is the number of training samples, T is the number of test samples, and D is the number of feature words in the dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-86",
"text": "Every column of X corresponds to a sample x, which is a representation of a review."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-87",
"text": "A sample that has all features is viewed as a vector in R^D, where the i-th coordinate corresponds to the i-th feature."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-88",
"text": "The L labeled samples are chosen randomly from the R training samples, or chosen actively by active learning, which can be seen as:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-89",
"text": "where S is the index of selected training reviews to be labeled manually."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-90",
"text": "Let Y be the set of labels corresponding to the L labeled training samples, denoted as:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-91",
"text": "where C is the number of classes."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-92",
"text": "We intend to seek the mapping function"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-93",
"text": "X -> Y using the L labeled data and the R+T-L unlabeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-97",
"text": "To address the problem formulated in Section 3.1, we propose a novel deep architecture for the ADN method, as shown in Figure 1, with parameter space W = {w_1, w_2, ..., w_N} for the deep architecture."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-98",
"text": "The semi-supervised learning method based on the ADN architecture can be divided into two stages: First, the ADN architecture is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-99",
"text": "All the unlabeled data, together with the L labeled data, are utilized to find the parameter space W with N layers."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-100",
"text": "Second, the ADN architecture is trained according to the exponential loss function using the gradient descent method."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-101",
"text": "The parameter space W is retrained with an exponential loss function using the L labeled data. The derivative of the log-likelihood with respect to the model parameter w_k can be obtained by the contrastive divergence (CD) method (Hinton, 2002). The above discussion is based on training the parameters between two hidden layers with one sample x. For unsupervised learning, we construct the deep architecture using all the labeled and unlabeled data, inputting them one by one from layer h_0 and training the parameters between h_0 and h_1."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-102",
"text": "Once h_1 is constructed, we can use it to construct the layer above it, h_2."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-103",
"text": "The deep architecture is constructed layer by layer from bottom to top, and each time the parameters w_k are trained on the data computed in the (k-1)-th layer."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-104",
"text": "According to the w_k calculated above, the layer h_k can be obtained as follows when a sample x is fed from layer h_0:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-105",
"text": "The parameter space w_N is initialized randomly, just as in the backpropagation algorithm."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-106",
"text": "Then the ADN architecture is constructed."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-107",
"text": "The top hidden layer is formulated as:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-108",
"text": "For supervised learning, the ADN architecture is trained by L labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-109",
"text": "The optimization problem is formulated as:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-110",
"text": "and the loss function is defined as"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-111",
"text": "In the supervised learning stage, the stochastic activities are replaced by deterministic, real-valued probabilities."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-112",
"text": "We use gradient-descent through the whole deep architecture to retrain the weights for optimal classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-114",
"text": "**ACTIVE LEARNING**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-115",
"text": "Semi-supervised learning allows us to classify reviews with few labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-116",
"text": "However, annotating the reviews manually is expensive, so we want to get higher performance with fewer labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-117",
"text": "Active learning can help to choose the reviews that should be labeled manually in order to achieve higher classification performance with the same number of labeled samples."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-118",
"text": "For this purpose, we incorporate pool-based active learning into the ADN method, which has access to a pool of unlabeled instances and requests the labels of some of them (Tong and Koller, 2002)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-119",
"text": "Given an unlabeled pool X_R and an initial labeled data set X_L (one positive, one negative), the ADN architecture h_N will decide which instance in X_R to query next."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-120",
"text": "Then the parameters of h_N are adjusted after new reviews are labeled and inserted into the labeled data set."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-121",
"text": "The main issue for an active learner is choosing the next unlabeled instance to query."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-122",
"text": "In this paper, we choose the reviews whose labels are most uncertain for the classifier."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-123",
"text": "Following previous work on active learning for SVMs (Dasgupta and Ng, 2009; Tong and Koller, 2002) , we define the uncertainty of a review as its distance from the separating hyperplane."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-124",
"text": "In other words, reviews that are near the separating hyperplane are chosen as the labeled training data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-125",
"text": "After semi-supervised learning, the parameters of ADN are adjusted."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-126",
"text": "Given an unlabeled pool X_R, the next unlabeled instance to be queried is given by:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-127",
"text": "The selected training reviews to be labeled manually are given by:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-128",
"text": "We can select a group of the most uncertain reviews to label each time."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-129",
"text": "The experimental setting is similar to that of Dasgupta and Ng (2009)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-130",
"text": "We perform active learning for five iterations and select twenty of the most uncertain reviews to be queried each time."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-131",
"text": "Then the ADN is re-trained on all of the labeled and unlabeled reviews so far with semi-supervised learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-132",
"text": "Finally, we can decide the label of a review x according to the output h_N(x) of the ADN architecture as follows:"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-133",
"text": "As shown by Tong and Koller (2002), the BalanceRandom method, which randomly samples an equal number of positive and negative instances from the pool, has much better performance than the regular random method."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-134",
"text": "So we incorporate this \"Balance\" idea into the ADN method."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-135",
"text": "However, choosing an equal number of positive and negative instances without labeling the entire pool of instances in advance may not be practicable."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-136",
"text": "So we present a simple way to approximate the balance of positive and negative reviews."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-137",
"text": "First, we count the number of positive and negative labeled samples respectively."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-138",
"text": "Second, in each iteration, we classify the unlabeled reviews in the pool and choose the appropriate numbers of positive and negative reviews to keep them equal."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-140",
"text": "**ADN PROCEDURE**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-141",
"text": "The procedure of ADN is shown in Figure 2 ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-142",
"text": "For the training of the ADN architecture, the parameters are randomly initialized from a normal distribution."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-143",
"text": "All the training data and test data are used to train the ADN with unsupervised learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-144",
"text": "The training set X R can be seen as an unlabeled pool."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-145",
"text": "We randomly select one positive and one negative review from the pool as the initial labeled training set used for supervised learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-146",
"text": "The number of units in the hidden layers D_1 to D_N and the number of epochs Q are set manually based on the dimension of the input data and the size of the training dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-147",
"text": "The number of iterations I and the number G of actively chosen samples per iteration can be set manually based on the number of labeled data in the experiment."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-148",
"text": "In each iteration, the ADN architecture is first trained on all the unlabeled data and the labeled data obtained so far, with unsupervised learning followed by supervised learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-149",
"text": "Then we choose G reviews from the unlabeled pool based on their distance from the separating line."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-150",
"text": "Finally, these reviews are labeled manually and added to the labeled data set."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-151",
"text": "For the next iteration, the ADN architecture can be trained on the new labeled data set."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-152",
"text": "Finally, the ADN architecture is retrained on all the unlabeled data and the existing labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-153",
"text": "After training, the ADN architecture is tested based on Equation (21)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-154",
"text": "The proposed ADN method can actively choose the data to be labeled and classify the data with the same architecture, which avoids the mismatch between choosing and training with different architectures."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-155",
"text": "More importantly, the parameters of ADN are trained iteratively during the labeled data selection process, which improves the performance of ADN."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-156",
"text": "In the ADN training process, the unsupervised learning stage abstracts the reviews, while the supervised learning stage trains ADN to map samples belonging to different classes into different regions."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-157",
"text": "We combine unsupervised and supervised learning, and train the parameter space of ADN iteratively."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-158",
"text": "The proper data that should be labeled are chosen in each iteration, which improves the classification performance of ADN."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-160",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-162",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-163",
"text": "We evaluate the performance of the proposed ADN method using five sentiment classification datasets."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-164",
"text": "The first dataset is MOV (Pang, et al., 2002) , which is a widely-used movie review dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-165",
"text": "The other four datasets contain reviews of four different types of products: books (BOO), DVDs (DVD), electronics (ELE), and kitchen appliances (KIT) (Blitzer, et al., 2007; Dasgupta and Ng, 2009)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-166",
"text": "Each dataset includes 1,000 positive and 1,000 negative reviews."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-167",
"text": "Similar to Dasgupta and Ng (2009), we randomly divide the 2,000 reviews into ten equal-sized folds and test all the algorithms with cross-validation."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-168",
"text": "In each fold, 100 reviews are randomly selected as training data and the remaining 100 are used for testing."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-169",
"text": "Only the reviews in the training set are used for the selection of labeled data by active learning."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-170",
"text": "The ADN architecture has a different number of hidden units for each hidden layer."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-171",
"text": "For greedy"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-173",
"text": "**ACTIVE DEEP NETWORKS PROCEDURE**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-174",
"text": "Input: data X; number of units in every hidden layer D_1 to D_N; number of epochs Q; number of training data R; number of test data T; number of iterations I; number of actively chosen data per iteration G. Initialize: W = normally distributed random numbers; X_L = one positive and one negative review. for i = 1 to I"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-175",
"text": "Step 1."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-176",
"text": "Greedy layer-wise training of the hidden layers using RBMs: for n = 1 to N-1, for q = 1 to Q, for k = 1 to R+T: calculate the non-linear positive and negative phases according to (10) and (11)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-177",
"text": "Update the weights and biases by (13)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-178",
"text": "end for end for end for"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-179",
"text": "Step 2."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-180",
"text": "Supervised learning of the ADN with gradient descent: minimize f(h"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-181",
"text": ", update the parameter space W according to (16)."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-182",
"text": "Step 3."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-183",
"text": "Choose instances for the labeled data set: choose G instances near the separating line by (20); add the G instances into the labeled data set X_L. end. Train ADN with Step 1 and Step 2."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-184",
"text": "layer-wise unsupervised learning, we train the weights of each layer independently with a fixed number of epochs equal to 30 and a learning rate of 0.1."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-185",
"text": "The initial momentum is 0.5 and after 5 epochs, the momentum is set to 0.9."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-186",
"text": "For supervised learning, we run 10 epochs, and three line searches are performed in each epoch."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-187",
"text": "We compare the classification performance of ADN with five representative classifiers, i.e., Semi-supervised spectral learning (Spectral) (Kamvar et al., 2003) , Transductive SVM (TSVM), Active learning (Active) (Tong and Koller, 2002) , Mine the Easy Classify the Hard (MECH) (Dasgupta and Ng, 2009) , and Deep Belief Networks (DBN) (Hinton, et al., 2006) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-188",
"text": "Spectral learning, TSVM, and Active learning method are three baseline methods for sentiment classification."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-189",
"text": "MECH is a new semi-supervised method for sentiment classification (Dasgupta and Ng, 2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-190",
"text": "DBN (Hinton, et al., 2006) is a classical deep learning method proposed recently."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-192",
"text": "**ADN PERFORMANCE**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-193",
"text": "For the MOV dataset, the ADN structure used in this experiment is 100-100-200-2, which means the output layer has 2 units and the three hidden layers have 100, 100, and 200 units respectively."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-195",
"text": "The number of units in the input layer is the same as the feature dimension of each dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-196",
"text": "All these parameters are set based on the dimension of the input data and the scale of the dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-197",
"text": "Because the vocabulary of the MOV dataset is larger than that of the other four datasets, the first two hidden layers for the MOV dataset have more units than those for the other datasets."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-198",
"text": "We perform active learning for 5 iterations."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-199",
"text": "In each iteration, we select and label 20 of the most uncertain points, and then re-train the ADN on all of the unlabeled data and labeled data annotated so far."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-200",
"text": "After 5 iterations, 100 labeled data are used for training."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-201",
"text": "The classification accuracies on test data in cross validation for five datasets and six methods are shown in Table 1 ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-202",
"text": "The results of previous four methods are reported by Dasgupta and Ng (2009) ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-203",
"text": "For the ADN method, the initial two labeled samples are selected randomly, so we repeat each fold thirty times and average the results."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-204",
"text": "Due to the randomness involved in the choice of labeled data, the results of the other five methods are obtained by repeating each fold ten times and averaging the results."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-205",
"text": "From Table 1, we can see that the performance of DBN is competitive with MECH."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-206",
"text": "Since MECH is a combination of spectral clustering, TSVM, and active learning, while DBN is just a classification method based on a deep neural network, this result demonstrates the strong learning ability of the deep architecture."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-207",
"text": "ADN is a combination of semi-supervised learning and active learning based on a deep architecture, and its performance is better than that of all five other methods on all five datasets."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-208",
"text": "This can be attributed to the following: First, ADN uses a new architecture to guide the output vectors of samples belonging to different classes into different regions of a new Euclidean space, which can abstract useful information that is not accessible to other learners; Second, ADN uses an exponential loss function to maximize the separability of labeled data in global refinement for better discriminability; Third, ADN fully exploits the embedding information from the large amount of unlabeled data to improve the robustness of the classifier; Fourth, ADN can actively choose the useful training data, which also improves the classification performance."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-210",
"text": "**EFFECT OF ACTIVE LEARNING**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-211",
"text": "To test the performance of our proposed active learning method, we conduct the following additional experiments."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-212",
"text": "Passive learning: We randomly select 100 reviews from the training fold and use them as labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-213",
"text": "Then the proposed semi-supervised learning method of ADN is used to train and test the performance."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-214",
"text": "Because of the randomness, we repeat each fold 30 times and average the results."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-215",
"text": "The test accuracies of passive learning for five datasets are shown in Table 2 ."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-216",
"text": "In comparison with the ADN method in Table 1, we can see that the proposed active learning method yields significantly better results than randomly chosen points, which demonstrates the effectiveness of the proposed active learning method."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-217",
"text": "Fully supervised learning: We train a fully supervised classifier using all 1,000 training reviews based on the ADN architecture; the results are also shown in Table 2."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-218",
"text": "Comparing with the ADN method in Table 1 , we can see that employing only 100 active learning points enables us to almost reach fully-supervised performance for three datasets."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-219",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-220",
"text": "**SEMI-SUPERVISED LEARNING WITH VARIANCE OF LABELED DATA**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-221",
"text": "To verify the performance of semi-supervised learning with different numbers of labeled data, we conduct another series of experiments on the five datasets and show the results in Figure 3."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-222",
"text": "We run ten-fold cross validation for each dataset."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-223",
"text": "Each fold is repeated ten times and the results are averaged."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-224",
"text": "We can see that ADN can achieve relatively high accuracy even using just 20 labeled reviews for training."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-225",
"text": "For most of the sentiment datasets, the test accuracy increases slowly as the number of labeled reviews grows."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-226",
"text": "This shows that ADN reaches good performance even with few labeled reviews."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-227",
"text": "----------------------------------"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-228",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-229",
"text": "This paper proposes a novel semi-supervised learning algorithm ADN to address the sentiment classification problem with a small number of labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-230",
"text": "ADN can choose the proper training data to be labeled manually, and fully exploits the embedding information from the large amount of unlabeled data to improve the robustness of the classifier."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-231",
"text": "We propose a new architecture to guide the output vectors of samples belonging to different classes into different regions of a new Euclidean space, and use an exponential loss function to maximize the separability of labeled data in global refinement for better discriminability."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-232",
"text": "Moreover, ADN can make the right decision about which training data should be labeled based on existing unlabeled and labeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-233",
"text": "By using unsupervised and supervised learning iteratively, ADN can choose the proper training data to be labeled and train the deep architecture at the same time."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-234",
"text": "Finally, the deep architecture is re-trained using the chosen labeled data and all the unlabeled data."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-235",
"text": "We also conduct experiments to verify the effectiveness of the ADN method with different numbers of labeled data, and demonstrate that ADN can reach very competitive classification performance using only a few labeled samples."
},
{
"sent_id": "715cba53c376e50b76a0966ff16a6a-C001-236",
"text": "These results show that the proposed ADN method, which needs only a few manually labeled reviews to reach relatively high accuracy, can be used to train a high-performance sentiment classification system."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"715cba53c376e50b76a0966ff16a6a-C001-12"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-18",
"715cba53c376e50b76a0966ff16a6a-C001-19"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-20",
"715cba53c376e50b76a0966ff16a6a-C001-21"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-27",
"715cba53c376e50b76a0966ff16a6a-C001-28"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-55",
"715cba53c376e50b76a0966ff16a6a-C001-56",
"715cba53c376e50b76a0966ff16a6a-C001-57",
"715cba53c376e50b76a0966ff16a6a-C001-60",
"715cba53c376e50b76a0966ff16a6a-C001-63"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-65",
"715cba53c376e50b76a0966ff16a6a-C001-66",
"715cba53c376e50b76a0966ff16a6a-C001-67"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-121",
"715cba53c376e50b76a0966ff16a6a-C001-122",
"715cba53c376e50b76a0966ff16a6a-C001-123",
"715cba53c376e50b76a0966ff16a6a-C001-124"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-187"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-189"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-202"
]
],
"cite_sentences": [
"715cba53c376e50b76a0966ff16a6a-C001-12",
"715cba53c376e50b76a0966ff16a6a-C001-18",
"715cba53c376e50b76a0966ff16a6a-C001-20",
"715cba53c376e50b76a0966ff16a6a-C001-27",
"715cba53c376e50b76a0966ff16a6a-C001-55",
"715cba53c376e50b76a0966ff16a6a-C001-65",
"715cba53c376e50b76a0966ff16a6a-C001-123",
"715cba53c376e50b76a0966ff16a6a-C001-187",
"715cba53c376e50b76a0966ff16a6a-C001-189",
"715cba53c376e50b76a0966ff16a6a-C001-202"
]
},
"@MOT@": {
"gold_contexts": [
[
"715cba53c376e50b76a0966ff16a6a-C001-18",
"715cba53c376e50b76a0966ff16a6a-C001-19"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-27",
"715cba53c376e50b76a0966ff16a6a-C001-28"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-55",
"715cba53c376e50b76a0966ff16a6a-C001-56",
"715cba53c376e50b76a0966ff16a6a-C001-57",
"715cba53c376e50b76a0966ff16a6a-C001-60",
"715cba53c376e50b76a0966ff16a6a-C001-63"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-65",
"715cba53c376e50b76a0966ff16a6a-C001-66",
"715cba53c376e50b76a0966ff16a6a-C001-67"
]
],
"cite_sentences": [
"715cba53c376e50b76a0966ff16a6a-C001-18",
"715cba53c376e50b76a0966ff16a6a-C001-27",
"715cba53c376e50b76a0966ff16a6a-C001-55",
"715cba53c376e50b76a0966ff16a6a-C001-65"
]
},
"@SIM@": {
"gold_contexts": [
[
"715cba53c376e50b76a0966ff16a6a-C001-77",
"715cba53c376e50b76a0966ff16a6a-C001-78"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-121",
"715cba53c376e50b76a0966ff16a6a-C001-122",
"715cba53c376e50b76a0966ff16a6a-C001-123",
"715cba53c376e50b76a0966ff16a6a-C001-124"
]
],
"cite_sentences": [
"715cba53c376e50b76a0966ff16a6a-C001-78",
"715cba53c376e50b76a0966ff16a6a-C001-123"
]
},
"@EXT@": {
"gold_contexts": [
[
"715cba53c376e50b76a0966ff16a6a-C001-129",
"715cba53c376e50b76a0966ff16a6a-C001-130",
"715cba53c376e50b76a0966ff16a6a-C001-131",
"715cba53c376e50b76a0966ff16a6a-C001-132"
],
[
"715cba53c376e50b76a0966ff16a6a-C001-167",
"715cba53c376e50b76a0966ff16a6a-C001-168",
"715cba53c376e50b76a0966ff16a6a-C001-169",
"715cba53c376e50b76a0966ff16a6a-C001-170"
]
],
"cite_sentences": [
"715cba53c376e50b76a0966ff16a6a-C001-129",
"715cba53c376e50b76a0966ff16a6a-C001-167"
]
}
}
},
"ABC_1ac16c74cc5bb4099ae07f89d7f148_5": {
"x": [
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-2",
"text": "Although the vast majority of knowledge bases (KBs) are heavily biased towards English, Wikipedias do cover very different topics in different languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-3",
"text": "Exploiting this, we introduce a new multilingual dataset (X-WikiRE), framing relation extraction as a multilingual machine reading problem."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-4",
"text": "We show that by leveraging this resource it is possible to robustly transfer models cross-lingually and that multilingual support significantly improves (zero-shot) relation extraction, enabling the population of low-resourced KBs from their well-populated counterparts."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-7",
"text": "It is a widely lamented fact that linguistic and encyclopedic resources are heavily biased towards English."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-8",
"text": "Even multilingual knowledge bases (KBs) such as Wikidata (Vrande\u010di\u0107 and Kr\u00f6tzsch, 2014) are predominantly English-based (Kaffee and Simperl, 2018) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-9",
"text": "This means that coverage is higher for English, and that facts of interest to English-speaking communities are more likely included in a KB."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-10",
"text": "This work introduces a novel multilingual dataset (X-WikiRE) and explores techniques for automatically filling such language gaps by learning, from X-WikiRE, to add facts in other languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-11",
"text": "Finally, we show that multilingual sharing is beneficial for knowledge base completion across all languages, including English."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-12",
"text": "The task of identifying potential KB entries in running text -i.e., relations that hold between two or more entities, is called relation extraction (RE)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-13",
"text": "In the traditional, supervised setting (Bach and Badaskar, 2007) , RE models are trained to identify a pre-specified set of relation types, which are observed during training."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-14",
"text": "Models are meant to generalize to new entities, but not new relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-99",
"text": "**Q U E S T I O N**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-15",
"text": "An alternative flavor is open RE (Fader et al., 2011; Yates et al., 2007) , which detects subjectverb-object triples and clusters semantically related verbs into coarse-grained semantic relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-16",
"text": "In this paper, we consider the middle ground, in which models are trained on a subset of prespecified relations and applied to both seen and unseen entities, and unseen relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-17",
"text": "The latter scenario is known as zero-shot RE (Rockt\u00e4schel et al., 2015) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-18",
"text": "Levy et al. (2017) present a reformulation of RE, where the task is framed as reading comprehension."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-19",
"text": "In this formulation, each relation type (e.g. author, occupation) is mapped to at least one natural language question template (e.g. \"Who is the author of x?\"), where x is filled with an entity (e.g. \"Inferno\")."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-20",
"text": "The model is then tasked with finding an answer (\"Dante Alighieri\") to this question with respect to a given context."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-21",
"text": "They show that this formulation of the problem both outperforms off-the-shelf RE systems in the typical RE setting and, in addition, enables generalization to unspecified and unseen types of relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-22",
"text": "X-WikiRE enables exploration of this reformulation of RE in a multilingual setting."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-23",
"text": "Contributions We introduce a new, largescale multilingual dataset (X-WikiRE) of reading comprehension-based RE for English, German, French, Spanish, and Italian, facilitating research on multilingual methods for RE."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-24",
"text": "Our dataset covers more languages (five) and is at least an order of magnitude larger than existing multilingual RE datasets, e.g., TAC 2016 (Ellis et al., 2015) , which covers three languages and consists of \u2248 90k examples."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-25",
"text": "We also a) perform cross-lingual RE showing that models pretrained on one language can be effectively transferred to others with minimal in-language finetuning; b) leverage multilingual representations to train a model capable of simultaneously performing (zero-shot) RE in all five languages, rivaling or outperforming its monolingually trained counterparts in many cases while requiring far fewer parameters per language; c) obtain considerable improvements by employing a more carefully designed nil-aware machine comprehension model."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-27",
"text": "**P R O P E R T Y / L A N G U A G E O C C U P A T I O N L O C A T E D I N . . . D A T E O F B I R T H C O U N T R Y P L A C E O F B I R T H D A T E O F D E A T H C A S T ME MB E R C O U N T R Y O F C I T I Z E N S H I P P L A C E O F D E A T HP A R E N T T A X O N D EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE SF R I TD EE NE**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-29",
"text": "**BACKGROUND**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-30",
"text": "Relation extraction We begin with a brief description of our terminology."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-31",
"text": "Given raw text, relation extraction is the task of identifying instances of relations relation(entity 1 , entity 2 )."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-32",
"text": "We refer to these instances of relation and entity pairs as triples."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-33",
"text": "Furthermore, throughout this work, we use the term property interchangeably with relation."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-73",
"text": "Figure 1 shows the overlap in the number of triples between different languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-34",
"text": "A large part of previous work on relation extraction has been concerned with extracting relations between unseen entities for a pre-defined set of relations seen during training (Zelenko et al., 2003; Zhou et al., 2005; Miwa and Bansal, 2016) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-35",
"text": "For example, the instances (Barack Obama, Hawaii), (Niels Bohr, Copenhagen), and (Jacques Brel, Schaerbeek) of the relation born in(x, y) would be seen during the training phase, and then the model would be expected to correctly identify other instances of the relation such as (Jean-Paul Sartre, Paris) in running text."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-36",
"text": "This is useful in closeddomain settings where it is possible to pre-select a set of relations of interest."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-37",
"text": "In an open-domain setting, however, we are interested in the far more difficult problem of extracting unseen relation types."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-38",
"text": "Open RE methods (Yates et al., 2007; Fader et al., 2011) do not require relationspecific data, but treat different phrasings of the same relation as different relations and rely on a combination of syntactic features (e.g. dependency parses) and normalisation rules, and so have limited generalization capacity."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-39",
"text": "Zero-shot relation extraction Levy et al. (2017) propose a novel approach towards achieving this generalization by transforming relations into natural language question templates."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-40",
"text": "For instance, the relation born in(x, y) can be expressed as \"Where was x born?\" or \"In which place was x born?\"."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-41",
"text": "Then, a reading comprehension model (Seo et al., 2016; Chen et al., 2017) can be trained on question, answer, and context examples where the x slot is filled with an entity and the y slot is either an answer if the answer is present in the context, or NIL."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-42",
"text": "The model is then able to extract relation instances (given expressions of the relations as questions) from raw text."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-43",
"text": "To test this \"harsh zero-shot\" setting of relation extraction, they build a dataset for RE as machine comprehension from WikiReading (Hewlett et al., 2016) ing comprehension model is able to use linguistic cues to identify relation paraphrases and lexicosyntactic patterns of textual deviation from questions to answers, enabling it to identify instances of new relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-44",
"text": "Similar work (Obamuyide and Vlachos, 2018) recently also showed that RE can be framed as natural language inference."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-46",
"text": "**X-WIKIRE**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-47",
"text": "X-WikiRE is a multilingual reading comprehension-based relation extraction dataset."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-48",
"text": "Each example in the dataset consists of a question, a context, and an answer, where the question is a querified relation and the context may contain the answer or an indication that it is not present (NIL)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-49",
"text": "Questions are obtained by transforming relations into question templates with slots where an entity is inserted."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-50",
"text": "Within the RE framework described in Section 2, entity 1 is filled into a slot in the question template and entity 2 is the answer."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-51",
"text": "Each triple 1 in the dataset can be identified uniquely across all languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-52",
"text": "We construct X-WikiRE using the relevant parts of Wikidata and Wikipedia for each language."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-97",
"text": "First, we employ fastText embeddings (Bojanowski et al., 2017) mapped to a multilingual space in a supervised fashion (Conneau et al.,"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-53",
"text": "Wikidata is an open KB where the knowledge contained in each document is expressed as a set of statements, and each statement is a tuple (property id, value id) (e.g. statement (P50, Q1067) where P50 refers to author and Q1067 to \"Dante Alighieri\")."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-54",
"text": "We perform data integration on Wikidata, as described by Hewlett et al. (2016) : for each entity in Wikipedia we take the corresponding Wikidata document, add the Wikipedia page text, and denormalize the statements."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-55",
"text": "This consists of replacing the property and value ids of each statement in the document with the text label for values which are entities, and with the human readable form for numeric values (e.g. timestamps are converted to natural forms like \"25 May 1994\") obtaining a tuple (property, entity)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-56",
"text": "2 Slot-filling data To extract the contexts for each triple in our dataset we use the distant supervision method described by Levy et al. (2017) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-57",
"text": "For each Wikidata document belonging to a given entity 1 we take all the denormalized tuples (property, entity 2 ) and extract the first sentence in the text containing both entity 1 and entity 2 ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-58",
"text": "Negatives (contexts without answers) are constructed by finding pairs of triples with common entity 2 type (to ensure they contain good distractors), swapping their context if entity 2 is not present in the context of the other triple."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-59",
"text": "Querification Levy et al. (2017) created 1192 question templates for 120 Wikidata properties."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-60",
"text": "A template contains a placeholder for an entity x (e.g. for property \"author\", some templates are \"Who wrote the novel x?\" and \"Who is the author of x?\"), which can be automatically filled in to create questions so that question \u2248 template(property, x))."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-61",
"text": "For our multilingual dataset, we had these templates translated by human translators."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-62",
"text": "The translators attempted to translate each of the original 1192 templates."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-63",
"text": "If a template was difficult to translate, they were in- structed to discard it."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-64",
"text": "They were also instructed to create their own templates, paraphrasing the original ones when possible."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-65",
"text": "This resulted in a varying number of templates for each of the properties across languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-66",
"text": "In addition to the entity placeholder, some languages with richer morphology (Spanish, Italian, and German) required extra placeholders in the templates because of agreement phenomena (gender)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-67",
"text": "We added a placeholder for definite articles, as well as one for gender-dependent filler words."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-68",
"text": "The gender is automatically inferred from the Wikipedia page statistics and a few heuristics."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-69",
"text": "Table 1 shows the same example across five languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-70",
"text": "Table 2 shows the number of positive and negative triples and examples (i.e with and without consideration of the templates)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-71",
"text": "As expected (due to the size of its Wikidata), English has the highest number of triples for most properties."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-72",
"text": "However, as Figure 2 shows, there are properties where it has fewer triples than other languages (e.g. French has more triples for film related properties such as cast member and nominated f or)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-74",
"text": "While it can be seen that English, once again, has the highest overall overlap with the other languages, there are interesting deviations from this pattern where for certain properties other languages share a larger intersection (see Appendix A for examples)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-76",
"text": "**DATASET STATISTICS**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-78",
"text": "**METHOD**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-79",
"text": "In our framework, a machine comprehension model sees a question-context pair and is tasked with selecting an answer span within the context, or indicating that the context does not contain an answer (returning NIL)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-80",
"text": "This 'nil-awareness' goes beyond the traditional reading comprehension setup where it is not required."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-81",
"text": "It has, however, recently been incorporated into newer datasets (Trischler et al., 2017; Rajpurkar et al., 2018; Saha et al., 2018) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-82",
"text": "We employ the architecture described in Kundu and Ng (2018) as our standard reading comprehension model for all the experiments."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-83",
"text": "This nil-aware answer extraction framework (NAMANDA) is briefly described below."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-84",
"text": "In a set of initial trials (see Table 3), we found that this model far outperformed the bias-augmented BiDAF model (Seo et al., 2016) used by Levy et al. (2017) on their dataset."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-86",
"text": ""
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-87",
"text": "A Nil-aware machine comprehension model The reading comprehension model we employ, seen in Figure 3 , encodes the question and context sequences and computes a similarity matrix between them."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-88",
"text": "A column-wise softmax of the similarity matrix is multiplied with the question encoding to aggregate the most relevant parts of the question with respect to the context."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-89",
"text": "Next, a jointencoding of the question and context is created and a multi-factor self-attentive encoding is applied to accumulate evidence from the entire context."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-90",
"text": "These representations are called the evidence vectors."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-91",
"text": "Lastly, the evidence vectors are decomposed for every context word with orthogonal decomposition."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-92",
"text": "The parallel components represent the relevant parts of the context and the orthogonal parts represent the irrelevant parts."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-93",
"text": "These decompositions bias the decoder to either output a span or NIL."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-95",
"text": "**MULTILINGUAL REPRESENTATIONS**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-96",
"text": "We compare two methods of obtaining multilingual representations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-100",
"text": "C o n t e x t Mu l t i l i n g u a l R e p r e s e n t a t i o n S o u r c e L a n g u a g e NI L -A wa r e Q A Mo d e l T a r g e t L a n g u a g e NI L -A wa r e Q A Mo d e l L o r e m I p s u m (a) Cross-lingual model transfer."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-101",
"text": "In step (1), a source language model is trained until convergence."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-102",
"text": "In step (2), it is finetuned on a limited amount of target language data."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-104",
"text": ""
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-105",
"text": "[Figure 4b: (Answer, Question, Context) inputs for Language 1 through Language n feed a single multilingual representation]"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-106",
"text": ""
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-107",
"text": ". . . 2017)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-108",
"text": "Second, we employ the newly released multilingual BERT (Devlin et al., 2018) which is trained on the concatenation of the wikipedia corpora of 104 languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-109",
"text": "3 For BERT, we take the contexualized word representations from the final layer as input to our machine comprehension model's question and context Bi-LSTM encoders."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-110",
"text": "We do not fine-tune the pre-trained model."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-112",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-113",
"text": "Following Levy et al. (2017) , we distinguish between the traditional RE setting where the aim is to generalize to unseen entities (UnENT) and the zero-shot setting (UnREL) where the aim is to do so for unseen relation types (see Section 2)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-114",
"text": "Our goal is to answer these three questions: A) how well can RE models be transferred across languages? B) in the difficult UnREL setting, can the variance between languages in the number of instances of relations (see Figure 2 ) be exploited to enable more robust RE ? C) can one jointly-trained multilingual model which performs RE in multiple languages perform comparably to or outperform its individual monolingual counterparts?"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-115",
"text": "For all experiments, we take the multiple templates approach where a model sees different paraphrases of the same question during training."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-116",
"text": "This approach was shown by Levy et al. (2017) to have significantly better paraphrasing abilities than when only one question template or simpler relation descriptions are employed."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-117",
"text": "Evaluation Our evaluation methodology follows Levy et al. (2017) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-118",
"text": "We compute precision, recall and F1 by comparing spans predicted by the 3 https://github.com/google-research/ bert/blob/master/multilingual.md models with gold answers."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-119",
"text": "Precision is equal to the true positives divided by total number of nonnil answers predicted by a system."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-120",
"text": "Recall is equal to the true positives divided by the total number of instances that are non-nil in the ground truth answers."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-121",
"text": "Word order and punctuation are not considered."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-122",
"text": "4"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-124",
"text": "**MONOLINGUAL BASELINES**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-125",
"text": "A baseline model is trained on the full monolingual training set (1 million instances) for each of the languages in both the UnENT and UnREL settings, which serve as a point of comparison for the cross-lingual transfer and multilingual models."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-126",
"text": "Comparison with Levy et al. (2017) In Table 3 , the comparison between the nil-aware machine comprehension framework we employ (Mono) and the results reported by Levy et al. (2017) using the bias-augmented BiDAF model on their dataset (and splits) can be seen."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-127",
"text": "The clear improvements obtained are in line with those reported by Kundu and Ng (2018) of NAMANDA over BiDAF on reading comprehension tasks."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-128",
"text": "Results Table 3 shows the results of the monolingual baselines."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-129",
"text": "For the cross-lingual transfer experiments, these results can be viewed as a performance ceiling."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-130",
"text": "Observe that the results on our dataset are in general lower than those reported in Levy et al. (2017) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-131",
"text": "This can be attributed to three factors: a) on average, the context length in our dataset is longer compared to theirs (see Appendix C); b) the fastText word embeddings we employ to facilitate multilingual sharing have a lower coverage of the vocabularies of each language than the GloVe word embeddings employed in that work; c) in the UnREL setting, we employ a more challenging setup of 5-fold cross-validation (as opposed to 10-fold in their experiments), meaning that a lower number of relations is seen at training time and the test set contains a higher number of unseen relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-133",
"text": "**CROSS-LINGUAL MODEL TRANSFER**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-134",
"text": "In this set of experiments, seen in Figure 4a , we test how well RE models can be transferred from a source language with a large number of training examples to target languages with no or minimal training data."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-135",
"text": "In the UnENT experiments, we construct pairwise parallel test and development sets between English and each of the languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-136",
"text": "An English RE model (built on top of the multilingual representations described in sub-section 4) is trained on a full English training set (1 million instances)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-137",
"text": "We then evaluate how well this model can transfer to each of the four other languages in the following cases: with no finetuning or when 1000, 2000, 5000 or 10000 target language training examples are used for finetuning."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-138",
"text": "Note that entities in the target languages' test and development sets are not seen in the English training data."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-139",
"text": "We compare transfer performance with monolingual performance when a target language's full training set is employed."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-140",
"text": "A similar approach is followed for UnREL experiments."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-141",
"text": "However, since the number of relations is relatively small, cross-validation with five folds is employed instead of fixed splits."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-142",
"text": "Moreover, because this is a substantially more challenging setting we are interested in evaluating along another dimension (Question B): when relations are seen in the source language but not in the target language."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-143",
"text": "Furthermore, unlike for UnENT, we directly use 10k examples for finetuning."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-144",
"text": "Results Figure 5 shows the results of the crosslingual transfer experiments for UnENT, where transfer is accomplished through multilingually aligned fastText embeddings."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-145",
"text": "In a parallel set of experiments, transfer was performed through the multilingual BERT encoder."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-146",
"text": "The results of this (see Appendix D) showed a clear advantage for the former over the latter."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-147",
"text": "5 This is primarily due to the low vocabulary coverage of multilingual BERT which has a total vocabulary size of 100k tokens for 104 languages (see Appendix C for coverage statistics)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-148",
"text": "While it is clear that the models suffer from rather low recall when no finetuning is performed, the results show considerable improvements when finetuning with only 1000 target language examples."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-149",
"text": "With 10K target language examples, it is possible to nearly match the performance of a model trained on the full target language monolingual training set."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-150",
"text": "Similarly, in the UnREL experiments, our results ( Figure 6 ) show that it's possible to recover a large part of the fully-supervised monolingual models' performance."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-151",
"text": "It can be seen, however, that with 10k target language examples, a lower proportion of the performance is recovered when compared to the UnENT setting."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-152",
"text": "This indicates that it is more difficult to transfer the ability to identify relation paraphrases and entity types through global cues, which Levy et al. (2017) suggested are important for generalizing to new relations in this framework. (Footnote 5: We therefore continue the rest of our experiments in the paper using the multilingual fastText embeddings.)"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-153",
"text": "(Footnote 6: \"Global cues\" refers to cases where context phrasing deviates from the question in a way that is common between relations.)"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-154",
"text": ""
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-156",
"text": "**ONE MODEL, MULTIPLE LANGUAGES**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-157",
"text": "We now examine the possibility of training one multilingual model which is able to perform relation extraction across multiple languages, as shown in Figure 4b ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-158",
"text": "We are interested in the case when an entity may be seen in another language's training data, as this is a realistic cross-lingual KB completion scenario where different languages' KBs are better populated for different topics."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-159",
"text": "To control for training set size we include 200k training instances per language, so that the total size of the training set is equal to that of the monolingual baseline."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-160",
"text": "However, an additional benefit of multilingual training is that extra overall training data becomes available."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-161",
"text": "To test the effect of this, we also run an experiment where the full training set of each of the languages is employed (adding up to 5 million training examples)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-162",
"text": "In the UnREL experiments, 5-fold cross-validation is performed."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-163",
"text": "We are once again interested in exploiting the fact that KBs are better populated for different properties across different languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-164",
"text": "Our setup is therefore as follows: in each of the 5 folds, a test set relation for a particular language is not seen in that language's training set, but may be seen in any of the other languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-165",
"text": "This amounts to maintaining the original zero-shot setting (where a relation is not seen) monolingually, but providing supervision by allowing the models to peek across languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-166",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-167",
"text": "**RESULTS**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-168",
"text": "In the UnENT setting the multilingual models trained on just 200k instances per language perform slightly below the monolingual baselines."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-169",
"text": "The exception is French where, surprisingly, the baseline performance is actually exceeded."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-170",
"text": "When the full training sets of all languages are combined, the multilingual model outperforms the monolingual baselines for three (English, Spanish, and French) out of five languages and is slightly worse for two (German and Italian)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-171",
"text": "This demonstrates that not only is it possible to utilize a single model to perform RE in multiple languages, but that the multilingual supervision signal will often lead to improvements in performance."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-172",
"text": "These results are shown in the third and fourth columns of Table 3 ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-173",
"text": "The multilingual UnREL model outperforms its monolingual counterparts by large margins for all languages, reaching a near-100% F1-score improvement for most languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-174",
"text": "This is largely in line with our premise that the natural topicality of KBs across languages can be exploited to provide cross-lingual supervision for relation extraction models."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-176",
"text": "**HYPERPARAMETERS**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-177",
"text": "In all experiments, models were trained for five epochs with a learning rate of 1.0 using Adam (Kingma and Ba, 2014) ."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-178",
"text": "For finetuning in the cross-lingual transfer experiments, the learning rate was lowered to 0.001 to prevent forgetting, and a maximum of 30 finetuning iterations over the small target language training set were performed, with model selection based on the target language development set F1-score."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-179",
"text": "All monolingual models' word embeddings were initialised using fastText embeddings trained on each language's Wikipedia and common crawl corpora, except for the comparison experiments described in sub-section 5.1, where GloVe (Pennington et al., 2014) was used for comparability with Levy et al. (2017)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-181",
"text": "**RELATED WORK**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-182",
"text": "Multilingual NLU. Advances in natural language understanding tasks have been as impressive as they have been fast-paced."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-183",
"text": "Until recently, however, the multilingual aspect of such tasks has not received as much attention."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-184",
"text": "This is pri-"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-185",
"text": "Faruqui and Kumar (2015) employed a pipeline of machine translation systems to translate to English, then Open RE systems to perform RE on the translated text, followed by cross-lingual projection back to the source language."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-186",
"text": "Verga et al. (2016) apply the universal schema framework (Riedel et al., 2013) on top of multilingual embeddings to extract relations from Spanish text without using Spanish training data."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-187",
"text": "This approach, however, only enables generalization to unseen entities and does not have the flexibility to predict unseen relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-188",
"text": "Furthermore, both of these works faced a fundamental difficulty with evaluation."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-189",
"text": "The former resorts to manual annotation of a small number of examples (1000) in each language, and the latter uses the 2012 TAC Spanish slot-filling evaluation dataset, in which \"the coverage of facts in the available annotation is very small\"."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-190",
"text": "With the introduction of X-WikiRE, this work provides the first large-scale dataset and benchmark for the evaluation of multilingual RE spanning five languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-191",
"text": "While this paves the way for a wide range of research on multilingual relation extraction and knowledge base population, we hope to extend the dataset to a larger variety of languages in future work. This is particularly feasible because we have shown that the amount of training data required for cross-lingual model transfer is minimal: a small dataset, when only that is available, can go a long way."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-193",
"text": "**CONCLUSION**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-194",
"text": "We introduced X-WikiRE, a new, large-scale multilingual relation extraction dataset in which relation extraction is framed as a problem of reading comprehension to allow for generalization to unseen relations."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-195",
"text": "Using this, we demonstrated that a) multilingual training can be employed to exploit the fact that KBs are better populated in different areas for different languages, providing a strong cross-lingual supervision signal which leads to considerably better zero-shot relation extraction; b) models can be transferred cross-lingually with a minimal amount of target language data for finetuning; c) better modelling of nil-awareness in reading comprehension models leads to improvements on the task."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-196",
"text": "Our work is a step towards making KBs equally well-resourced across languages."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-197",
"text": "To encourage future work in this direction, we release our code and dataset."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-198",
"text": "A X-WikiRE descriptive statistics. Property: start time (P580)."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-199",
"text": "Figure 10: Property start time."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-200",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-201",
"text": "**B CONTEXT SIZE**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-202",
"text": "We computed the average length of the context in our dataset and in that of Levy et al. (2017). Table 4: Average number of tokens in the context."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-203",
"text": "Table 6 shows the results for our model in the UnENT scenario using both multilingual BERT and fastText."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-204",
"text": "BERT performs poorly compared to fastText in every language and in almost every finetuning setting."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-205",
"text": "This is likely due to multilingual BERT's lower coverage of our dataset's vocabulary, as can be seen in Table 5."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-206",
"text": "Table 6 : Precision, Recall and F1-scores for UnENT comparing scores using BERT and fastText multilingual embeddings."
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-207",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-208",
"text": "**C VOCABULARY COVERAGE**"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "1ac16c74cc5bb4099ae07f89d7f148-C001-210",
"text": "**D BERT VS FASTTEXT**"
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-18",
"1ac16c74cc5bb4099ae07f89d7f148-C001-19",
"1ac16c74cc5bb4099ae07f89d7f148-C001-20",
"1ac16c74cc5bb4099ae07f89d7f148-C001-21"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-38",
"1ac16c74cc5bb4099ae07f89d7f148-C001-39",
"1ac16c74cc5bb4099ae07f89d7f148-C001-40",
"1ac16c74cc5bb4099ae07f89d7f148-C001-41",
"1ac16c74cc5bb4099ae07f89d7f148-C001-42"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-179"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-185"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-18",
"1ac16c74cc5bb4099ae07f89d7f148-C001-39",
"1ac16c74cc5bb4099ae07f89d7f148-C001-179",
"1ac16c74cc5bb4099ae07f89d7f148-C001-185"
]
},
"@MOT@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-18",
"1ac16c74cc5bb4099ae07f89d7f148-C001-19",
"1ac16c74cc5bb4099ae07f89d7f148-C001-20",
"1ac16c74cc5bb4099ae07f89d7f148-C001-21"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-38",
"1ac16c74cc5bb4099ae07f89d7f148-C001-39",
"1ac16c74cc5bb4099ae07f89d7f148-C001-40",
"1ac16c74cc5bb4099ae07f89d7f148-C001-41",
"1ac16c74cc5bb4099ae07f89d7f148-C001-42"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-113",
"1ac16c74cc5bb4099ae07f89d7f148-C001-114"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-18",
"1ac16c74cc5bb4099ae07f89d7f148-C001-39",
"1ac16c74cc5bb4099ae07f89d7f148-C001-113"
]
},
"@USE@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-56",
"1ac16c74cc5bb4099ae07f89d7f148-C001-57",
"1ac16c74cc5bb4099ae07f89d7f148-C001-58"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-115",
"1ac16c74cc5bb4099ae07f89d7f148-C001-116"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-117",
"1ac16c74cc5bb4099ae07f89d7f148-C001-118",
"1ac16c74cc5bb4099ae07f89d7f148-C001-119",
"1ac16c74cc5bb4099ae07f89d7f148-C001-120",
"1ac16c74cc5bb4099ae07f89d7f148-C001-121"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-126"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-56",
"1ac16c74cc5bb4099ae07f89d7f148-C001-116",
"1ac16c74cc5bb4099ae07f89d7f148-C001-117",
"1ac16c74cc5bb4099ae07f89d7f148-C001-126"
]
},
"@EXT@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-59",
"1ac16c74cc5bb4099ae07f89d7f148-C001-60",
"1ac16c74cc5bb4099ae07f89d7f148-C001-61",
"1ac16c74cc5bb4099ae07f89d7f148-C001-62",
"1ac16c74cc5bb4099ae07f89d7f148-C001-63",
"1ac16c74cc5bb4099ae07f89d7f148-C001-64",
"1ac16c74cc5bb4099ae07f89d7f148-C001-65",
"1ac16c74cc5bb4099ae07f89d7f148-C001-66",
"1ac16c74cc5bb4099ae07f89d7f148-C001-67",
"1ac16c74cc5bb4099ae07f89d7f148-C001-68"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-59"
]
},
"@DIF@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-84"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-126",
"1ac16c74cc5bb4099ae07f89d7f148-C001-127"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-84",
"1ac16c74cc5bb4099ae07f89d7f148-C001-126"
]
},
"@SIM@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-113",
"1ac16c74cc5bb4099ae07f89d7f148-C001-114"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-113"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-130"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-151",
"1ac16c74cc5bb4099ae07f89d7f148-C001-152"
],
[
"1ac16c74cc5bb4099ae07f89d7f148-C001-202"
]
],
"cite_sentences": [
"1ac16c74cc5bb4099ae07f89d7f148-C001-130",
"1ac16c74cc5bb4099ae07f89d7f148-C001-152",
"1ac16c74cc5bb4099ae07f89d7f148-C001-202"
]
}
}
},
"ABC_7ac01a84ab696e7fa9d0ce336a393e_5": {
"x": [
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-53",
"text": "**ATTENTION MECHANISM**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-2",
"text": "We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-3",
"text": "Experimental results show that attention focuses on nouns and this behaviour holds true for two very typologically different languages."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-4",
"text": "We also draw parallels between artificial neural attention and human attention and show that neural attention focuses on word endings as it has been theorised for human attention."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-5",
"text": "Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-6",
"text": "For both languages, the enriched bilingual (speech-image) corpora with part-of-speech tags and forced alignments are distributed to the community for reproducible research."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-7",
"text": "Index Terms: grounded language learning, attention mechanism, cross-lingual speech retrieval, recurrent neural networks."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-10",
"text": "Over the past few years, there has been an increasing interest in research gathering the Language and Vision (LaVi) communities."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-11",
"text": "Multimodal corpora such as Flickr30k [1] or MSCOCO [2] containing images along with natural language captions were made available for research."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-29",
"text": "Fig. 2 : Attention weights over an English (2a) and Japanese caption (2c), both describing the same picture (2b)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-12",
"text": "They were soon extended with speech modality: speech recordings for the captions of Flickr8k were collected by [3] via crowdsourcing; spoken captions for MSCOCO were generated using Google Text-To-Speech (TTS) by [4] and using Voxygen TTS by [5] ; extensions of these corpora to other languages than English, such as Japanese, were also introduced by [6] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-13",
"text": "These corpora, as well as deep learning models, lead to contributions in multilingual language grounding and learning of shared and multimodal representations with neural networks [4, 7, 8, 9, 10, 11, 12, 13] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-14",
"text": "This paper focuses on computational models of visually grounded speech that were introduced by [14, 4] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-15",
"text": "Learned representations of such models were analyzed by [11, 7, 4] : [11] introduced novel methods for interpreting the activation patterns of recurrent neural networks (RNN) in a model of visually grounded meaning representation from textual and visual input and showed that RNN pay attention to word tokens belonging to specific lexical categories."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-16",
"text": "[4] found that final layers tend to encode semantic information whereas lower layers tend to encode form-related information."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-17",
"text": "[7] showed that a non trivial amount of phonological information is preserved in higher layers, and suggested that the attention layer focuses on semantic information."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-18",
"text": "Such computational models can be used to emulate child language acquisition and could shed light on the inner cognitive processes at work in humans, as suggested by [15]. (This work was supported by grants from NeuroCoG IDEX UGA as part of the \"Investissements d'avenir\" program (ANR-15-IDEX-02).)"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-19",
"text": "While [11, 7, 4] focused on analyzing speech representations learnt by speech-image neural models from a phonological and semantic point of view, the present work focuses on lexical acquisition and the way speech utterances are segmented into lexical units and processed by a computational model of visually grounded speech."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-20",
"text": "We analyze a key component of the neural model -the attention mechanism -and we observe its behaviour and draw parallels between artificial neural attention and human attention."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-21",
"text": "Attention indeed plays a key role in human perceptual learning, as stated by [16] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-22",
"text": "Contributions."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-23",
"text": "We enrich an existing speech-image corpus in English with forced alignments and part-of-speech (POS) tags and analyse which parts of the spoken utterances the neural model attends to."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-24",
"text": "In order to put these experiments in a cross-lingual perspective, we also experiment on a similar corpus in Japanese."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-25",
"text": "We show that the attention mechanism mostly focuses on nouns for both languages."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-26",
"text": "We also show that our Japanese model developed a language-specific behaviour to detect relevant information by paying attention to particles, as Japanese toddlers do."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-27",
"text": "Moreover, the bilingual corpus allows us to demonstrate that images can be used as pivots to automatically align spoken utterances in two different languages (English and Japanese) without using any transcripts."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-28",
"text": "This preliminary result, in line with previous findings of [8] , confirms that neural speech-image models can capture a cross-lingual semantic signal, a first step in the perspective of learning speech-to-speech translation systems without text supervision."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-30",
"text": "Attention peaks in the English caption are located above \"AIRPORT\" and \"JETS\"."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-31",
"text": "Attention peaks in the Japanese caption are located above \"NI\" (particle indicating location) and \"GA\" (particle indicating the subject of the sentence)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-32",
"text": "Red dotted lines show token boundaries."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-33",
"text": "Large orange markers show automatically detected peaks."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-34",
"text": "Japanese caption reads: \"Several planes are stopped at the airport\""
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-36",
"text": "**MODEL OF VISUALLY GROUNDED SPEECH**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-37",
"text": "The model we use for our experiments is based on that of [4] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-38",
"text": "It is trained to solve an image retrieval task: given a spoken description it retrieves the closest image that matches the description."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-39",
"text": "To do so, the model projects an image and its spoken description in a common representation space, so that matching image/utterance pairs lie near while mismatching image/utterance pairs lie apart."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-41",
"text": "**GENERAL ARCHITECTURE**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-42",
"text": "The model (see figure 1 ) has two components: an image encoder, and a speech encoder."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-43",
"text": "At training time, the network is presented with images and their corresponding spoken descriptions and tries to minimise the following loss function:"
},
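{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-43b",
"text": "The equation itself is not present in this excerpt; a reconstruction consistent with the description that follows (margin \\alpha, distance d(u,i), with u' and i' ranging over mismatching utterances and images) is the standard triplet margin loss: L = \\sum_{(u,i)} \\Big( \\sum_{u'} \\max(0, \\alpha + d(u,i) - d(u',i)) + \\sum_{i'} \\max(0, \\alpha + d(u,i) - d(u,i')) \\Big)."
},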
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-44",
"text": "This loss function encourages the network to minimise by a margin \u03b1 the distance d(u, i) between the encoded image i and the encoded utterance u belonging to matching image/utterance pairs while making the distance greater for mismatching image/utterance pairs."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-46",
"text": "**ENCODERS**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-47",
"text": "The image encoder takes VGG-16 ([17]) pre-calculated vectors as input instead of raw images."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-48",
"text": "It only consists of a dense layer that learns how to shrink the 4096 dimensional VGG-16 input vector to a 512 dimensional vector, which is then L2 normalised."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-49",
"text": "The speech encoder (input is 13 MFCC vectors instead of raw speech) consists of a convolutional layer followed by 5 stacked recurrent layers."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-50",
"text": "Contrary to the original model ( [4] ), we used GRU units instead of RHN units."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-51",
"text": "Results are still acceptable (see Table 1), even though the GRU architecture scores worse than the original RHN one."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-54",
"text": "One of the key components of the model is its attention mechanism."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-55",
"text": "The model computes a weighted sum of the GRU activations at all timesteps as follows: \\sum_t \\alpha_t h_t."
},
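{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-55b",
"text": "As a sketch of how the weights \\alpha_t can be obtained (the scoring function is not specified in this excerpt; a single learned vector w is one common choice): \\alpha_t = \\exp(w^\\top h_t) / \\sum_{t'} \\exp(w^\\top h_{t'}), a softmax over per-timestep scores so the weights are positive and sum to one."
},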
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-56",
"text": "Knowing by how much a given vector has been weighted gives us an insight on which portions of the speech signal the network relies to make its predictions (see Figure 2)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-57",
"text": "In the original architecture ( [4] ), attention follows the last recurrent layer."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-58",
"text": "To have more insight on the representation learnt by the network, we added an attention mechanism after the first recurrent layer."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-59",
"text": "The final vector produced by the speech encoder is a dot product of the vectors produced by both attention mechanisms."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-60",
"text": "However, for the sake of clarity, we will only report in this paper results on the attention weights of the top attention mechanism GRU5 (after the fifth recurrent layer)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-63",
"text": "**ENGLISH AND JAPANESE CORPORA**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-64",
"text": "The corpora we use for our experiments are based on MSCOCO [2] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-65",
"text": "MSCOCO is a dataset initially thought for computer vision purposes, mainly automatic image captioning."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-66",
"text": "The dataset consists of a set of images, each paired with 5 written captions describing the image."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-67",
"text": "All captions were written in English by humans and faithfully describe the content of the image."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-68",
"text": "The Japanese corpus we use is based on the newly created STAIR dataset [6] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-69",
"text": "Using the same methodology as [2] , [6] collected 5 Japanese captions for each image of the original MSCOCO dataset."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-70",
"text": "As for the original MSCOCO dataset, Japanese captions were written by native Japanese speakers."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-71",
"text": "It is worth insisting on the fact that these Japanese captions are original captions and not plain translations of their English equivalents."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-72",
"text": "MSCOCO and STAIR are thus comparable corpora."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-73",
"text": "We trained our model on extended versions of MSCOCO and STAIR."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-74",
"text": "Spoken COCO dataset was introduced by [4] for English."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-75",
"text": "We followed the same methodology as [4] and generated synthetic speech for each caption in the Japanese STAIR dataset."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-76",
"text": "We created the spoken STAIR dataset so it would follow the exact same train/val/test split as [4]."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-77",
"text": "We thus have two comparable corpora: one featuring images and spoken captions in English, and another one featuring the same images and spoken captions in Japanese."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-78",
"text": "This allowed us to compare the behaviour of the same architecture on two typologically different languages."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-79",
"text": "We force-aligned each spoken caption to its transcription (using the Montreal Forced Aligner [18] for English and the Maus Forced Aligner [19] for Japanese), resulting in alignments at word and phone level."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-80",
"text": "We also tagged each dataset using TreeTagger [20] for English and KyTea [21] for Japanese."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-81",
"text": "As the tagset of both taggers differs, we mapped each POS to its Universal POS equivalent [22] enabling us to compare the POS distribution of each corpus."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-82",
"text": "Table 1: Recall at 1, 5, and 10 as well as median rank r on a speech-image retrieval task (test part of our datasets with 5k images). English: R@1 0.060, R@5 0.195, R@10 0.301, r = 25; Japanese: R@1 0.054, R@5 0.180, R@10 0.283, r = 28."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-83",
"text": "Original implementation by [4] with RHN reports median rank r = 13 on English dataset."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-84",
"text": "Chance for median rank r is 2500.5."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-86",
"text": "**WHAT DO MODELS PAY ATTENTION TO?**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-87",
"text": "We first train two monolingual models for English and Japanese on the train set (566 435 spoken captions) of the corpora for 15 epochs."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-88",
"text": "Baseline results are similar for English and Japanese (see Table 1 )."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-89",
"text": "To analyse the behaviour of the attention mechanism of our model, we encoded each caption of the test set and extracted the attention weights \u03b1t, resulting in an array of t weights."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-90",
"text": "We then used a peak detection algorithm to detect local maxima in the attention weights and thus know which timesteps were given the highest weights (large orange markers in Fig. 2)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-91",
"text": "We only considered peaks that were at least 60% as high as the highest detected peak in the utterance."
},
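{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-91b",
"text": "In our notation (introduced here for clarity), with M = \\{ t : \\alpha_t > \\alpha_{t-1} \\ \\mathrm{and} \\ \\alpha_t > \\alpha_{t+1} \\} the set of local maxima, the retained peaks are P = \\{ t \\in M : \\alpha_t \\ge 0.6 \\max_{t' \\in M} \\alpha_{t'} \\}."
},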
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-92",
"text": "Having a timestep aligned speech signal for each language enables us to see above which words (and thus POS) attention focuses on."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-93",
"text": "Table 2 shows the top ten words located under peaks for both languages (and their corresponding frequency in the training corpus)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-94",
"text": "In order to see if the attention mechanism does any better than learning corpus statistics, we need a baseline POS distribution for comparison."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-95",
"text": "One possibility would be to simply compare the proportion of peaks under a given POS to the frequency of the same POS computed on tokens (as provided in Table 2 )."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-117",
"text": "**CHILD LANGUAGE ACQUISITION AND NOUN-BIAS**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-118",
"text": "When learning their native language, it has been theorised that children exhibit a noun-bias [23]: that is, in most languages children learn nouns before any other category."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-119",
"text": "We notice that both models exhibit such language-general behaviour and favour nouns over other categories."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-120",
"text": "Also, we showed that our Japanese model develops a language-specific behaviour when mainly focusing on GA particles."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-121",
"text": "[24] demonstrated that Japanese toddlers also make use of GA to segment speech before any other particle."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-122",
"text": "The noun-bias phenomenon in our corpus can be explained by two factors: first, images in our corpus display many objects, thus prompting annotator to use more nouns than verbs; second, VGG vectors (used to encode images) are only trained to detect objects and not actions."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-124",
"text": "**ATTENTION ABOVE WORD BEGINNINGS OR WORD ENDINGS?**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-125",
"text": "Beginning Table 3 : Position of attention peaks above words for English (EN) and Japanese (JA)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-126",
"text": "We analysed above which part of words peaks are located."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-127",
"text": "We divided each word beneath a peak into 4 equal parts and counted the percentage of peaks located above a given category (see Table 3 )."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-128",
"text": "We notice that peaks in our English model are mainly located on the second half of the words."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-129",
"text": "This phenomenon is coherent with Slobin's [25] Operating Principles favoring language acquisition stating that children \"pay attention to the ends of words\"."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-130",
"text": "Peaks in Japanese are located at word endings but also at word beginnings."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-131",
"text": "It seems the very beginning of some particles is able to trigger an attention peak."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-133",
"text": "**IMAGES AS PIVOTS FOR CROSS-LINGUAL SPEECH RETRIEVAL?**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-134",
"text": "We have seen in previous section that attention focuses on nouns and Table 2 suggests that these nouns correspond to the main concept of the paired image."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-135",
"text": "To confirm this trend, we experiment on a crosslingual speech-to-speech retrieval task using images as pivots."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-136",
"text": "This possibility was introduced in [8] , but required training jointly or alternatively two speech encoders within the same architecture and a parallel bilingual speech dataset while we experiment with separately trained models for both languages."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-137",
"text": "In [8] , a parallel corpus was needed as the loss functions adopted try to minimise either the distance between captions in two languages or the distance between captions in two languages and the associated image as pivot."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-138",
"text": "As our approach uses two monolingual models, we do not need a parallel corpus."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-139",
"text": "Each monolingual model can be trained on its own dataset featuring images and their spoken description."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-140",
"text": "The approach is the following: we first select a set of pivot images never seen by any of the monolingual models before."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-141",
"text": "We encode these images with the image encoder of each language."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-142",
"text": "10 Then, for each speech utterance query in a source language usrc (English for instance), we find the nearest speech utterance in the target language utgt (Japanese for instance) which minimises the cumulated distance d(usrc, i) + d(i, utgt) among all pivot images i."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-143",
"text": "To make sure no parallel dataset is used, we trained a new English model on the first half of the train set, and a new Japanese model on the second half."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-144",
"text": "We evaluated our approach on 1k captions of our test corpus to be comparable with [8] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-145",
"text": "11 At the time of the evaluation, given a speech query in language src which we know 10 Since both image encoders (from English and Japanese) are trained separately, they do not lead to the same representation of an image."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-146",
"text": "11 We did not perform evaluation on the full 25000 EN \u00d7 25000 JP distance matrices where each source query is associated with 5 target captions."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-147",
"text": "Instead, we randomly sub-sampled ten 1000 EN \u00d7 1000 JP distance matri- Table 4 : Results on English (EN) to Japanese (JP) and Japanese to English speech-to-speech retrieval (subset of 1k captions)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-148",
"text": "For comparison, we report [8] 's results on English to Hindi (HI) and Hindi to English speech-to-speech retrieval."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-149",
"text": "Chance scores are R@1=.001, R@5=.005, and R@10=.01."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-150",
"text": "Chance for median rank r is 500.5."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-151",
"text": "is paired with image I, we assess the ability of our approach to rank the matching spoken caption in language tgt paired with image I in the top 1, 5, and 10 results and give its median rank r. We report our results in Table 4 as well as results from [8] who performed speechto-speech retrieval using crowd-sourced spoken captions in English and Hindi."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-152",
"text": "Our results are surprisingly high given the fact we did not train a bilingual model but used the output of two monolingual models never trained to solve such a task."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-153",
"text": "Nevertheless, it is also important to mention that [8] experimented on real speech with multiple speakers while we used synthetic speech with only one voice."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-154",
"text": "Table 5 shows an example of top-1 retrieved Japanese sentences for 2 English queries."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-156",
"text": "**EN**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-157",
"text": "this is a display of donuts on a couple shelves JA \u3044\u308d\u3044\u308d\u306a\u7a2e\u985e\u306e\u30c9\u30fc\u30ca\u30c4\u304c\u4e26\u3079\u3089\u308c\u3066\u3044\u308b Trans."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-158",
"text": "Different kinds of donuts are lined up EN a living room with some brick walls and a fireplace JA \u30bd\u30d5\u30a1\u30fc\u3084\u30c6\u30fc\u30d6\u30eb\u3084\u6696\u7089\u306e\u3042\u308b\u897f\u6d0b\u98a8\u306e\u90e8\u5c4b Trans."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-159",
"text": "Western-style room with sofa, table and fireplace Table 5 : Example of semantically related captions."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-160",
"text": "English (EN) query and retrieved Japanese caption (JA) and its translation (TRANS)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-162",
"text": "**CONCLUSION**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-163",
"text": "In this paper we showed that attention in a neural model of visually grounded speech mainly focuses on nouns."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-164",
"text": "We also showed that this behaviour holds true for two very typologically different languages such as English and Japanese and that attention could also develop language-specifc mechanisms to detect relevant information in one of the languages."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-165",
"text": "We also provided evidence that it is possible to perform speech-to-speech retrieval with images as pivots using the output of two independently trained monolingual models."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-166",
"text": "In future work, we would like to validate our methodology on a bilingual dataset featuring real voices and try to extract a bilingual speech-tospeech dictionary using attention peaks as anchor points."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-167",
"text": "Ultimatly, we would like to emphasise the paramount importance of using other languages than English when trying to analyse the linguistic representations learnt by neural networks so as to understand if the models encode language specific or language general information, and thus better understand their strengths and weaknesses."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-169",
"text": "**ACKNOWLEDGEMENTS**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-170",
"text": "We thank G. Chrupa\u0142a and his team for sharing their code and dataset, as well as for helping us with technical issues."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-171",
"text": "ces so that there would be only one target caption for each query in order to compare our results with [8] ."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-172",
"text": "Results are averaged over 10 random samples."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-96",
"text": "However, by doing so, we would assume that all tokens have the same length in the speech signal, which is not the case (verbs are longer than determiners for instance)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-97",
"text": "Thus, for each spoken utterance of the test set, we sampled 50 * p random peak positions (p number of true detected peaks per utterance), and computed the POS distribution over such peaks (see 3a)."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-98",
"text": "We consider this as our baseline corpus distribution if attention peaks were to occur randomly."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-99",
"text": "sal VERB tag, thus the high proportion of verbs in the Japanese dataset."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-100",
"text": "7 Uses the first order difference of the input array -see https://github.com/ lucashn/peakutils."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-102",
"text": "**ENGLISH**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-103",
"text": "We notice (Fig. 3b ) that the attention mechanism of the English model primarily focuses on NOUNS: 82% of the peaks are located above nouns."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-104",
"text": "This is far above corpus frequency, which is 47%."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-105",
"text": "The attention mechanism considers neither determiners (DET) nor adpositions (ADP) nor adjectives (ADJ) as relevant as only 0.6%, 3%, and 2.85% are highlighted, where corpus frequencies would predict 7%, 8%, and 8% respectively."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-106",
"text": "Verbs (VERB) are half as often highlighted as corpus frequency would predict, meaning attention barely relies on such words to make its prediction."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-108",
"text": "**JAPANESE**"
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-109",
"text": "The Japanese attention mechanism clearly makes use of particles 8 (PRT): 45.77% of the peaks are located above such words where corpus frequency would predict 16.9%."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-110",
"text": "In fact, 6 of the top ten words are particles (see Table 2 )."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-111",
"text": "Moreover, 17.83% of the peak highlight speech segments corresponding to the GA particle, well before nouns: GA is a particle that is used to indicate that the preceding word is the subject of the sentence."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-112",
"text": "Thus, detecting such a particle is most useful, as the preceding word surely is the main object of the target image."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-113",
"text": "The Japanese attention mechanism also seems to rely on nouns as 47.79% of peaks are located above nouns."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-114",
"text": "One could argue this value is not very different from corpus frequency: 47.42%."
},
{
"sent_id": "7ac01a84ab696e7fa9d0ce336a393e-C001-115",
"text": "However, if such POS were to hinder prediction, we would expect the attention mechanism to lower the number of peaks above such words, such as the model did for verbs or adjectives, which is not the case here, meaning NOUNS are useful for the model's prediction."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-11",
"7ac01a84ab696e7fa9d0ce336a393e-C001-12"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-13"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-14"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-15",
"7ac01a84ab696e7fa9d0ce336a393e-C001-16"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-83"
]
],
"cite_sentences": [
"7ac01a84ab696e7fa9d0ce336a393e-C001-12",
"7ac01a84ab696e7fa9d0ce336a393e-C001-13",
"7ac01a84ab696e7fa9d0ce336a393e-C001-14",
"7ac01a84ab696e7fa9d0ce336a393e-C001-15",
"7ac01a84ab696e7fa9d0ce336a393e-C001-83"
]
},
"@DIF@": {
"gold_contexts": [
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-19",
"7ac01a84ab696e7fa9d0ce336a393e-C001-20"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-50"
]
],
"cite_sentences": [
"7ac01a84ab696e7fa9d0ce336a393e-C001-19",
"7ac01a84ab696e7fa9d0ce336a393e-C001-50"
]
},
"@USE@": {
"gold_contexts": [
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-37",
"7ac01a84ab696e7fa9d0ce336a393e-C001-38",
"7ac01a84ab696e7fa9d0ce336a393e-C001-39"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-74"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-75"
],
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-76",
"7ac01a84ab696e7fa9d0ce336a393e-C001-77",
"7ac01a84ab696e7fa9d0ce336a393e-C001-78"
]
],
"cite_sentences": [
"7ac01a84ab696e7fa9d0ce336a393e-C001-37",
"7ac01a84ab696e7fa9d0ce336a393e-C001-74",
"7ac01a84ab696e7fa9d0ce336a393e-C001-75",
"7ac01a84ab696e7fa9d0ce336a393e-C001-76"
]
},
"@EXT@": {
"gold_contexts": [
[
"7ac01a84ab696e7fa9d0ce336a393e-C001-57",
"7ac01a84ab696e7fa9d0ce336a393e-C001-58"
]
],
"cite_sentences": [
"7ac01a84ab696e7fa9d0ce336a393e-C001-57"
]
}
}
},
"ABC_b71321a9252376308d627c439e85b7_5": {
"x": [
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-2",
"text": "Semantic representations have long been argued as potentially useful for enforcing meaning preservation and improving generalization performance of machine translation methods."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-3",
"text": "In this work, we are the first to incorporate information about predicate-argument structure of source sentences (namely, semantic-role representations) into neural machine translation."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-4",
"text": "We use Graph Convolutional Networks (GCNs) to inject a semantic bias into sentence encoders and achieve improvements in BLEU scores over the linguistic-agnostic and syntaxaware versions on the English-German language pair."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-128",
"text": "In the second sentence, the BiRNN's translation is ungrammatical, whereas semantic GCN is able to correctly translate the source sentence."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-7",
"text": "It has long been argued that semantic representations may provide a useful linguistic bias to machine translation systems (Weaver, 1955; BarHillel, 1960) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-8",
"text": "Semantic representations provide an abstraction which can generalize over different surface realizations of the same underlying 'meaning'."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-9",
"text": "Providing this information to a machine translation system, can, in principle, improve meaning preservation and boost generalization performance."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-10",
"text": "Though incorporation of semantic information into traditional statistical machine translation has been an active research topic (e.g., (Baker et al., 2012; Liu and Gildea, 2010; Wu and Fung, 2009; Bazrafshan and Gildea, 2013; Aziz et al., 2011; Jones et al., 2012) ), we are not aware of any previous work considering semantic structures in neural machine translation (NMT)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-11",
"text": "In this work, we aim to fill this gap by showing how information about predicate-argument structure of source sentences can be integrated into standard attentionbased NMT models (Bahdanau et al., 2015) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-12",
"text": "We consider PropBank-style (Palmer et al., 2005) semantic role structures, or more specifi- cally their dependency versions (Surdeanu et al., 2008) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-13",
"text": "The semantic-role representations mark semantic arguments of predicates in a sentence and categorize them according to their semantic roles."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-14",
"text": "Consider Figure 1 , the predicate gave has three arguments: 1 John (semantic role A0, 'the giver'), wife (A2, 'an entity given to') and present (A1, 'the thing given')."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-15",
"text": "Semantic roles capture commonalities between different realizations of the same underlying predicate-argument structures."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-16",
"text": "For example, present will still be A1 in sentence \"John gave a nice present to his wonderful wife\", despite different surface forms of the two sentences."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-17",
"text": "We hypothesize that semantic roles can be especially beneficial in NMT, as 'argument switching' (flipping arguments corresponding to different roles) is one of frequent and severe mistakes made by NMT systems (Isabelle et al., 2017) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-18",
"text": "There is a limited amount of work on incorporating graph structures into neural sequence models."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-19",
"text": "Though, unlike semantics in NMT, syntactically-aware NMT has been a relatively hot topic recently, with a number of approaches claiming improvements from using treebank syntax Eriguchi et al., 2016; Nadejde et al., 2017; Bastings et al., 2017; Aharoni and Goldberg, 2017) , our graphs are different from syntactic structures."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-20",
"text": "Unlike syntactic dependency graphs, they are not trees and thus cannot be processed in a bottom-up fashion as in Eriguchi et al. (2016) or easily linearized as in Aharoni and Goldberg (2017) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-21",
"text": "Luckily, the modeling approach of Bastings et al. (2017) does not make any assumptions about the graph structure, and thus we build on their method."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-22",
"text": "Bastings et al. (2017) used Graph Convolutional Networks (GCNs) to encode syntactic structure."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-23",
"text": "GCNs were originally proposed by Kipf and Welling (2016) and modified to handle labeled and automatically predicted (hence noisy) syntactic dependency graphs by ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-24",
"text": "Representations of nodes (i.e. words in a sentence) in GCNs are directly influenced by representations of their neighbors in the graph."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-25",
"text": "The form of influence (e.g., transition matrices and parameters of gates) are learned in such a way as to benefit the end task (i.e. translation)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-26",
"text": "These linguistically-aware word representations are used within a neural encoder."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-27",
"text": "Although recent research has shown that neural architectures are able to learn some linguistic phenomena without explicit linguistic supervision (Linzen et al., 2016; Vaswani et al., 2017) , informing word representations with linguistic structures can provide a useful inductive bias."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-28",
"text": "We apply GCNs to the semantic dependency graphs and experiment on the English-German language pair (WMT16)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-29",
"text": "We observe an improvement over the semantics-agnostic baseline (a BiRNN encoder; 23.3 vs 24.5 BLEU)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-30",
"text": "As we use exactly the same modeling approach as in the syntactic method of Bastings et al. (2017) , we can easily compare the influence of the types of linguistic structures (i.e., syntax vs. semantics)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-31",
"text": "We observe that when using full WMT data we obtain better results with semantics than with syntax (23.9 BLEU for syntactic GCN)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-32",
"text": "Using syntactic and semantic GCN together, we obtain a further gain (24.9 BLEU) that suggests the complementarity of syntax and semantics."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-34",
"text": "**MODEL**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-36",
"text": "**ENCODER-DECODER MODELS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-37",
"text": "We use a standard attention-based encoderdecoder model (Bahdanau et al., 2015) as a starting point for constructing our model."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-38",
"text": "In encoderdecoder models, the encoder takes as input the source sentence x and calculates a representation of each word x t in x. The decoder outputs a translation y relying on the representations of the source sentence."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-39",
"text": "Traditionally, the encoder is parametrized as a Recurrent Neural Network (RNN), but other architectures have also been successful, such as Convolutional Neural Networks (CNN) (Gehring et al., 2017) and hierarchical selfattention models (Vaswani et al., 2017) , among others."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-40",
"text": "In this paper we experiment with RNN and CNN encoders."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-41",
"text": "We explore the benefits of incorporating information about semantic-role structures into such encoders."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-42",
"text": "More formally, RNNs (Elman, 1990) can be defined as a function RNN(x 1:t ) that calculates the hidden representation h t of a word x t based on its left context."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-43",
"text": "Bidirectional RNNs use two RNNs: one runs in the forward direction and another one in the backward direction."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-44",
"text": "The forward RNN(x 1:t ) represents the left context of word x t , whereas the backward RNN(x n:t ) computes a representation of the right context."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-45",
"text": "The two representations are concatenated in order to incorporate information about the entire sentence:"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-46",
"text": "In contrast to BiRNNs, CNNs (LeCun et al., 2001) calculate a representation of a word x t by considering a window of words w around x t , such as"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-47",
"text": "where f is usually an affine transformation followed by a nonlinear function."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-48",
"text": "Once the sentence has been encoded, the decoder takes as input the induced sentence representation and generates the target sentence y. The target sentence y is predicted word by word using an RNN decoder."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-49",
"text": "At each step, the decoder calculates the probability of generating a word y t conditioning on a context vector c t and the previous state of the RNN decoder."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-50",
"text": "The context vector c t is calculated based on the representation of the source sentence computed by the encoder, using an attention mechanism (Bahdanau et al., 2015) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-51",
"text": "Such a model is trained end-to-end on a parallel corpus to maximize the conditional likelihood of the target sentences."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-53",
"text": "**GRAPH CONVOLUTIONAL NETWORKS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-54",
"text": "Graph neural networks are a family of neural architectures (Scarselli et al., 2009; Gilmer et al., 2017) specifically devised to induce representation of nodes in a graph relying on its graph structure."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-55",
"text": "Graph convolutional networks (GCNs) belong to this family."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-56",
"text": "While GCNs were introduced BiRNN CNN Baseline (Bastings et al., 2017) 14.9 12.6 +Sem 15.6 13.4 +Syn (Bastings et al., 2017) 16.1 13.7 +Syn + Sem 15.8 14.3 for modeling undirected unlabeled graphs (Kipf and Welling, 2016) , in this paper we use a formulation of GCNs for labeled directed graphs, where the direction and the label of an edge are incorporated."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-57",
"text": "In particular, we follow the formulation of and Bastings et al. (2017) for syntactic graphs and apply it to dependency-based semantic-role structures (Hajic et al., 2009 ) (as in Figure 1 )."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-58",
"text": "More formally, consider a directed graph G = (V, E), where V is a set of nodes, and E is a set of edges."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-59",
"text": "Each node v \u2208 V is represented by a feature vector x v \u2208 R d , where d is the latent space dimensionality."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-60",
"text": "The GCN induces a new representation h v \u2208 R d of a node v while relying on representations h u of its neighbors:"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-61",
"text": "where N (v) is the set of neighbors of v, W dir(u,v) \u2208 R d\u00d7d is a direction-specific parameter matrix."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-62",
"text": "There are three possible directions (dir(u, v) \u2208 {in, out, loop}): self-loop edges were added in order to ensure that the initial representation of node h v directly affects its new representation h v ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-63",
"text": "The vector b lab(u,v) \u2208 R d is an embedding of a semantic role label of the edge (u, v) (e.g., A0)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-64",
"text": "The functions g u,v are scalar gates which weight the importance of each edge."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-65",
"text": "Gates are particularly useful when the graph is predicted BiRNN Baseline (Bastings et al., 2017) 23.3 +Sem 24.5 +Syn (Bastings et al., 2017) 23.9 +Syn + Sem 24.9 and thus may contain errors, i.e., wrong edges."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-66",
"text": "In this scenario gates can down weight the influence of such edges."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-67",
"text": "\u03c1 is a non-linearity (ReLU)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-68",
"text": "2 As with CNNs, GCN layers can be stacked in order to incorporate higher order neighborhoods."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-69",
"text": "In our experiments, we used GCNs on top of a standard BiRNN encoder and a CNN encoder (Figure 2) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-70",
"text": "In other words, the initial representations of words fed into GCN were either RNN states or CNN representations."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-72",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-73",
"text": "We experimented with the English-to-German WMT16 dataset (\u223c4.5 million sentence pairs for training)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-74",
"text": "We use its subset, News Commentary v11, for development and additional experiments (\u223c226.000 sentence pairs)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-75",
"text": "For all these experiments, we use newstest2015 and newstest2016 as a validation and test set, respectively."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-76",
"text": "We parsed the English partitions of these datasets with a syntactic dependency parser (Andor et al., 2016) and dependency-based semantic role labeler ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-77",
"text": "We constructed the English vocabulary by taking all words with frequency higher than three, while for German we used byte-pair encodings (BPE) (Sennrich et al., 2016)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-78",
"text": "All hyperparameter selection was performed on the validation set (see Appendix A)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-79",
"text": "We measured the performance of the models with (cased) BLEU scores (Papineni et al., 2002) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-80",
"text": "The settings and the framework (Neural Monkey (Helcl and Libovick\u00fd, 2017) ) used for experiments are the ones used in Bastings et al. (2017) , which we use as baselines."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-81",
"text": "As RNNs, we use GRUs (Cho et al., 2014) ."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-82",
"text": "We now discuss the impact that different architectures and linguistic information have on the translation quality."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-84",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-85",
"text": "First, we start with experiments with the smaller News Commentary training set (See Table 1 )."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-86",
"text": "As in Bastings et al. (2017) , we used the standard attention-based encoder-decoder model as a baseline."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-87",
"text": "We tested the impact of semantic GCNs when used on top of CNN and BiRNN encoders."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-88",
"text": "As expected, BiRNN results are stronger than CNN ones."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-89",
"text": "In general, for both encoders we observe the same trend: using semantic GCNs leads to an improvement over the baseline model."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-90",
"text": "The improvements is 0.7 BLEU for BiRNN and 0.8 for CNN."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-91",
"text": "This is slightly surprising as the potentially non-local semantic information should in principle be more beneficial within a less powerful and local CNN encoder."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-113",
"text": "These results (e.g., BiRNN+SelfLoop) show that the linguisticagnostic GCNs perform on par with the baseline, and thus using linguistic structure is genuinely beneficial in translation."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-114",
"text": "Since syntax and semantic structures seem to be individually beneficial and, though related, capture different linguistic phenomena, it is natural to try combining them."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-115",
"text": "When syntax and semantic are combined together in the same GCN layer (SemSyn), we do not observe any improvement with respect to having semantic and syntactic information alone."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-116",
"text": "4 We argue that the reason for this is that the two linguistic signals do not interact much when encoded into the same GCN layer with a simpler aggregation function."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-117",
"text": "We thus stacked a semantic GCN on top of a syntactic one and varied the number of layers."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-118",
"text": "Though this approach is more successful, we manage to obtain only very moderate improvements over the singlerepresentation models."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-120",
"text": "**QUALITATIVE ANALYSIS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-121",
"text": "We analyzed the behavior of the BiRNN baseline and the semantic GCN model trained on the full WMT16 training set."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-122",
"text": "In Table 4 we show three examples where there is a clear difference between translations produced by the two models."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-123",
"text": "Besides the two translations, we show the dependency SRL structure predicted by the labeler and exploited by our GCN model."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-124",
"text": "In the first sentence, the only difference is in the choice of the preposition for the argument Mark. Note that the argument is correctly assigned to role A2 ('Buyer') by the semantic role labeler."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-125",
"text": "The BiRNN model translates to with nach, which in German expresses directionality and would be a correct translation should the argument refer to a location."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-126",
"text": "In contrast, semantic GCN correctly translates to as an."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-127",
"text": "We hypothesize that the semantic structure, namely the assignment of the argument to A2 rather than AM-DIR ('Directionality'), helps the model to choose the right preposition."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-92",
"text": "The syntactic GCNs (Bastings et al., 2017) appear stronger than semantic GCNs."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-93",
"text": "As exactly the same model and optimization are used for both GCNs, the differences should be due to the type of linguistic representations used."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-94",
"text": "3 When syntactic and semantic GCNs are used together, we observe a further improvement with respect to the semantic GCN model, and a substantial improvement with respect to the syntactic GCN model with a CNN encoder."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-95",
"text": "Now we turn to the full WMT experiments."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-96",
"text": "Though we expected that the linguistic bias should more valuable in a resource-poor setting, the improvement from using semantic-role structures is larger here (+1.2 BLEU)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-97",
"text": "It is surprising but perhaps more data is beneficial for accurately modeling influence of semantics on the translation task."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-98",
"text": "Interestingly, the semantic GCN now outperforms the syntactic one by 0.6 BLEU."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-99",
"text": "Again, it is hard to pinpoint exact reasons for this."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-100",
"text": "One may speculate though that, given enough data, RNNs were able to capture syntactic dependency and thus reducing the benefits from using treebank syntax, whereas (often less local and harder) semantic dependencies were more complementary."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-101",
"text": "Finally, when syntactic and semantic GCN are trained together, we obtain a further improvement reaching 24.9 BLEU."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-102",
"text": "These results suggest that syntactic and semantic dependency structures are complementary information when it comes to translation."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-103",
"text": "3 Note that the SRL system we use does not use syntax and is faster than the syntactic parser of Andor et al. (2016) , so semantic GCNs may still be preferable from the engineering perspective even in this setting."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-104",
"text": "BiRNN CNN Baseline (Bastings et al., 2017) 14.1 12.1 +Sem (1L) 14.3 12.5 +Sem (2L) 14.4 12.6 +Sem (3L) 14.4 12.7 +Syn (2L) (Bastings et al., 2017) 14.8 13.1 +SelfLoop (1L) 14.1 12.1 +SelfLoop (2L) 14.2 11.5 +SemSyn (1L) 14.1 12.7 +Syn (1L) + Sem (1L) 14.7 12.7 +Syn (1L) + Sem (2L) 14.6 12.8 +Syn (2L) + Sem (1L) 14.9 13.0 +Syn (2L) + Sem (2L) 14.9 13.5"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-106",
"text": "**ABLATION AND SYNTAX-SEMANTICS GCNS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-107",
"text": "We used the validation set to perform extra experiments, as well as to select hyper parameters (e.g., the number of GCN layers) for the experiments presented above."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-108",
"text": "Table 3 presents the results."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-109",
"text": "The annotation 1L, 2L and 3L refers to the number of GCN layers used."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-110",
"text": "First, we tested whether the gain we observed is an effect of an extra layer of non-linearity or an effect of the linguistic structures encoded with GCNs."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-111",
"text": "In order to do so, we used the GCN layer without any structural information."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-112",
"text": "In this way, only the self-loop edge is used within the GCN node updates."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-129",
"text": "Again, the arguments, correctly identified by semantic role labeler, may have been useful in translating this somewhat tricky sentence."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-130",
"text": "Finally, in the third case, we can observe that both translations are problematic."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-131",
"text": "BiRNN and Semantic GCN ignored verbs sit and play, respectively."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-132",
"text": "However, BiRNN's translation for this sentence is preferable, as it is grammatically correct, even if not fluent or particularly precise."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-134",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-135",
"text": "In this work we propose injecting information about predicate-argument structures of sentences in NMT models."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-136",
"text": "We observe that the semantic structures are beneficial for the English-German language pair."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-137",
"text": "So far we evaluated the model performance in terms of BLEU only."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-138",
"text": "It would be interesting in future work to both understand when semantics appears beneficial, and also to see which components of semantic structures play a role."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-139",
"text": "Experiments on other language pairs are also left for future work."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-141",
"text": "**A HYPERPARAMETERS**"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-142",
"text": "For experiments on the News Commentary data we used 8000 BPE merges, whereas we used 16000 BPE merges for En-De experiments on the full dataset."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-143",
"text": "For all the experiments, we used bidirectional GRUs and we set the embedding size to 256, we used word dropout with retain probability of 0.8 and edge dropout with the same probability, we used L2 regularization on all the parameters with value of 10 \u22128 , translations are obtained using a greedy decoder."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-144",
"text": "We placed residual connections (He et al., 2016) before every GCN layer."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-145",
"text": "For the experiments on News Commentary data, we set GRU (for both encoder and decoder) and CNN hidden states to 512, we use Adam (Kingma and Ba, 2015) as optimizer with an initial learning rate of 0.0002, and we trained the models for 50 epochs."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-146",
"text": "For large scale experiments on En-De, we set the GRU hidden states to 800, and instead of greedy decoding we employed beam search (beam 12)."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-147",
"text": "We trained the model for 20 epochs with the same hyperparameters."
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "b71321a9252376308d627c439e85b7-C001-149",
"text": "**B DATASETS STATISTICS**"
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"b71321a9252376308d627c439e85b7-C001-18",
"b71321a9252376308d627c439e85b7-C001-19"
],
[
"b71321a9252376308d627c439e85b7-C001-21"
],
[
"b71321a9252376308d627c439e85b7-C001-22",
"b71321a9252376308d627c439e85b7-C001-23",
"b71321a9252376308d627c439e85b7-C001-24",
"b71321a9252376308d627c439e85b7-C001-25",
"b71321a9252376308d627c439e85b7-C001-26"
],
[
"b71321a9252376308d627c439e85b7-C001-54",
"b71321a9252376308d627c439e85b7-C001-55",
"b71321a9252376308d627c439e85b7-C001-56"
],
[
"b71321a9252376308d627c439e85b7-C001-64",
"b71321a9252376308d627c439e85b7-C001-65"
]
],
"cite_sentences": [
"b71321a9252376308d627c439e85b7-C001-19",
"b71321a9252376308d627c439e85b7-C001-21",
"b71321a9252376308d627c439e85b7-C001-22",
"b71321a9252376308d627c439e85b7-C001-56",
"b71321a9252376308d627c439e85b7-C001-65"
]
},
"@DIF@": {
"gold_contexts": [
[
"b71321a9252376308d627c439e85b7-C001-18",
"b71321a9252376308d627c439e85b7-C001-19"
],
[
"b71321a9252376308d627c439e85b7-C001-54",
"b71321a9252376308d627c439e85b7-C001-55",
"b71321a9252376308d627c439e85b7-C001-56"
]
],
"cite_sentences": [
"b71321a9252376308d627c439e85b7-C001-19",
"b71321a9252376308d627c439e85b7-C001-56"
]
},
"@SIM@": {
"gold_contexts": [
[
"b71321a9252376308d627c439e85b7-C001-21"
],
[
"b71321a9252376308d627c439e85b7-C001-30",
"b71321a9252376308d627c439e85b7-C001-31",
"b71321a9252376308d627c439e85b7-C001-32"
]
],
"cite_sentences": [
"b71321a9252376308d627c439e85b7-C001-21",
"b71321a9252376308d627c439e85b7-C001-30"
]
},
"@MOT@": {
"gold_contexts": [
[
"b71321a9252376308d627c439e85b7-C001-30",
"b71321a9252376308d627c439e85b7-C001-31",
"b71321a9252376308d627c439e85b7-C001-32"
]
],
"cite_sentences": [
"b71321a9252376308d627c439e85b7-C001-30"
]
},
"@USE@": {
"gold_contexts": [
[
"b71321a9252376308d627c439e85b7-C001-30",
"b71321a9252376308d627c439e85b7-C001-31",
"b71321a9252376308d627c439e85b7-C001-32"
],
[
"b71321a9252376308d627c439e85b7-C001-57"
],
[
"b71321a9252376308d627c439e85b7-C001-80"
],
[
"b71321a9252376308d627c439e85b7-C001-86",
"b71321a9252376308d627c439e85b7-C001-87"
]
],
"cite_sentences": [
"b71321a9252376308d627c439e85b7-C001-30",
"b71321a9252376308d627c439e85b7-C001-57",
"b71321a9252376308d627c439e85b7-C001-80",
"b71321a9252376308d627c439e85b7-C001-86"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"b71321a9252376308d627c439e85b7-C001-92"
],
[
"b71321a9252376308d627c439e85b7-C001-104"
]
],
"cite_sentences": [
"b71321a9252376308d627c439e85b7-C001-92",
"b71321a9252376308d627c439e85b7-C001-104"
]
}
}
},
"ABC_1baddfeea7d11fc02cc26ff698a601_5": {
"x": [
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-2",
"text": "The dominant language models (LMs) such as n-gram and neural network (NN) models represent sentence probabilities in terms of conditionals."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-3",
"text": "In contrast, a new trans-dimensional random field (TRF) LM has been recently introduced to show superior performances, where the whole sentence is modeled as a random field."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-4",
"text": "In this paper, we further develop the TDF LMs with two technical improvements, which are a new method of exploiting Hessian information in parameter optimization to further enhance the convergence of the training algorithm and an enabling method for training the TRF LMs on large corpus which may contain rare very long sentences."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-5",
"text": "Experiments show that the TRF LMs can scale to using training data of up to 32 million words, consistently achieve 10% relative perplexity reductions over 5-gram LMs, and perform as good as NN LMs but with much faster speed in calculating sentence probabilities."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-6",
"text": "Moreover, we examine how the TRF models can be interpolated with the NN models, and obtain 12.1% and 17.9% relative error rate reductions over 6-gram LMs for English and Chinese speech recognition respectively through log-linear combination."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-9",
"text": "Language modeling (LM) involves determining the joint probability of words in a sentence."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-10",
"text": "The conditional approach is dominant, representing the joint probability in terms of conditionals."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-11",
"text": "Examples include n-gram LMs [1] and neural network (NN) LMs [2, 3] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-12",
"text": "We have recently introduced a new transdimensional random field (TRF 1 ) LM [4] , where the whole sentence is modeled as a random field."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-13",
"text": "As the random field approach avoids local normalization which is required in the conditional approach, it is computationally more efficient in computing sentence probabilities and has the potential advantage of being able to flexibly integrating a richer set of features."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-14",
"text": "We developed an effective training algorithm using joint stochastic approximation (SA) and trans-dimensional mixture sampling."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-15",
"text": "We found that the TRF models significantly outperformed the modified Kneser-Ney (KN) smoothed 4-gram LM with 9.1% relative reduction in speech recognition word error rates (WERs) and performed slightly better than the recurrent neural network LMs but with 200x faster speed in re-scoring n-best lists of hypothesized sentences."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-16",
"text": "To our knowledge, this result represents the first strong empirical evidence supporting the power of using the whole-sentence random field approach for LMs [5] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-17",
"text": "In this paper, we further develop the TDF LMs with two technical improvements."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-18",
"text": "Moreover, we perform more experiments to investigate whether the TRF models can scale to using larger corpus and how the TRF models can be interpolated with NN models to further improve the performance."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-19",
"text": "Improvements: First, in [4] , the diagonal elements of the Hessian matrices are online estimated during the SA iterations to rescale the gradients, which is shown to benefit the convergence of the training algorithm."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-20",
"text": "In this paper, inspired from [6, 7] , we propose a simpler but more effective method which directly uses the empirical variances to rescale the gradients."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-21",
"text": "As the empirical variances are calculated offline, this also reduces computational and memory cost during the SA iterations."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-22",
"text": "The experimental results show that this new manner of exploiting second-order information in parameter optimization further enhances the convergence of the training algorithm."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-23",
"text": "Second, to enable the training of TRF LMs on large corpus which may contain rare very long sentences, we introduce a special sub-model to model the sequences longer than a fixed length."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-24",
"text": "Experiments: First, we examine the scalability of the TRF LMs, by incrementally increasing the size of the training set."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-25",
"text": "Based on training data of up to 32 million words from Google 1-billion word corpus [8] , we build TRF LMs with up to 36 million features."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-26",
"text": "The TRF LMs consistently achieve 10% relative perplexity reductions over the KN 5-gram LMs."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-27",
"text": "Then speech recognition experiments are conducted for both English and Chinese."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-28",
"text": "Three kinds of interpolation schemes, namely linear interpolation (word-level or sentence-level) and log-linear interpolation, are evaluated for model combinations between KN n-gram LMs, NN LMs and our TRF LMs."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-29",
"text": "Through loglinearly combining TRF LMs with NN LMs, we obtain 12.1% and 17.9% relative error rate reductions over the KN 6-gram LMs for English and Chinese speech recognition respectively."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-31",
"text": "**IMPROVEMENTS**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-33",
"text": "**BACKGROUND OF TRF MODEL DEFINITION AND TRAINING**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-34",
"text": "Throughout, we denote by x l = (x1, . . . , x l ) a sentence (i.e., word sequence) of length l ranging from 1 to m. Each element of x l corresponds to a single word."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-35",
"text": "D denotes the whole training corpus and D l denotes the collection of length l in the training corpus."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-36",
"text": "n l denotes the size of D l and n = m l=1 n l ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-37",
"text": "As defined in [4] , a trans-dimensional random field model represents the joint probability of the pair (l, x l ) as"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-38",
"text": "where n l /n is the empirical probability of length l. f ("
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-39",
"text": "T is the feature vector, which is usually defined to be position-independent and length-independent, e.g."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-40",
"text": "is the normalization constant of length l. By making explicit the role of length in model definition, it is clear that the model in (1) is a mixture of random fields on sentences of different lengths (namely on subspaces of different dimensions), and hence will be called a trans-dimensional random field (TRF)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-41",
"text": "In the joint SA training algorithm [4] , we define another form of mixture distribution as follows:"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-42",
"text": "where \u03b6 = {\u03b61, . . . , \u03b6m} with \u03b61 = 0 and \u03b6 l is the hypothesized value of the log ratio of Z l (\u03bb) with respect to Z1(\u03bb), namely log"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-43",
"text": ". Z1(\u03bb) is chosen as the reference value and can be calculated exactly."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-44",
"text": "An important observation is that if and only if \u03b6 were equal to the true log ratios, then the marginal probability of length l under distribution (2) equals to n l /n."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-45",
"text": "We then use this property to construct the joint SA algorithm, which jointly estimates the model parameters and normalization constants."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-47",
"text": "**IMPROVED STOCHASTIC APPROXIMATION**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-48",
"text": "In order to make use of Hessian information in parameter optimization, we use the online estimated Hessian diagonal elements to rescale the gradients in [4] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-49",
"text": "In this paper, inspired from [6, 7] , we propose a simpler but more effective method which directly uses the empirical variances to rescale the gradients."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-50",
"text": "The improved SA algorithm is described as follows."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-51",
"text": "At each iteration t (from 1 to tmax), we perform two steps."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-52",
"text": "Step I: MCMC sampling: Generate a sample set B (t) with p(l, x l ; \u03bb (t\u22121) , \u03b6 (t\u22121) ) as the stationary distribution, using the trans-dimensional mixture sampling method (See Section 3.3 in [4] )."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-53",
"text": "Step II: SA updating: Compute"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-54",
"text": "and"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-55",
"text": "nm/n (4)"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-56",
"text": "where \u03b3 \u03bb,t ,\u03b3 \u03b6,t are the learning rate of \u03bb and \u03b6."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-57",
"text": "is the relative frequency of length l appearing in B (t) ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-58",
"text": "In Eq.3, \u03c3 = diag(\u03c31, . . . , \u03c3 d ) is a diagonal matrix and the element \u03c3i (1 \u2264 i \u2264 d) is the empirical variance of feature fi:"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-59",
"text": ". It is shown in [6] that the convergence speed of log-linear model training is improved when the means and variances of the input features are normalized."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-60",
"text": "Our update rule in Eq.3 performs the normalization on model side instead of explicitly normalizing the features, which is similar to [7] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-61",
"text": "As the empirical variances are calculated offline, this also reduces computational and memory cost during the SA iterations."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-62",
"text": "Our empirical result shows that compared to using the online estimated Hessian elements which are noisy, using the empirical variances which are exactly calculated can improve the convergence significantly, especially on large dataset with millions of features."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-63",
"text": "Fig.1 show an example of convergence curves of the SA training algorithm in [4] and the new improved SA."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-65",
"text": "**MODELING RARE VERY LONG SENTENCES**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-66",
"text": "Training TRF models on large corpus needs to address the challenge to handling rare very long sentences."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-67",
"text": "For example in our experiments, the maximum length in Google 1-billion word corpus [8] is more than 1000."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-68",
"text": "To reduce the sub-model number and the size of the model space, we set the maximum length of TRF m to a medium length (such as 100) and introduce a special sub-model p(> m, x j ; \u03bb, \u03b6) to represent the sentences longer than m as follows:"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-69",
"text": "where \u03b6 = {\u03b61, . . . , \u03b6m, \u03b6>m} and n>m is the number of the sentence longer than m in the training set."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-70",
"text": "As the features f (x l ) used in our study are usually both length-independent and position-independent (such as the counts of the n-grams observed in a sentence), calculating f (x >m ) is straightforward and few modifications to the training algorithm are needed."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-71",
"text": "To estimate \u03b6>m, the maximum length in the trans-dimensional mixture sampling is set to be slightly larger than m (we set m + 2 in our experiments) and the length expectation on sampling set \u03b4>m("
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-72",
"text": "is calculated to update \u03b6>m based Eq.4 and Eq.5."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-74",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-76",
"text": "**CONFIGURATION OF TRFS**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-77",
"text": "In the following experiments, we consider a variety of features for our TRF LMs as shown in Tab.1, mainly based word and class information."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-78",
"text": "Each word in the vocabulary is deterministically assigned to a single class, by running the word clustering algorithm proposed in [9] on the training data."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-79",
"text": "In Tab.1, wi, ci, i = 0, \u22121, . . . , \u22129 denote the word and its class at different position offset i, e.g. w0, c0 denotes the current word and its class."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-80",
"text": "We first introduce the classic word/class n-gram features (denoted by \"w\"/\"c\") and the word/class skipping n-gram features (denoted by \"ws\"/\"cs\") l ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-81",
"text": "p1(xi|hi) and p2(xi|hi) are the conditional probabilities of xi given history hi estimated by two LMs; p1(x l ) and p2(x l ) are the joint probabilities of the whole sentence x l estimated by two LMs."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-82",
"text": "0 < \u03b1 < 1 is the interpolation weight which is tuned on the development set."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-83",
"text": "[10] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-84",
"text": "Second, to demonstrate that long-span features can be naturally integrated in TRFs, we introduce higher-order features \"wsh\"/\"csh\", by considering two words/classes separated with longer distance."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-85",
"text": "Third, as an example of supporting heterogenous features that combine different information, the crossing features \"cpw\" (meaning class-predict-word) are introduced."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-86",
"text": "In the end, we introduce the tied long-skip-bigram features \"tied\" [11] , in which the skip-bigrams with skipping distances from 6 to 9 share the same parameter."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-87",
"text": "In this way we can leverage long distance context without increasing the model size."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-88",
"text": "Note that for all the feature types in Tab.1, only the features observed in the training data are used."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-89",
"text": "The improved SA algorithm (in Section 2.2) is used to train the TRF LMs, in conjunction with the trans-dimensional mixture sampling proposed in Section 3.3 of [4] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-90",
"text": "The learning rates of \u03bb and \u03b6 are set as suggested in [4] :"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-91",
"text": "where tc, t0 are constants and 0.5 < \u03b2 \u03bb , \u03b2 \u03b6 < 1."
},
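The exact form of the learning-rate schedule (Eq.8, following [4]) is not reproduced in this text; the sketch below is a hypothetical polynomially decaying schedule that merely satisfies the stated constraints (constants tc, t0 and exponents 0.5 < \u03b2 < 1), not the authors' exact formula.

```python
def sa_learning_rate(t, tc, t0, beta):
    """Hypothetical SA-style schedule (illustrative only): constant
    rate 1/tc up to iteration t0, then polynomial decay with
    exponent beta, as 1 / (tc + (t - t0)^beta)."""
    assert 0.5 < beta < 1
    return 1.0 / (tc + max(0, t - t0) ** beta)
```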
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-92",
"text": "The class information is also used to accelerate the sampling, and more than one CPU cores are used to parallelize the algorithm, as described in [4] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-93",
"text": "Also, we examine how the TRF models can be interpolated with NN models to further improve the performance."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-94",
"text": "Linear combination can be done at either word-level (W) or sentencelevel (S)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-95",
"text": "In contrast, log-linear combination has no such differentiation."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-96",
"text": "The three schemes are detailed in Tab.2."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-97",
"text": "As TRF models only output sentence probabilities, the \"W\" scheme is not applicable in combining TRF models with other models."
},
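The word-level (W), sentence-level (S) and log-linear combination schemes referenced in Tab.2 can be sketched as follows. The function names are illustrative, and the log-linear score is left unnormalized, which suffices for N-best rescoring since the global normalizer is shared across hypotheses.

```python
import math

def word_level_interp(p1_words, p2_words, alpha):
    """Word-level (W) linear interpolation: mix the two LMs'
    conditional probabilities at every position, then multiply."""
    prob = 1.0
    for q1, q2 in zip(p1_words, p2_words):
        prob *= alpha * q1 + (1 - alpha) * q2
    return prob

def sentence_level_interp(p1_sent, p2_sent, alpha):
    """Sentence-level (S) linear interpolation of the two LMs'
    joint sentence probabilities."""
    return alpha * p1_sent + (1 - alpha) * p2_sent

def log_linear_interp(p1_sent, p2_sent, alpha):
    """Log-linear combination; returns an unnormalized log-score."""
    return alpha * math.log(p1_sent) + (1 - alpha) * math.log(p2_sent)
```

Since TRF models only output sentence probabilities, only the (S) and log-linear schemes apply when a TRF model is one of the components.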
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-99",
"text": "**USING TRAINING DATA OF UP TO 32 MILLION WORDS**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-100",
"text": "In this section, we show the scalability of TRF LMs as well as the effectiveness of the improved SA training algorithm to handle tens of millions of features."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-101",
"text": "The experiments are performed on part of Google 1-billion word corpus 2 Table 3 : The perplexities (PPL) of various LMs with different sizes of training data (8M, 16M, 32M) from Google 1-billion word corpus."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-102",
"text": "The cutoff settings of n-gram LMs are 0002 (KN4) and 00002 (KN5)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-103",
"text": "The feature type of TRF is \"w+c+ws+cs+wsh+csh\". \"#feat\" is the feature size (million)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-105",
"text": "The whole held-out corpus contains 50 files and each file contains about 160K words."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-106",
"text": "In our experiments, we choose the first and second file (i.e. \"news.en.heldout-00000-of-00050\" and \"news.en.heldout-00001-of-00050\") in the held-out corpus as the development set and test set respectively, and incrementally increase the training set as used in our experiments."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-107",
"text": "First, we use one training file (about 8 million words) as the training set."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-108",
"text": "We exact the 20K most frequent words from the training set to construct the lexicon."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-109",
"text": "Then for the training, development and test set, all the words out of the lexicon are mapped to an auxiliary token . The word clustering algorithm in [9] is performed on the training set to cluster the words into 200 classes."
},
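The lexicon construction above can be sketched as follows; the actual auxiliary token symbol was lost in extraction, so "<unk>" is used here purely as a stand-in placeholder.

```python
from collections import Counter

def build_lexicon(train_tokens, size=20000):
    """Keep the `size` most frequent training words as the lexicon."""
    counts = Counter(train_tokens)
    return {w for w, _ in counts.most_common(size)}

def map_oov(tokens, lexicon, unk="<unk>"):
    """Map every out-of-lexicon word to the auxiliary token."""
    return [t if t in lexicon else unk for t in tokens]
```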
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-110",
"text": "We train a TRF model with feature type \"w+c+ws+cs+wsh+csh\" (Tab.1)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-111",
"text": "Then we increase the training size to around 16 million words (2 training files)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-112",
"text": "We still exact the 20K most frequent words from the current training set to construct a new lexicon."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-113",
"text": "After re-clustering the words into 200 classes, a new TRF model with the same feature type \"w+c+ws+cs+wsh+csh\" is trained over the new training set."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-114",
"text": "We repeat the above process and train the third TRF model on 32 million words (4 training files)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-115",
"text": "In the experiments, the maximum length of TRFs is m = 100."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-116",
"text": "At each iteration, we generate K = 300 samples with length ranging from 1 to 102."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-117",
"text": "The learning rate \u03b3 \u03bb and \u03b3 \u03b6 are configured as Eq.8 with \u03b2 \u03bb = 0.8, \u03b2 \u03b6 = 0.6 and tc = 1000, t0 = 2000, tmax = 50, 000."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-118",
"text": "L2 regularization with constant 10 \u22125 is used to avoid over-fitting."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-119",
"text": "12 CPU cores are used to parallelize the training algorithm."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-120",
"text": "The perplexities on the test set are shown in Tab.3 and Fig.2 ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-121",
"text": "These results show that the TRF LMs can scale to using training data of up to 32 million words."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-122",
"text": "They consistently achieve 10% relative perplexity reductions over the modified KneserNey (KN) smoothed 5-gram LMs [1] , and notably the model sizes in comparisons are close."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-123",
"text": "Table 4 : The WERs and PPLs on the WSJ'92 test data. \"#feat\" denotes the feature size (million). \"(W),(S),(Log)\" denote different model combination schemes as defined in Tab.2."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-124",
"text": "\"TRF*\" denotes the TRF with features \"w+c+ws+cs+wsh+csh+tied\"."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-126",
"text": "**ENGLISH SPEECH RECOGNITION ON WSJ0 DATASET**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-127",
"text": "In this section, speech recognition and 1000-best list rescoring experiments are conducted as configured in [4] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-128",
"text": "The maximum length of TRFs is m = 82, which is equal to the maximum length of the training sentences."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-129",
"text": "The other configurations are: K = 300, \u03b2 \u03bb = 0.8, \u03b2 \u03b6 = 0.6, tc = 3000, t0 = 2000, tmax = 20, 000."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-130",
"text": "L2 regularization with constant 4 \u00d7 10 \u22125 is used to avoid over-fitting."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-131",
"text": "6 CPU cores are used to parallelize the algorithm."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-132",
"text": "The word error rates (WERs) and perplexities (PPLs) on WSJ'92 test set are shown in Tab.4."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-133",
"text": "The TRF LMs are compared with the classic KN n-gram LMs, the RNN LM [3] and the results reported in [4] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-134",
"text": "Compared to the results in [4] , the improved SA proposed in Section 2.2 gives the same WERs but lower PPLs in using the same feature types."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-135",
"text": "Introducing the tied skip-bigram features further reduce the WER."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-136",
"text": "Combining TRF and KN5 provides no WER reduction."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-137",
"text": "Different schemes give close WERs for combining TRF and RNN."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-138",
"text": "The log-linear interpolation performs more stable when considering both English and Chinese experiments (as shown later)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-139",
"text": "For English, the obtained WER 7.57% indicates 12.1% and 3.6% relative reductions, when compared to the result of using KN6 (8.61%) and the best result of combining RNN and KN5 (7.85%) respectively."
},
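The relative reductions quoted above can be checked with a one-line helper (a sanity check on the arithmetic, not part of the paper):

```python
def rel_reduction(baseline, improved):
    """Relative error-rate reduction, in percent."""
    return 100.0 * (baseline - improved) / baseline
```

For instance, rel_reduction(8.61, 7.57) is about 12.1 and rel_reduction(7.85, 7.57) is about 3.6, matching the reported figures.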
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-141",
"text": "**CHINESE SPEECH RECOGNITION ON TOSHIBA DATASET**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-142",
"text": "In this section we report the results from using TRF LMs in a large vocabulary Mandarin speech recognition experiment."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-143",
"text": "Different LMs are evaluated by rescoring 30000-best list from a Toshiba's internal test set (2975 utterances)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-144",
"text": "The oracle character error rate (CER) of the 30000-best lists is 1.61%, which are generated with a DNN-based acoustic model."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-145",
"text": "The LM corpus is from Toshiba, which contains about 20M words."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-146",
"text": "We randomly select 1% from the corpus as the development set and others as the training set."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-147",
"text": "The vocabulary contains 82K words, with one special token . The NN LM used here is the feedforward neural network (FNN) LM [2, 12] trained by CSLM toolkit 3 ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-148",
"text": "The number of hidden units is 512 and the projection layer units is 3\u00d7128."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-149",
"text": "The TRF models are trained using the fea- 4.0 Table 5 : The CERs and PPLs on the test set in Chinese speech recognition."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-150",
"text": "\"TRF*\" denotes the TRF trained with 400 classes."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-151",
"text": "ture set \"w+c+ws+cs+cpw\" with different numbers of classes (200,400,600)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-152",
"text": "The configurations are: m = 100, \u03b2 \u03bb = 0.8, \u03b2 \u03b6 = 0.6, tc = 1000, t0 = tmax = 20000."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-153",
"text": "The sample number K is increased from 300 until no improvements on the development set are observed, and finally set to be 8000."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-154",
"text": "20 CPU cores are used to parallelize the algorithm."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-155",
"text": "The CERs and PPLs on the test set are shown in Tab."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-156",
"text": "5 ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-157",
"text": "The results again demonstrate that the TRF LMs significantly outperform the n-gram LMs and are able to match the NN LM."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-158",
"text": "Log-linear combination of TRF and FNN further reduces the CER."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-159",
"text": "The obtained CER 4.0% indicates 17.9% and 1.5% relative reductions, when compared to the result of using KN6 (4.87%) and the best result of combining FNN and KN6 (4.06%) respectively."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-161",
"text": "**RELATED WORK AND DISCUSSION**"
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-162",
"text": "Currently, most attention of LM research is attracted by using neural networks, which has been shown to surpass the classic n-gram LMs."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-163",
"text": "Two basic classes of NN LMs are based on FNN [2] and RNN [3] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-164",
"text": "Recent extensions involve the use of sumproduct networks [13] , deep recurrent neural networks [14] and feedforward sequential memory networks [15] ; only perplexity results are reported in these studies."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-165",
"text": "Crucially, no matter what form the networks take, various NN LMs follow the conditional approach and thus suffer from the expensive softmax computations due to the requirement of local normalization."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-166",
"text": "Lots of studies aim to alleviate this deficiency."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-167",
"text": "Initial efforts include using hierarchical output layer structure with word clustering [3] , converting NNs to n-gram LMs [16] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-168",
"text": "Recently a number of studies [17, 18, 19, 20 ] make use of noise contrastive estimation (NCE) [21] to build unnormalized variants of NN LMs through trickily avoiding local normalization in training and heuristically fixing the normalizing term in testing."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-169",
"text": "In contrast, TRF LMs eliminate local normalization from the root and thus are much more efficient in testing with theoretical guarantee."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-170",
"text": "Empirically in our experiments reported in Section 3.3, the average time costs for re-ranking of the 1000-best list for a sentence are 0.16 sec vs. 40 sec, based on TRF and RNN respectively (no GPU used)."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-171",
"text": "Equally importantly, evaluations in this paper and also in [4] have shown that TRF LMs are able to perform as good as NN LMs (either RNN or FNN) on a variety of tasks."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-172",
"text": "Encouragingly, the random field approach may open a new door to language modeling in addition to the dominant conditional approach, as once envisioned in [5] ."
},
{
"sent_id": "1baddfeea7d11fc02cc26ff698a601-C001-173",
"text": "Integrating richer features and introducing hidden variables are worthwhile future works."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"1baddfeea7d11fc02cc26ff698a601-C001-12",
"1baddfeea7d11fc02cc26ff698a601-C001-13"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-19"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-37",
"1baddfeea7d11fc02cc26ff698a601-C001-38",
"1baddfeea7d11fc02cc26ff698a601-C001-39",
"1baddfeea7d11fc02cc26ff698a601-C001-40"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-48"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-52"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-63"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-89"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-90",
"1baddfeea7d11fc02cc26ff698a601-C001-91"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-92"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-171"
]
],
"cite_sentences": [
"1baddfeea7d11fc02cc26ff698a601-C001-12",
"1baddfeea7d11fc02cc26ff698a601-C001-19",
"1baddfeea7d11fc02cc26ff698a601-C001-37",
"1baddfeea7d11fc02cc26ff698a601-C001-48",
"1baddfeea7d11fc02cc26ff698a601-C001-52",
"1baddfeea7d11fc02cc26ff698a601-C001-63",
"1baddfeea7d11fc02cc26ff698a601-C001-89",
"1baddfeea7d11fc02cc26ff698a601-C001-90",
"1baddfeea7d11fc02cc26ff698a601-C001-92",
"1baddfeea7d11fc02cc26ff698a601-C001-171"
]
},
"@MOT@": {
"gold_contexts": [
[
"1baddfeea7d11fc02cc26ff698a601-C001-19"
]
],
"cite_sentences": [
"1baddfeea7d11fc02cc26ff698a601-C001-19"
]
},
"@EXT@": {
"gold_contexts": [
[
"1baddfeea7d11fc02cc26ff698a601-C001-41",
"1baddfeea7d11fc02cc26ff698a601-C001-42",
"1baddfeea7d11fc02cc26ff698a601-C001-43"
]
],
"cite_sentences": [
"1baddfeea7d11fc02cc26ff698a601-C001-41"
]
},
"@USE@": {
"gold_contexts": [
[
"1baddfeea7d11fc02cc26ff698a601-C001-48"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-52"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-89"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-90",
"1baddfeea7d11fc02cc26ff698a601-C001-91"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-92"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-127",
"1baddfeea7d11fc02cc26ff698a601-C001-128",
"1baddfeea7d11fc02cc26ff698a601-C001-129",
"1baddfeea7d11fc02cc26ff698a601-C001-130",
"1baddfeea7d11fc02cc26ff698a601-C001-131",
"1baddfeea7d11fc02cc26ff698a601-C001-132"
]
],
"cite_sentences": [
"1baddfeea7d11fc02cc26ff698a601-C001-48",
"1baddfeea7d11fc02cc26ff698a601-C001-52",
"1baddfeea7d11fc02cc26ff698a601-C001-89",
"1baddfeea7d11fc02cc26ff698a601-C001-90",
"1baddfeea7d11fc02cc26ff698a601-C001-92",
"1baddfeea7d11fc02cc26ff698a601-C001-127"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"1baddfeea7d11fc02cc26ff698a601-C001-133"
],
[
"1baddfeea7d11fc02cc26ff698a601-C001-134"
]
],
"cite_sentences": [
"1baddfeea7d11fc02cc26ff698a601-C001-133",
"1baddfeea7d11fc02cc26ff698a601-C001-134"
]
}
}
},
"ABC_05b53f9e0a347c4f47d0fd066538c7_5": {
"x": [
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-2",
"text": "Event factuality prediction (EFP) is the task of assessing the degree to which an event mentioned in a sentence has happened."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-3",
"text": "For this task, both syntactic and semantic information are crucial to identify the important context words."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-4",
"text": "The previous work for EFP has only combined these information in a simple way that cannot fully exploit their coordination."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-5",
"text": "In this work, we introduce a novel graph-based neural network for EFP that can integrate the semantic and syntactic information more effectively."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-6",
"text": "Our experiments demonstrate the advantage of the proposed model for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-9",
"text": "Events are often presented in sentences via the indication of anchor/trigger words (i.e., the main words to evoke the events, called event mentions) (Nguyen et al., 2016a) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-10",
"text": "Event mentions can appear with varying degrees of uncertainty/factuality to reflect the intent of the writers."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-11",
"text": "In order for the event mentions to be useful (i.e., for knowledge extraction tasks), it is important to determine their factual certainty so the actual event mentions can be retrieved (i.e., the event factuality prediction problem (EFP))."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-12",
"text": "In this work, we focus on the recent regression formulation of EFP that aims to predict a real score in the range of [-3,+3 ] to quantify the occurrence possibility of a given event mention (Stanovsky et al., 2017; Rudinger et al., 2018) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-13",
"text": "This provides more meaningful information for the downstream tasks than the classification formulation of EFP (Lee et al., 2015) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-14",
"text": "For instance, the word \"left\" in the sentence \"She left yesterday.\" would express an event that certainly happened (i.e., corresponding to a score of +3 in the benchmark datasets) while the event mention associated with \"leave\" in the sentence \"She forgot to leave yesterday.\" would certainly not happen (i.e., a score of -3)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-15",
"text": "EFP is a challenging problem as different context words might jointly participate to reveal the factuality of the event mentions (i.e., the cue words), possibly located at different parts of the sentences and scattered far away from the anchor words of the events."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-16",
"text": "There are two major mechanisms that can help the models to identify the cue words and link them to the anchor words, i.e., the syntactic trees (i.e., the dependency trees) and the semantic information (Rudinger et al., 2018) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-17",
"text": "For the syntactic trees, they can connect the anchor words to the functional words (i.e., negation, modal auxiliaries) that are far away, but convey important information to affect the factuality of the event mentions."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-18",
"text": "For instance, the dependency tree of the sentence \"I will, after seeing the treatment of others, go back when I need medical care.\" will be helpful to directly link the anchor word \"go\" to the modal auxiliary \"will\" to successfully predict the non-factuality of the event mention."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-37",
"text": "**RELATED WORK**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-19",
"text": "Regarding the semantic information, the meaning of the some important context words in the sentences can contribute significantly to the factuality of an event mention."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-20",
"text": "For example, in the sentence \"Knight lied when he said I went to the ranch.\", the meaning represented by the cue word \"lied\" is crucial to classify the event mention associated with the anchor word \"went\" as non-factual."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-21",
"text": "The meaning of such cue words and their interactions with the anchor words can be captured via their distributed representations (i.e., with word embeddings and long-short term memory networks (LSTM)) (Rudinger et al., 2018) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-22",
"text": "The current state-of-the-art approach for EFP has involved deep learning models (Rudinger et al., 2018 ) that examine both syntactic and semantic information in the modeling process."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-23",
"text": "However, in these models, the syntactic and semantic information are only employed separately in the different deep learning architectures to generate syntactic and semantic representations."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-24",
"text": "Such representations are only concatenated in the final stage to perform the factuality prediction."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-25",
"text": "A major problem with this approach occurs in the event mentions when the syntactic and semantic information cannot identify the important structures for EFP individually (i.e., by itself)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-26",
"text": "In such cases, both the syntactic and semantic representations from the separate deep learning models would be noisy and/or insufficient, causing the poor quality of their simple combination for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-27",
"text": "For instance, consider the previous example with the anchor word \"go\": \"I will, after seeing the treatment of others, go back when I need medical care.\"."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-28",
"text": "On the one hand, while syntactic information (i.e., the dependency tree) can directly connect \"will\" to \"go\", it will also promote some noisy words (i.e., \"back\") at the same time due to the direct links (see the dependency tree in Figure 1 )."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-29",
"text": "On the other hand, while deep learning models with the sequential structure can help to downgrade the noisy words (i.e., \"back\") based on the semantic importance and the close distance with \"go\", these models will struggle to capture \"will\" for the factuality of \"go\" due to their long distance."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-30",
"text": "From this example, we also see that the syntactic and semantic information can complement each other to both promote the important context words and blur the irrelevant words."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-31",
"text": "Consequently, we argue that the syntactic and semantic information should be allowed to interact earlier in the modeling process to produce more effective representations for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-32",
"text": "In particular, we propose a novel method to integrate syntactic and semantic structures of the sentences based on the graph convolutional neural networks (GCN) (Kipf and Welling, 2016) for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-33",
"text": "The modeling of GCNs involves affinity matrices to quantify the connection strength between pairs of words, thus facilitating the integration of syntactic and semantic information."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-34",
"text": "In the proposed model, the semantic affinity matrices of the sentences are induced from Long Short-Term Memory networks (LSTM) that are then linearly integrated with the syntactic affinity matrices of the dependency trees to produce the enriched affinity matrices for GCNs in EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-35",
"text": "The extensive experiments show that the proposed model is very effective for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-38",
"text": "EFP is one of the fundamental tasks in Information Extraction."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-39",
"text": "The early work on this problem has employed the rule-based approaches (Nairn et al., 2006; Saur\u00ed, 2008; Lotan et al., 2013) or the machine learning approaches (with manually designed features) (Diab et al., 2009; Prabhakaran et al., 2010; De Marneffe et al., 2012; Lee et al., 2015) , or the hybrid approaches of both (Saur\u00ed and Pustejovsky, 2012; Qian et al., 2015) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-40",
"text": "Recently, deep learning has been applied to solve EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-41",
"text": "(Qian et al., 2018) employ Generative Adversarial Networks (GANs) for EFP while (Rudinger et al., 2018) utilize LSTMs for both sequential and dependency representations of the input sentences."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-42",
"text": "Finally, deep learning has also been considered for the related tasks of EFP, including event detection (Nguyen and Grishman, 2015b; Nguyen et al., 2016b; Lu and Nguyen, 2018; Nguyen and Nguyen, 2019) , event realis classification (Mitamura et al., 2015; Nguyen et al., 2016g) , uncertainty detection (Adel and Sch\u00fctze, 2017) , modal sense classification (Marasovic and Frank, 2016) and entity detection (Nguyen et al., 2016d) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-43",
"text": "the current event mention has happened."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-44",
"text": "There are three major components in the EFP model proposed in this work, i.e., (i) sentence encoding, (ii) structure induction, and (iii) prediction."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-46",
"text": "**SENTENCE ENCODING**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-47",
"text": "The first step is to convert each word in the sentences into an embedding vector."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-48",
"text": "In this work, we employ the contextualized word representations BERT in (Devlin et al., 2018) for this purpose."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-49",
"text": "BERT is a pre-trained language representation model with multiple computation layers that has been shown to improve many NLP tasks."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-50",
"text": "In particular, the sentence (x 1 , x 2 , ..., x n ) would be first fed into the pre-trained BERT model from which the contextualized embeddings of the words in the last layer are used for further computation."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-51",
"text": "We denote such word embeddings for the words in (x 1 , x 2 , . . . , x n ) as (e 1 , e 2 , . . . , e n ) respectively."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-52",
"text": "In the next step, we further abstract (e 1 , e 2 , . . . , e n ) for EFP by feeding them into two layers of bidirectional LSTMs (as in (Rudinger et al., 2018) )."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-53",
"text": "This produces (h 1 , h 2 , . . . , h n ) as the hidden vector sequence in the last bidirectional LSTM layer (i.e., the second one)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-54",
"text": "We consider (h 1 , h 2 , . . . , h n ) as a rich representation of the input sentence (x 1 , x 2 , . . . , x n ) where each vector h i encapsulates the context information of the whole input sentence with a greater focus on the current word x i ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-56",
"text": "**STRUCTURE INDUCTION**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-57",
"text": "Given the hidden representation (h 1 , h 2 , . . . , h n ), it is possible to use the hidden vector corresponding to the anchor word h k as the features to perform factuality prediction (as done in (Rudinger et al., 2018) )."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-58",
"text": "However, despite the rich context information over the whole sentence, the features in h k are not directly designed to focus on the import context words for factuality prediction."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-59",
"text": "In order to explicitly encode the information of the cue words into the representations for the anchor word, we propose to learn an importance matrix A = (a ij ) i,j=1..n in which the value in the cell a ij quantifies the contribution of the context word x i for the hidden representation at x j if the representation vector at x j is used to form features for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-60",
"text": "The importance matrix A would then be used as the adjacent/weight matrix in the graph convolutional neural networks (GCNs) (Kipf and Welling, 2016; Nguyen and Grishman, 2018 ) to accumulate the current hidden representations of the context words into the new hidden representations for each word in the sentence."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-61",
"text": "In order to learn the weight matrix A, as presented in the introduction, we propose to leverage both semantic and syntactic structures of the input sentence."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-62",
"text": "In particular, for the semantic structure, we use the representation vectors from LSTMs for x i and x j (i.e., h i and h j ) as the features to compute the contribution score in the cell a sem ij of the semantic weight matrix"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-63",
"text": "Note that we omit the biases in the equations of this paper for convenience."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-64",
"text": "In the equations above,"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-65",
"text": "is a scalar to determine the amount of information that should be sent from the context word x i to the representation at x j based on the semantic relevance for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-66",
"text": "In the next step for the syntactic structure, we employ the dependency tree for the input sentence to generate the adjacent/weight matrix A syn = (a syn ij ) i,j=1..n , where a syn ij is set to 1 if x i is connected to x j in the tree, and 0 otherwise."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-67",
"text": "Note that we augment the dependency trees with the selfconnection and reverse edges to improve the coverage of the weight matrix."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-68",
"text": "Finally, the weight matrix A for GCNs would be the linear combination of the sematic structure A sem and the syntactic structure A syn with the trade-off \u03bb:"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-69",
"text": "Given the weight matrix A, the GCNs (Kipf and Welling, 2016) are applied to augment the representations of the words in the input sentence with the contextual representations for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-70",
"text": "In particular, let H 0 be the the matrix with (h 1 , h 2 , . . . , h n ) as the rows:"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-71",
"text": "One layer of GCNs would take an input matrix H i (i \u2265 0) and produce the output matrix H i+1 :"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-72",
"text": "where g is a non-linear function."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-73",
"text": "In this work, we employ two layers of GCNs (optimized on the development datasets) on the input matrix H 0 , resulting in the semantically and syntactically enriched matrix H 2 with the rows of (h"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-75",
"text": "**PREDICTION**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-76",
"text": "This component predicts the factuality degree of the input event mention based on the context- . . , h g n )."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-77",
"text": "In particular, as the anchor word is located at the k-th position (i.e., the word x k ), we first use the vector h These attention weights would then be employed to obtain the weighted sum of (h g 1 , h g 2 , . . . , h g n ) to produce the feature vector V :"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-78",
"text": "where W a 1 , W a 2 and W a 3 are the model parameters."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-79",
"text": "The attention weights \u03b1 \u2032 i would help to promote the contribution of the important context words for the feature vector V for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-80",
"text": "Finally, similar to (Rudinger et al., 2018) , the feature vector V is fed into a regression model with two layers of feed-forward networks to produce the factuality score."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-81",
"text": "Following (Rudinger et al., 2018) , we train the proposed model by optimizing the Huber loss with \u03b4 = 1 and the Adam optimizer with learning rate = 1.0."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-83",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-85",
"text": "**DATASETS, RESOURCES AND PARAMETERS**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-86",
"text": "Following the previous work (Stanovsky et al., 2017; Rudinger et al., 2018) , we evaluate the proposed EFP model using four benchmark datasets: FactBack (Saur\u00ed and Pustejovsky, 2009 ), UW (Lee et al., 2015) , Meantime (Minard et al., 2016) and UDS-IH2 (Rudinger et al., 2018) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-87",
"text": "The first three datasets (i.e., FactBack, UW, and Meantime) are the unified versions described in (Stanovsky et al., 2017) where the original annotations for these datasets are scaled to a number in [-3, +3] ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-88",
"text": "For the fourth dataset (i.e., UDS-IH2), we follow the instructions in (Rudinger et al., 2018) to scale the scores to the range of [-3, +3] ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-89",
"text": "Each dataset comes with its own training data, test data and development data."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-90",
"text": "Table 2 shows the numbers of examples in all data splits for each dataset used in this paper."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-91",
"text": "We tune the parameters for the proposed model on the development datasets."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-92",
"text": "The best values we find in the tuning process include: 300 for the number of hidden units in the bidirectional LSTM layers, 1024 for the dimension of the projected vector h \u2032 i in the structure induction component, 300 for the number of feature maps for the GCN layers, 600 for the dimention of the transformed vectors for attention based on (W a 1 , W a 2 , W a 3 ), and 300 for the number of hidden units in the two layers of the final regression model."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-93",
"text": "For the tradeoff parameter \u03bb between the semantic and syntactic structures, the best value for the datasets FactBack, UW and Meantime is \u03bb = 0.6 while this value for UDS-IH2 is \u03bb = 0.8."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-95",
"text": "**COMPARING TO THE STATE OF THE ART**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-96",
"text": "This section evaluates the effectiveness of the proposed model for EFP on the benchmark datasets."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-97",
"text": "We compare the proposed model with the best reported systems in the literature with linguistic features (Lee et al., 2015; Stanovsky et al., 2017) and deep learning (Rudinger et al., 2018) ."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-98",
"text": "Table 1 shows the performance."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-99",
"text": "Importantly, to achieve a fair comparison, we obtain the actual implementation of the current state-of-the-art EFP models from (Rudinger et al., 2018) , introduce the BERT embeddings as the inputs for those models and compare them with the proposed models (i.e., the rows with \"+BERT\")."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-100",
"text": "Following the prior work, we use MAE (Mean Absolute Error), and r (Pearson Correlation) as the performance measures."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-101",
"text": "In the table, we distinguish two methods to train the models investigated in the previous work: (i) training and evaluating the models on separate datasets (i.e., the rows associated with *), and (ii) training the models on the union of FactBank, UW and Meantime, resulting in single models to be evaluated on the separate datasets (i.e., the rows with **)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-102",
"text": "It is also possible to train the models on the union of all the four datasets (i.e., FactBank, UW, Meantime and UDS-IH2) (corresponding to the rows with w/UDS-IH2 in the table)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-103",
"text": "From the table, we can see that in the first method to train the models the proposed model is significantly better than all the previous models on FactBank, UW and UDS-IH2 (except for the MAE measure on UW), and achieves comparable performance with the best model (Stanovsky et al., 2017) on Meantime."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-104",
"text": "In fact, the proposed model trained on the separate datasets also significantly outperforms the current best models on FactBank, UW and UDS-IH2 when these models are trained on the union of the datasets with multi-task learning (except for MAE on Factbank where the performance is comparable)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-105",
"text": "Regarding the second method with multiple datasets for training, the proposed model (only trained on the union of FactBank, UW and Meantime) is further improved, achieving better performance than all the other models in this setting for different datasets and performance measures."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-106",
"text": "Overall, the proposed model yields the state-of-the-art performance over all the datasets and measures (except for MAE on UW with comparable performance), clearly demonstrating the advantages of the model in this work for EFP."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-107",
"text": "Table 3 presents the performance of the proposed model when different elements are excluded to evaluate their contribution."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-108",
"text": "We only analyze the proposed model when it is trained with multiple datasets (i.e., FactBank, UW and Meantime)."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-109",
"text": "However, the same trends are observed for the models trained with separate datasets."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-110",
"text": "As we can see from the table, both semantic and syntactic information are important for the proposed model as eliminating any of them would hurt the performance."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-111",
"text": "Removing both elements (i.e., not using the structure induction component) would significantly downgrade the performance."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-112",
"text": "Finally, we see that both the BERT embeddings and the attention in the prediction are necessary for the proposed model to achieve good performance."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-114",
"text": "**ABLATION STUDY**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-116",
"text": "**CONCLUSION & FUTURE WORK**"
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-117",
"text": "We present a graph-based deep learning model for EFP that exploits both syntactic and semantic structures of the sentences to effectively model the important context words."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-118",
"text": "We achieve the state-ofthe-art performance over several EFP datasets."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-119",
"text": "One potential issue with the current approach is that it is dependent on the existence of the highquality dependency parser."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-120",
"text": "Unfortunately, such parser is not always available in different domains and languages."
},
{
"sent_id": "05b53f9e0a347c4f47d0fd066538c7-C001-121",
"text": "Consequently, in the future work, we plan to develop methods that can automatically induce the sentence structures for EFP."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"05b53f9e0a347c4f47d0fd066538c7-C001-11",
"05b53f9e0a347c4f47d0fd066538c7-C001-12"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-15",
"05b53f9e0a347c4f47d0fd066538c7-C001-16"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-17",
"05b53f9e0a347c4f47d0fd066538c7-C001-18",
"05b53f9e0a347c4f47d0fd066538c7-C001-19",
"05b53f9e0a347c4f47d0fd066538c7-C001-20",
"05b53f9e0a347c4f47d0fd066538c7-C001-21"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-22",
"05b53f9e0a347c4f47d0fd066538c7-C001-23",
"05b53f9e0a347c4f47d0fd066538c7-C001-24",
"05b53f9e0a347c4f47d0fd066538c7-C001-25",
"05b53f9e0a347c4f47d0fd066538c7-C001-26"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-40",
"05b53f9e0a347c4f47d0fd066538c7-C001-41"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-97"
]
],
"cite_sentences": [
"05b53f9e0a347c4f47d0fd066538c7-C001-12",
"05b53f9e0a347c4f47d0fd066538c7-C001-16",
"05b53f9e0a347c4f47d0fd066538c7-C001-21",
"05b53f9e0a347c4f47d0fd066538c7-C001-22",
"05b53f9e0a347c4f47d0fd066538c7-C001-41",
"05b53f9e0a347c4f47d0fd066538c7-C001-97"
]
},
"@MOT@": {
"gold_contexts": [
[
"05b53f9e0a347c4f47d0fd066538c7-C001-22",
"05b53f9e0a347c4f47d0fd066538c7-C001-23",
"05b53f9e0a347c4f47d0fd066538c7-C001-24",
"05b53f9e0a347c4f47d0fd066538c7-C001-25",
"05b53f9e0a347c4f47d0fd066538c7-C001-26"
]
],
"cite_sentences": [
"05b53f9e0a347c4f47d0fd066538c7-C001-22"
]
},
"@USE@": {
"gold_contexts": [
[
"05b53f9e0a347c4f47d0fd066538c7-C001-52"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-80"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-81"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-86"
],
[
"05b53f9e0a347c4f47d0fd066538c7-C001-88"
]
],
"cite_sentences": [
"05b53f9e0a347c4f47d0fd066538c7-C001-52",
"05b53f9e0a347c4f47d0fd066538c7-C001-80",
"05b53f9e0a347c4f47d0fd066538c7-C001-81",
"05b53f9e0a347c4f47d0fd066538c7-C001-86",
"05b53f9e0a347c4f47d0fd066538c7-C001-88"
]
},
"@DIF@": {
"gold_contexts": [
[
"05b53f9e0a347c4f47d0fd066538c7-C001-57",
"05b53f9e0a347c4f47d0fd066538c7-C001-58",
"05b53f9e0a347c4f47d0fd066538c7-C001-59"
]
],
"cite_sentences": [
"05b53f9e0a347c4f47d0fd066538c7-C001-57"
]
},
"@EXT@": {
"gold_contexts": [
[
"05b53f9e0a347c4f47d0fd066538c7-C001-99"
]
],
"cite_sentences": [
"05b53f9e0a347c4f47d0fd066538c7-C001-99"
]
}
}
},
"ABC_f2b9a5633600cdf787111841bf9ce6_5": {
"x": [
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-2",
"text": "This paper reports the performances of shallow word-level convolutional neural networks (CNN), our earlier work (2015) [3, 4] , on the eight datasets with relatively large training data that were used for testing the very deep characterlevel CNN in Conneau et al. (2016) [1]."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-3",
"text": "Our findings are as follows."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-4",
"text": "The shallow word-level CNNs achieve better error rates than the error rates reported in [1] though the results should be interpreted with some consideration due to the unique pre-processing of [1]."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-5",
"text": "The shallow word-level CNN uses more parameters and therefore requires more storage than the deep character-level CNN; however, the shallow word-level CNN computes much faster."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-8",
"text": "Text categorization is the task of labeling documents, which has many important applications such as sentiment analysis and topic categorization."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-9",
"text": "Recently, several variations of convolutional neural networks (CNNs) [7] have been shown to achieve high accuracy on text categorization (see e.g., [3, 4, 9, 1] and references therein) in comparison with a number of methods including linear methods, which had long been the state of the art."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-10",
"text": "Long-Short Term Memory networks (LSTMs) [2] have also been shown to perform well on this task, rivaling or sometimes exceeding CNNs [5, 8] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-11",
"text": "However, CNNs are particularly attractive since, due to their simplicity and parallel processing-friendly nature, training and testing of CNNs can be made much faster than LSTM to achieve similar accuracy [5] , and therefore CNNs have a potential to scale better to large training data."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-12",
"text": "Here we focus on two CNN studies that report high performances on categorizing long documents (as opposed to categorizing individual sentences):"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-13",
"text": "\u2022 Our earlier work (2015) [3, 4] : shallow word-level CNNs (taking sequences of words as input), which we abbreviate as word-CNN."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-14",
"text": "\u2022 Conneau et al. (2016) [1]: very deep character-level CNNs (taking sequences of characters as input), which we abbreviate as char-CNN."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-15",
"text": "Although both studies report higher accuracy than previous work on their respective datasets, it is not clear how they compare with each other due to lack of direct comparison."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-16",
"text": "In [1] , the very deep char-CNN was shown to perform well with larger training data (up to 2.6M documents) but perform relatively poorly with smaller training data; e.g., it underperformed linear methods when trained with 120K documents."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-17",
"text": "In [3, 4] the shallow word-CNN was shown to perform well, using training sets (most intensively, 25K documents) that are mostly smaller than those used in [1] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-18",
"text": "While these results imply that the shallow word-CNN is likely to outperform the deep char-CNN when trained with relatively small training sets such as those used in [3, 4] , the shallow word-CNN is untested on the training sets as large as those used in [1] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-19",
"text": "Hence, the purpose of this report is to fill the gap by testing the shallow word-CNNs as in [3, 4] on the datasets used in [1] , for direct comparison with the results of very deep char-CNNs reported in [1] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-20",
"text": "Limitation of work In this work, our new experiments are limited to the shallow word-CNN as in [3, 4] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-21",
"text": "We do not provide new error rate results for the very deep CNNs proposed by [1] , and we only cite their results."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-22",
"text": "Although it may be natural to assume that the error rates reported in [1] well represent the best performance that the deep char-CNNs can achieve, we note that in [1] , documents were clipped and padded so that they all became 1014 characters long, and we do not know how this pre-processing affected their model accuracy."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-23",
"text": "To experiment with word-CNN, we handle variable-sized documents as variable-sized as we see no merit in making them fixed-sized, though we reduce the size of vocabulary to reduce storage requirements."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-24",
"text": "Considering that, we emphasize that this work is not intended to be a rigorous comparison of word-CNNs and char-CNNs; instead, it should be regarded as a report on the shallow word-CNN performance on the eight datasets used in [1] , referring to the results in [1] as the state-of-the-art performances."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-26",
"text": "**PRELIMINARY**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-27",
"text": "We start with briefly reviewing the very deep word-CNN of [1] and the shallow word-CNN of [3, 4] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-29",
"text": "**VERY DEEP CHARACTER-LEVEL CNNS OF [1]**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-30",
"text": "[1] proposed very deep char-CNNs and showed that their best performing models produced higher accuracy than their shallower models and previous deep char-CNNs of [9] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-31",
"text": "Their best architecture consisted of the following:"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-32",
"text": "\u2022 Character embedding of 16 dimensions."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-33",
"text": "\u2022 29 convolution layers with the number of feature maps being 64, 128, 256, and 512."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-34",
"text": "\u2022 Two fully-connected layers with 2048 hidden units each, following the 29 convolution layers."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-35",
"text": "\u2022 One of the following three methods for downsampling to halve the temporal size: setting stride to 2 in the convolution layer, k-max pooling, or max-pooling with stride 2."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-36",
"text": "Downsampling was done whenever the number of feature maps was doubled."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-37",
"text": "\u2022 k-max pooling with k=8 to produce 4096-dimensional input (per document) to the fully-connected layer."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-38",
"text": "\u2022 Batch normalization."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-39",
"text": "The kernel size ('region size' in our wording) was set to 3 in every convolution layer."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-40",
"text": "In addition, the results obtained by two more shallower architectures were reported."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-41",
"text": "[1] should be consulted for the exact architectures."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-43",
"text": "**SHALLOW WORD-LEVEL CNNS AS IN [3, 4]**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-44",
"text": "Two types of word-CNN were proposed in [3, 4] , which are illustrated in Figure 1 ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-45",
"text": "One is a straightforward application of CNN to text (the base model), and the other involves training of tv-embedding ('tv' stands for two views) to produce additional input to the base model."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-46",
"text": "The models with tv-embedding produce higher accuracy provided that sufficiently large amounts of unlabeled data for tv-embedding learning are available."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-68",
"text": "That is, our one-hot vectors were 30K-dimensional while any out-of-vocabulary word was converted to a zero vector, and the region embedding f (x) produced 500-dimensional vectors for each region."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-47",
"text": "As discussed in [5] , the shallow word-CNN can be regarded as a special case of a general framework which jointly trains a linear model with a non-linear feature generator consisting of 'text region embedding + pooling', where text region embedding is a loose term for a function that converts regions of text (word sequences such as \"good buy\") to vectors while preserving information relevant to the task of interest."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-48",
"text": "Word-CNNs without tv-embedding (base model) In the simplest configuration of the shallow word-CNNs, the region embedding is in the form of"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-49",
"text": "where \u03c3 is a component-wise nonlinear function (typically \u03c3(x) = max(x, 0)), input x represents a text region via either the concatenation of one-hot vectors for the words in the region or the bow representation of the region, and weight matrix W and bias vector b (shared within a layer) are trained."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-50",
"text": "Note that when x is the concatenation of onehot vectors, Wx can be interpreted as summing position-sensitive word vectors, and when x is the bow representation of the region, Wx can be interpreted as summing position-insensitive word vectors."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-51",
"text": "Thus, in a sense, the region embedding f (x) above internally and implicitly includes word embedding, as opposed to having an external and A good buy ! Linear classifier"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-52",
"text": "Step 1. Train tv-embedding with two-view embedding learning objectives."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-53",
"text": "Step 2."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-54",
"text": "Train w/ target labels."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-55",
"text": "A good buy !"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-57",
"text": "**POOLING LINEAR CLASSIFIER**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-58",
"text": "Step 2."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-59",
"text": "Train w/ target labels."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-60",
"text": "(b) word-CNN with tv-embedding."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-61",
"text": "parameters W and b (shared within a layer) are trained, and \u03c3 is component-wise nonlinearity, typically \u03c3(x) = max(0, x)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-62",
"text": "In the base model in (a), input x is one-hot representation of each text region (e.g., \"good buy\")."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-63",
"text": "In (b) we first train tv-embedding with two-view embedding learning objectives and then use it to produce additional input to the base model."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-64",
"text": "explicit word embedding layer before a convolution layer as in, e.g., [6] , which makes x the concatenation of word vectors."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-65",
"text": "See also the supplementary material of [4] for the representation power analysis."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-66",
"text": "As illustrated in Figure 1 (a), f (x) is applied to the text regions at every location of a document (ovals in the figure), and pooling aggregates the resulting region vectors into a document vector, which is used as features by a linear classifier."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-67",
"text": "In our experiments with word-CNN without tv-embedding reported below, the one-hot representation used for x was fixed to the concatenation of one-hot vectors with a vocabulary of the 30K most frequent words, and the dimensionality of region embedding (i.e., the number of feature maps) was fixed to 500."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-69",
"text": "Region size (the number of words in each region) was chosen from {3,5}. Based on our previous work, we performed max-pooling with k pooling units (each of which covers 1/k of a document) while setting k = 1 on sentiment analysis datasets and choosing k from {1, 10} on the others."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-70",
"text": "The models described here also served as the base models of the word-CNN with tv-embedding described next."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-71",
"text": "Word-CNNs with tv-embedding Training of word-CNNs with tv-embedding is done in two steps, as shown in Figure 1 (b) ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-72",
"text": "First we train region tv-embedding ('tv' stands for two views) in the form of f (x) above, with a two-view embedding learning objective such as 'predict adjacent text regions (one view) based on a text region (the other view)'."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-73",
"text": "This training can be done with unlabeled data."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-74",
"text": "[4] provides the definition and theoretical analysis of tv-embeddings."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-75",
"text": "Next, we use the tv-embedding to produce additional input to the base model and train it with labeled data."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-76",
"text": "This model can be easily extended to use multiple tv-embeddings, each of which, for example, uses a distinct vector representation of region, and so the region embedding function in the final model (hollow ovals in Figure 1 (b) ) can be written as:"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-77",
"text": "is the output of the tv-embedding indexed by i applied to the corresponding text region."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-78",
"text": "In [4] , tv-embedding training was done using unlabeled data as an additional resource; therefore, the proposed models were semi-supervised models."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-79",
"text": "In the experiments reported below, due to the lack of standard unlabeled data for the tested datasets, we trained tv-embeddings on the labeled training data ignoring the labels; thus, the resulting models are supervised ones."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-80",
"text": "We trained four tv-embeddings with four distinct one-hot representations of text regions (i.e., input to orange ovals in Figure 1 (b) ): bow representation with region size 5 or 9, and bag-of-{1,2,3}-gram representation with region size 5 or 9."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-81",
"text": "To make bow representation for tv-embedding, we used a vocabulary of the 30K most frequent words, and to make the bag-of-{1,2,3}-gram representation, we used a vocabulary of the 200K most frequent {1,2,3}-grams."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-82",
"text": "The dimensionality of tv-embeddings was 300 unless specified otherwise, and the dimensionality of g(\u00b7) was 500 (as in the base model); thus, we note that the dimensionality of internal vectors are comparable to those of the deep char-CNN of [1] , which are 64, 128, 256, and 512 as shown below."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-83",
"text": "The rest of the setting was the same as the base model above."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-84",
"text": "Other two-step approaches Another two-step approach with word-CNNs was studied by [6] , where the first step is pre-training of the word embedding layer (substituted by use of public word vectors in [6] ), which is followed by a convolution layer."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-85",
"text": "One potential advantage of our tv-embedding learning is that it can learn more complex information (embedding of word sequences) than word embedding (embedding of single words in isolation)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-87",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-88",
"text": "We report the experimental results of the shallow word-CNNs in comparison with the results reported in [1] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-89",
"text": "The experiments can be reproduced using the code available at riejohnson.com/cnn_download.html."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-91",
"text": "**DATA AND DATA PREPROCESSING**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-92",
"text": "The eight datasets used in [1] are summarized in Table 1 (a)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-93",
"text": "AG and Sogou are news, Dbpedia is an ontology, and Yelp and Amazon (abbreviated as 'Ama') are reviews."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-94",
"text": "'.p' (polarity) in the names of review datasets indicates that labels are either positive or negative, and '.f' (full) indicates that labels represent the number of stars."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-95",
"text": "Yahoo contains questions and answers from the 'Yahoo! Answers' website."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-96",
"text": "On all datasets, classes are balanced."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-97",
"text": "Sogou consists of Romanized Chinese."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-98",
"text": "The others are in English though some contain characters of other languages (e.g., Chinese, Korean) in small proportions."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-99",
"text": "To experiment with the deep char-CNNs, [1] converted upper-case letters to lower-case letters and used 72 characters (the lower-case alphabet, digits, special characters, and special tokens for padding and out-of-vocabulary characters)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-100",
"text": "They padded the input text with a special token to a fixed size of 1014."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-101",
"text": "To experiment with the shallow word-CNNs, we also converted upper-case letters to lower-case letters."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-102",
"text": "Unlike [1], we handled variable-sized documents as variable-sized without any shortening or padding; however, we limited the vocabulary size to 30K words and 200K {1,2,3}-grams, as described above."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-103",
"text": "To put it into perspective, the size of the complete word vocabulary of the largest training set (Ama.p) is 1.3M, and when limited to the words with frequency no less than 5, it is 221K."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-104",
"text": "By comparison, a vocabulary of 30K may seem rather small, but it covers about 98% of the text of Ama.p, and it appears to be sufficient for obtaining good accuracy."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-106",
"text": "**EXPERIMENTAL DETAILS OF WORD-LEVEL CNNS**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-107",
"text": "On all datasets, we held out 10K data points from the training set for use as validation data."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-108",
"text": "Models were trained on the training set minus the validation data, and model selection (or hyper-parameter tuning) was done based on performance on the validation data."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-109",
"text": "Tv-embedding training was done as in [4] ; weighted square loss was minimized without regularization while the target regions (adjacent regions) were represented by bow vectors, and the data weights were set so that the negative sampling effect was achieved."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-110",
"text": "Tv-embeddings were fixed (i.e., no weight updating) during the final training with labeled data."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-111",
"text": "Training with labels (either with or without tv-embedding) was done as follows."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-112",
"text": "A log loss (or cross entropy) with softmax was minimized."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-113",
"text": "Optimization was done by mini-batch SGD with momentum 0.9 and the mini-batch size was set to 100."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-114",
"text": "The number of epochs was fixed to 30 (except for AG, the smallest dataset, for which it was fixed to 100), and the learning rate was reduced once by multiplying it by 0.1 after 24 epochs (or 80 epochs on AG)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-115",
"text": "In all layers, weights were initialized from a Gaussian distribution with zero mean and standard deviation 0.01."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-116",
"text": "The initial learning rate was treated as a hyper-parameter."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-117",
"text": "Regularization was done by applying dropout with rate 0.5 to the input of the top layer and adding an L2 regularization term with parameter 0.0001 on the top-layer weights."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-118",
"text": "'depth' counts the hidden layers with weights in the longest path."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-119",
"text": "[9] reported the results of several linear methods, and we copied only the best results."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-120",
"text": "[1] reported the results of deep char-CNN with three downsampling methods, and we copied only the best results."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-121",
"text": "The word-CNN results are our new results."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-122",
"text": "The best (or second best) results are shown in bold (or italic) font, respectively."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-124",
"text": "**PERFORMANCE RESULTS**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-125",
"text": "Error rates: In Table 1 (b), we show the error rates of the shallow word-CNN in comparison with the best results of the deep char-CNN reported in [1] and the best results of linear models reported in [9] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-126",
"text": "On each dataset, the best results are shown in bold and the second best results are shown in the italic font."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-127",
"text": "On all datasets, the shallow word-CNN with tv-embeddings performs the best."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-128",
"text": "The second best performer is the shallow word-CNN without tv-embedding on all but Ama.f (Amazon full)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-129",
"text": "Whereas the deep char-CNN underperforms traditional linear models when training data is relatively small, the shallow word-CNNs with and without tv-embedding clearly outperform them on all the datasets."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-130",
"text": "We observe that, as in our previous work [4] , additional input produced by tv-embeddings led to substantial improvements."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-131",
"text": "The performances of word-CNN without tv-embedding might be further improved by having multiple region sizes [3, 6] , but for simplicity, we did not attempt it in this work."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-132",
"text": "Model size and computation time: In Table 2 , we observe that, compared with the deep char-CNN, the shallow word-CNN has more parameters but computes much faster."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-133",
"text": "Although the table shows computation time and error rates on one particular dataset (Yelp.f), the observation was the same on the other datasets."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-134",
"text": "The shallow word-CNN has more parameters because the number of parameters mostly depends on the vocabulary size, which is large with word-CNN (30K and 200K in our experiments) and small with char-CNN (72 in [1] )."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-135",
"text": "Nevertheless, computation of the shallow word-CNN can be made much faster than the deep char-CNN for three reasons."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-136",
"text": "First, with an implementation that handles sparse data efficiently, the computation of the shallow word-CNN does not depend on the vocabulary size."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-137",
"text": "For example, when x is the concatenation of p one-hot vectors of dimensionality v (the vocabulary size), the computation time of Wx (the most time-consuming step) depends not on v (e.g., 30K) but on p (e.g., 3), since we only need to multiply the nonzero elements of x by the corresponding weights in W. Second, character-based methods need to process about five times more text units than word-based methods; compare the rows of average length in words and in characters in Table 1 (b)."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-138",
"text": "Third, a deeper network is less friendly to parallel processing, since many layers have to be processed sequentially."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-139",
"text": "If we reduce the dimensionality of tv-embeddings from 300 to 100, the number of parameters can be reduced to a half with a small degradation of accuracy, as shown in Table 2. Table 3 reports the error rates of the shallow word-CNN with tv-embeddings of 100 dimensions ('w/ 4 tv (100-dim)')."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-140",
"text": "The 'w/ 4 tv (300-dim)' results were copied for easy comparison."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-141",
"text": "Reducing the number of tv-embeddings from four to two also reduces the number of parameters with a small degradation of accuracy ('w/ 2 tv (100-dim)' in Table 2 )."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-142",
"text": "----------------------------------"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-143",
"text": "**SUMMARY OF THE RESULTS**"
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-144",
"text": "\u2022 The shallow word-CNNs as in [3, 4] generally achieved better error rates than those of the very deep char-CNNs reported in [1] ."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-145",
"text": "\u2022 The shallow word-CNN computes much faster than the very deep char-CNN."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-146",
"text": "This is because the deep char-CNN needs to process more text units as there are many more characters than words per document, and because many layers need to be processed sequentially."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-147",
"text": "This is a practical advantage of the shallow word-CNN."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-148",
"text": "\u2022 The shallow word-CNNs use more parameters and therefore require more storage, which is a drawback in storage-tight situations."
},
{
"sent_id": "f2b9a5633600cdf787111841bf9ce6-C001-149",
"text": "Reducing the number and/or dimensionality of tv-embeddings reduces the number of parameters, though at the expense of a small degradation in accuracy."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f2b9a5633600cdf787111841bf9ce6-C001-2"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-9"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-12",
"f2b9a5633600cdf787111841bf9ce6-C001-13",
"f2b9a5633600cdf787111841bf9ce6-C001-14",
"f2b9a5633600cdf787111841bf9ce6-C001-15"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-17"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-18"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-27"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-43",
"f2b9a5633600cdf787111841bf9ce6-C001-44",
"f2b9a5633600cdf787111841bf9ce6-C001-45"
]
],
"cite_sentences": [
"f2b9a5633600cdf787111841bf9ce6-C001-2",
"f2b9a5633600cdf787111841bf9ce6-C001-9",
"f2b9a5633600cdf787111841bf9ce6-C001-13",
"f2b9a5633600cdf787111841bf9ce6-C001-17",
"f2b9a5633600cdf787111841bf9ce6-C001-18",
"f2b9a5633600cdf787111841bf9ce6-C001-27",
"f2b9a5633600cdf787111841bf9ce6-C001-43",
"f2b9a5633600cdf787111841bf9ce6-C001-44"
]
},
"@MOT@": {
"gold_contexts": [
[
"f2b9a5633600cdf787111841bf9ce6-C001-12",
"f2b9a5633600cdf787111841bf9ce6-C001-13",
"f2b9a5633600cdf787111841bf9ce6-C001-14",
"f2b9a5633600cdf787111841bf9ce6-C001-15"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-19"
]
],
"cite_sentences": [
"f2b9a5633600cdf787111841bf9ce6-C001-13",
"f2b9a5633600cdf787111841bf9ce6-C001-19"
]
},
"@SIM@": {
"gold_contexts": [
[
"f2b9a5633600cdf787111841bf9ce6-C001-20"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-130"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-144"
]
],
"cite_sentences": [
"f2b9a5633600cdf787111841bf9ce6-C001-20",
"f2b9a5633600cdf787111841bf9ce6-C001-130",
"f2b9a5633600cdf787111841bf9ce6-C001-144"
]
},
"@USE@": {
"gold_contexts": [
[
"f2b9a5633600cdf787111841bf9ce6-C001-65"
],
[
"f2b9a5633600cdf787111841bf9ce6-C001-109"
]
],
"cite_sentences": [
"f2b9a5633600cdf787111841bf9ce6-C001-65",
"f2b9a5633600cdf787111841bf9ce6-C001-109"
]
},
"@DIF@": {
"gold_contexts": [
[
"f2b9a5633600cdf787111841bf9ce6-C001-78",
"f2b9a5633600cdf787111841bf9ce6-C001-79"
]
],
"cite_sentences": [
"f2b9a5633600cdf787111841bf9ce6-C001-78"
]
}
}
},
"ABC_920f2b94270c0711fcc19ad23dbb0d_6": {
"x": [
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-28",
"text": "We discuss how moderation systems can be tuned, depending on the availability and workload of the moderators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-29",
"text": "We also introduce additional evaluation measures for the semi-automatic scenario."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-26",
"text": "When moderators are available, it is more realistic to develop semi- automatic systems aiming to assist, rather than replace the moderators, a scenario that has not been considered in previous work."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-27",
"text": "In this case, comments for which the system is uncertain (Fig. 1 ) are shown to a moderator to decide; all other comments are accepted or rejected by the system."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-2",
"text": "Experimenting with a new dataset of 1.6M user comments from a news portal and an existing dataset of 115K Wikipedia talk page comments, we show that an RNN operating on word embeddings outperforms the previous state of the art in moderation, which used logistic regression or an MLP classifier with character or word n-grams."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-3",
"text": "We also compare against a CNN operating on word embeddings, and a word-list baseline."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-4",
"text": "A novel, deep, classification-specific attention mechanism improves the performance of the RNN further, and can also highlight suspicious words for free, without including highlighted words in the training data."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-5",
"text": "We consider both fully automatic and semi-automatic moderation."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-8",
"text": "User comments play a central role in social media and online discussion fora."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-9",
"text": "News portals and blogs often also allow their readers to comment to get feedback, engage their readers, and build customer loyalty."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-10",
"text": "User comments, however, and more generally user content can also be abusive (e.g., bullying, profanity, hate speech) (Cheng et al., 2015) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-11",
"text": "Social media are under pressure to combat abusive content, but so far rely mostly on user reports and tools that detect frequent words and phrases of reported posts."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-12",
"text": "Wulczyn et al. (2017) estimated that only 17.9% of personal attacks in Wikipedia discussions were followed by moderator actions."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-13",
"text": "News portals also suffer from abusive user comments, which damage their reputations and make them liable to fines, e.g., when hosting comments encouraging illegal actions."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-14",
"text": "They often employ moderators, who are frequently overwhelmed, however, by the volume and abusiveness of comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-15",
"text": "Readers are disappointed when non-abusive comments do not appear quickly online because of moderation delays."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-16",
"text": "Smaller news portals may be unable to employ moderators, and some are forced to shut down their comments sections entirely."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-17",
"text": "We examine how deep learning (Goodfellow et al., 2016; Goldberg, 2016, 2017) can be employed to moderate user comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-18",
"text": "We experiment with a new dataset of approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-19",
"text": "1.6M manually moderated (accepted or rejected) user comments from a Greek sports news portal (called Gazzetta), which we make publicly available."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-20",
"text": "This is one of the largest publicly available datasets of moderated user comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-21",
"text": "We also provide word embeddings pre-trained on 5.2M comments from the same portal."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-22",
"text": "Furthermore, we experiment on the 'attacks' dataset of Wulczyn et al. (2017) , approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-23",
"text": "115K English Wikipedia talk page comments labeled as containing personal attacks or not."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-24",
"text": "In a fully automatic scenario, there is no moderator and a system accepts or rejects comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-25",
"text": "Although this scenario may be the only available one, e.g., when news portals cannot afford moderators, it is unrealistic to expect that fully automatic moderation will be perfect, because abusive comments may involve irony, sarcasm, harassment without profane phrases etc., which are particularly difficult for a machine to detect."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-30",
"text": "On both datasets (Gazzetta and Wikipedia comments) and for both scenarios (automatic, semiautomatic), we show that a recurrent neural network (RNN) outperforms the system of Wulczyn et al. (2017) , the previous state of the art for comment moderation, which employed logistic regression or a multi-layer Perceptron (MLP), and represented each comment as a bag of (character or word) n-grams."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-31",
"text": "We also propose an attention mechanism that improves the overall performance of the RNN."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-32",
"text": "Our attention mechanism differs from most previous ones (Bahdanau et al., 2015; Luong et al., 2015) in that it is used in a classification setting, where there is no previously generated output subsequence to drive the attention, unlike sequence-to-sequence models (Sutskever et al., 2014) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-33",
"text": "In that sense, our attention is similar to that of Yang et al. (2016) , but our attention mechanism is a deeper MLP and it is only applied to words, whereas Yang et al. also have a second attention mechanism that assigns attention scores to entire sentences."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-34",
"text": "In effect, our attention detects the words of a comment that affect most the classification decision (accept, reject), by examining them in the context of the particular comment."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-35",
"text": "Although our attention mechanism does not always improve the performance of the RNN, it has the additional advantage of allowing the RNN to highlight suspicious words that a moderator could consider to decide more quickly if a comment should be accepted or rejected."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-36",
"text": "The highlighting comes for free, i.e., the training data do not contain highlighted words."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-37",
"text": "We also show that words highlighted by the attention mechanism correlate well with words that moderators would highlight."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-38",
"text": "Our main contributions are: (i) We release a dataset of 1.6M moderated user comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-39",
"text": "(ii) We introduce a novel, deep, classification-specific attention mechanism and we show that an RNN with our attention mechanism outperforms the previous state of the art in user comment moderation."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-40",
"text": "(iii) Unlike previous work, we also consider a semiautomatic scenario, along with threshold tuning and evaluation measures for it."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-41",
"text": "(iv) We show that the attention mechanism can automatically highlight suspicious words for free, without manually highlighting words in the training data."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-43",
"text": "**DATASETS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-44",
"text": "We first discuss the datasets we used, to help acquaint the reader with the problem."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-46",
"text": "**GAZZETTA COMMENTS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-47",
"text": "There are approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-48",
"text": "1.45M training comments (covering Jan. 1, 2015 to Oct. 6, 2016) in the Gazzetta dataset; we call them G-TRAIN-L (Table 1) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-49",
"text": "Some experiments use only the first 100K comments of G-TRAIN-L, called G-TRAIN-S. An additional set of 60,900 comments (Oct. 7 to Nov. 11, 2016) was split into a development set (G-DEV, 29,700 comments), a large test set (G-TEST-L, 29,700), and a small test set (G-TEST-S, 1,500)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-50",
"text": "Gazzetta's moderators (2 full-time, plus journalists occasionally helping) are occasionally instructed to be stricter (e.g., during violent events)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-51",
"text": "To get a more accurate view of performance in normal situations, we manually re-moderated (labeled as 'accept' or 'reject') the comments of G-TEST-S, producing G-TEST-S-R. The reject ratio is approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-52",
"text": "30% in all subsets, except for G-TEST-S-R where it drops to 22%, because there are no occasions where the moderators were instructed to be stricter in G-TEST-S-R. Each G-TEST-S-R comment was re-moderated by five annotators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-53",
"text": "Krippendorff's (2004) alpha was 0.4762, close to the value (0.45) reported by Wulczyn et al. (2017) for the Wikipedia 'attacks' dataset."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-54",
"text": "Using Cohen's Kappa (Cohen, 1960) , the mean pairwise agreement was 0.4749."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-55",
"text": "The mean pairwise percentage of agreement (% of comments each pair of annotators agreed on) was 81.33%."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-56",
"text": "Cohen's Kappa and Krippendorff's alpha lead to lower scores, because they account for agreement by chance, which is high when there is class imbalance (22% reject, 78% accept in G-TEST-S-R)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-57",
"text": "During the re-moderation of G-TEST-S-R, the annotators were also asked to highlight snippets they considered suspicious, i.e., words or phrases that could lead a moderator to consider rejecting each comment."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-58",
"text": "We also asked the annotators to classify each snippet into one of the following categories: calumniation (e.g., false accusations), discrimination (e.g., racism), disrespect (e.g., looking down at a profession), hooliganism (e.g., calling for violence), insult (e.g., making fun of appearance), irony, swearing, threat, other."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-59",
"text": "Figure 2 shows how many comments of G-TEST-S-R contained at least one snippet of each category, according to the majority of annotators; e.g., a comment counts as containing irony if at least 3 annotators annotated it with an irony snippet (not necessarily the same)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-60",
"text": "The gold class of each comment (accept or reject) is determined by the majority of the annotators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-61",
"text": "Irony and disrespect are particularly frequent in both classes, followed by calumniation, swearing, hooliganism, insults."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-62",
"text": "Notice that comments that contain irony, disrespect etc."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-63",
"text": "are not necessarily rejected."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-64",
"text": "They are, however, more likely in the rejected class, considering that the accepted comments are 2.5 times more than the rejected ones (78% vs. 22%)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-65",
"text": "We also provide 300-dimensional word embeddings, pre-trained on approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-66",
"text": "5.2M comments (268M tokens) from Gazzetta using WORD2VEC (Mikolov et al., 2013a,b) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-67",
"text": "This larger dataset cannot be used to directly train classifiers, because most of its comments are from a period (before 2015) when Gazzetta did not employ moderators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-69",
"text": "**WIKIPEDIA COMMENTS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-70",
"text": "The Wikipedia 'attacks' dataset (Wulczyn et al., 2017) contains approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-71",
"text": "115K English Wikipedia talk page comments, which were labeled as containing personal attacks or not."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-72",
"text": "Each comment was labeled by at least 10 annotators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-73",
"text": "Inter-annotator agreement, measured on a random sample of 1K comments using Krippendorff's (2004) alpha, was 0.45."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-74",
"text": "The gold label of each comment is determined by the majority of annotators, leading to binary labels (accept, reject)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-75",
"text": "Alternatively, the gold label is the percentage of annotators that labeled the comment as 'accept' (or 'reject'), leading to probabilistic labels."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-76",
"text": "The dataset is split into three parts (Table 1) : training (W-ATT-TRAIN, 69,526 comments), development (W-ATT-DEV, 23,160), and test (W-ATT-TEST, 23,178)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-77",
"text": "In all three parts, the rejected comments are 12%, but this is an artificial ratio (Wulczyn et al. oversampled comments posted by banned users)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-78",
"text": "By contrast, the ratio of rejected comments in all the Gazzetta subsets is the truly observed one."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-79",
"text": "The Wikipedia comments are also longer (median length 38 tokens) compared to Gazzetta's (median length 25 tokens)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-80",
"text": "Wulczyn et al. (2017) also provide two additional datasets of English Wikipedia talk page comments, which are not used in this paper."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-81",
"text": "The first one, called 'aggression' dataset, contains the same comments as the 'attacks' dataset, now labeled as 'aggressive' or not."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-82",
"text": "The (probabilistic) labels of the 'attacks' and 'aggression' datasets are very highly correlated (0.8992 Spearman, 0.9718 Pearson) and we did not consider the aggression dataset any further."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-83",
"text": "The second additional dataset, called 'toxicity' dataset, contains approx."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-84",
"text": "160K comments labeled as being toxic or not."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-85",
"text": "Experiments we reported elsewhere (Pavlopoulos et al., 2017) show that results on the 'attacks' and 'toxicity' datasets are very similar; we do not include results on the latter in this paper to save space."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-87",
"text": "**METHODS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-88",
"text": "We experimented with an RNN operating on word embeddings, the same RNN enhanced with our attention mechanism (a-RNN), a vanilla convolutional neural network (CNN) also operating on word embeddings, the DETOX system of Wulczyn et al. (2017) , and a baseline that uses word lists."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-90",
"text": "**DETOX**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-91",
"text": "DETOX (Wulczyn et al., 2017) was the previous state of the art in comment moderation, in the sense that it had the best reported results on the Wikipedia datasets (Section 2.2), which were in turn the largest previously available public datasets of moderated user comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-92",
"text": "DETOX represents each comment as a bag of word n-grams (n \u2264 2, each comment becomes a bag containing its 1-grams and 2-grams) or a bag of character n-grams (n \u2264 5, each comment becomes a bag containing character 1-grams, . . . , 5-grams)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-93",
"text": "DETOX can rely on a logistic regression (LR) or MLP classifier, and it can use binary or probabilistic gold labels (Section 2.2) during training."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-94",
"text": "We used the DETOX implementation provided by Wulczyn et al. and the same grid search (and code) to tune the hyper-parameters of DETOX that select word or character n-grams, classifier (LR or MLP), and gold labels (binary or probabilistic)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-95",
"text": "For Gazzetta, only binary gold labels were possible, since G-TRAIN-L and G-TRAIN-S have a single gold label per comment."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-96",
"text": "Unlike Wulczyn et al., we tuned the hyper-parameters by evaluating (computing AUC and Spearman, Section 4) on a random 2% of held-out comments of W-ATT-TRAIN or G-TRAIN-S, instead of the development subsets, to be able to obtain more realistic results from the development sets while developing the methods."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-97",
"text": "For both Wikipedia and Gazzetta, the tuning selected character n-grams, as in the work of Wulczyn et al. Also, for both Wikipedia and Gazzetta, it preferred LR to MLP, whereas Wulczyn et al. reported slightly higher performance for the MLP on W-ATT-DEV."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-98",
"text": "[Footnote 8] Two of the co-authors of Wulczyn et al. (2017) are with Jigsaw, who recently announced Perspective, a system to detect 'toxic' comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-99",
"text": "Perspective is not the same as DETOX (personal communication), but we were unable to obtain scientific articles describing it."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-100",
"text": "An API for Perspective is available at https://www.perspectiveapi.com/, but we did not have access to the API at the time the experiments of this paper were carried out."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-102",
"text": "The tuning also selected probabilistic labels for Wikipedia, as in the work of Wulczyn et al."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-104",
"text": "**RNN-BASED METHODS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-105",
"text": "The RNN method is a chain of GRU cells (Cho et al., 2014) that transforms the tokens w_1, . . . , w_k of each comment to the hidden states h_1, . . . , h_k, followed by an LR layer that uses h_k to classify the comment (accept, reject)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-106",
"text": "Formally, given the vocabulary V, a matrix E \u2208 R^{d \u00d7 |V|} containing d-dimensional word embeddings, an initial h_0, and a comment c = w_1, . . . , w_k, the RNN computes h_1, . . . , h_k as follows (h_t \u2208 R^m): r_t = \u03c3(W_r x_t + U_r h_{t-1}); z_t = \u03c3(W_z x_t + U_z h_{t-1}); h\u0303_t = tanh(W_h x_t + U_h (r_t \u2299 h_{t-1})); h_t = (1 \u2212 z_t) \u2299 h_{t-1} + z_t \u2299 h\u0303_t."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-107",
"text": "where h\u0303_t \u2208 R^m is the proposed hidden state at position t, obtained by considering the word embedding x_t of token w_t and the previous hidden state h_{t-1}; \u2299 denotes element-wise multiplication; r_t \u2208 R^m is the reset gate (for r_t all zeros, it allows the RNN to forget the previous state h_{t-1}); z_t \u2208 R^m is the update gate (for z_t all zeros, it allows the RNN to ignore the new proposed h\u0303_t, hence also x_t, and copy h_{t-1} as h_t); and \u03c3 is the sigmoid function."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-108",
"text": "Once h_k has been computed, the LR layer estimates the probability that comment c should be rejected as P_RNN(reject|c) = \u03c3(W_p h_k + b_p), with W_p \u2208 R^{1 \u00d7 m}, b_p \u2208 R."
},
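The GRU recurrence and LR layer described in the records above can be sketched in pure Python. This is a toy, scalar-state illustration; the weight names and values are assumptions made for the sketch, not the authors' implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, W):
    # Gates follow the description above: r_t resets (forgets) h_{t-1},
    # while z_t = 0 copies h_{t-1} unchanged, ignoring the proposed state.
    r = sigmoid(W["Wr"] * x + W["Ur"] * h_prev)
    z = sigmoid(W["Wz"] * x + W["Uz"] * h_prev)
    h_tilde = math.tanh(W["Wh"] * x + W["Uh"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde

def p_reject(embeddings, W, w_p=1.5, b_p=-0.5):
    # Run the GRU chain over the comment's word embeddings, then apply the
    # LR layer: P(reject|c) = sigmoid(W_p h_k + b_p).
    h = 0.0
    for x in embeddings:
        h = gru_step(x, h, W)
    return sigmoid(w_p * h + b_p)

# Toy 1-dimensional weights (illustrative values only).
W = {"Wr": 1.0, "Ur": 0.5, "Wz": 1.0, "Uz": 0.5, "Wh": 1.0, "Uh": 0.5}
p = p_reject([0.2, -0.1, 0.9], W)
```

In the real model the states are m-dimensional vectors and the weights are learned matrices; the scalar case only illustrates the gating logic.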
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-109",
"text": "a-RNN: When the attention mechanism is added, the LR layer considers the weighted sum h_sum = \u03a3_{t=1}^{k} a_t h_t of all the hidden states (Eq. 1), instead of just h_k (Fig. 3)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-110",
"text": "The weights a_t are produced by an attention mechanism, which is an MLP with l layers: a_t^{(1)} = RELU(W^{(1)} h_t + b^{(1)}), . . . , a_t^{(l)} = W^{(l)} a_t^{(l-1)} + b^{(l)}, where Eq. 2 is the first of these layers. [Footnote 9: We repeated the tuning by evaluating on W-ATT-DEV, and again character n-grams with LR were selected.]"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-111",
"text": "[Footnote 10] We tried replacing the LR layer by a deeper classification MLP, and the RNN chain by a bidirectional RNN (Schuster and Paliwal, 1997), but there were no improvements."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-116",
"text": "The softmax operates across the a_t^{(l)} (t = 1, . . . , k), making the weights a_t sum to 1."
},
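The deep attention MLP and the resulting weighted sum h_sum can be sketched as follows; the layer sizes, weight values, and function names are illustrative assumptions, not the trained model:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def attention_weights(hidden_states, layers):
    # Each hidden state h_t is passed through l dense layers (RELU between
    # them), ending in a scalar score; a softmax across positions t = 1..k
    # turns the scores into weights a_t that sum to 1.
    scores = []
    for h in hidden_states:
        v = h
        for W, b in layers[:-1]:
            v = relu([sum(w * x for w, x in zip(row, v)) + bi
                      for row, bi in zip(W, b)])
        W, b = layers[-1]                      # last layer: scalar score
        scores.append(sum(w * x for w, x in zip(W[0], v)) + b[0])
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # softmax over positions
    z = sum(exps)
    return [e / z for e in exps]

def weighted_sum(hidden_states, a):
    # h_sum = sum_t a_t * h_t, fed to the LR layer instead of h_k.
    dim = len(hidden_states[0])
    return [sum(a_t * h[i] for a_t, h in zip(a, hidden_states))
            for i in range(dim)]

# Toy example: k = 3 positions, m = 2 dims, l = 2 layers.
hs = [[0.1, 0.4], [0.9, -0.2], [0.3, 0.3]]
layers = [([[1.0, 0.5], [0.2, -0.3]], [0.0, 0.1]),   # dense 2 -> 2
          ([[0.7, -0.4]], [0.0])]                    # dense 2 -> 1 (score)
a = attention_weights(hs, layers)
h_sum = weighted_sum(hs, a)
```

The da-CENT variant discussed later replaces the hidden states with the word embeddings themselves in the same computation.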
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-117",
"text": "Our attention mechanism differs from most previous ones (Mnih et al., 2014; Bahdanau et al., 2015; Xu et al., 2015; Luong et al., 2015) in that it is used in a classification setting, where there is no previously generated output subsequence (e.g., partly generated translation) to drive the attention (e.g., assign more weight to source words to translate next), unlike seq2seq models (Sutskever et al., 2014) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-118",
"text": "It assigns larger weights a t to hidden states h t corresponding to positions where there is more evidence that the comment should be accepted or rejected."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-119",
"text": "Yang et al. (2016) use a similar attention mechanism, but ours is deeper."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-120",
"text": "In effect they always set l = 2, whereas we allow l to be larger (tuning selects l = 4)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-121",
"text": "On the other hand, the attention mechanism of Yang et al. is part of a classification method for longer texts (e.g., product reviews)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-122",
"text": "Their method uses two GRU RNNs, both bidirectional (Schuster and Paliwal, 1997) , one turning the word embeddings of each sentence to a sentence embedding, and one turning the sentence embeddings to a document embedding, which is then fed to an LR layer."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-123",
"text": "Yang et al. use their attention mechanism in both RNNs, to assign attention scores to words and sentences."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-124",
"text": "We consider shorter texts (comments), we have a single RNN, and we assign attention scores to words only."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-125",
"text": "da-CENT: We also experiment with a variant of a-RNN, called da-CENT, which does not use the hidden states of the RNN."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-126",
"text": "The input to the first layer of the attention mechanism is now directly the embedding x_t instead of h_t (cf. Eq. 2), and h_sum is now the weighted sum (centroid) of the word embeddings, h_sum = \u03a3_{t=1}^{k} a_t x_t (cf. Eq. 1). [Footnote 11: Yang et al. use tanh instead of RELU in Eq. 2, which works worse in our case, and no bias b^{(l)} in the l-th layer.]"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-127",
"text": "[Footnote 12] We tried a bidirectional instead of unidirectional GRU chain in our methods, also replacing the LR layer by a deeper classification MLP, but there were no improvements."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-129",
"text": "We set l = 4, d = 300, r = m = 128, having tuned all hyper-parameters on the same 2% held-out comments of W-ATT-TRAIN or G-TRAIN-S that were used to tune DETOX."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-130",
"text": "We use Glorot initialization (Glorot and Bengio, 2010) , categorical cross-entropy loss, and Adam (Kingma and Ba, 2015)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-131",
"text": "[Footnote 14] Early stopping evaluates on the same held-out subsets."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-132",
"text": "For Gazzetta, word embeddings are initialized to the WORD2VEC embeddings we provide (Section 2.1)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-133",
"text": "For Wikipedia, they are initialized to GLOVE embeddings (Pennington et al., 2014) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-134",
"text": "In both cases, the embeddings are updated during backpropagation."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-135",
"text": "Out of vocabulary (OOV) words, meaning words for which we have no initial embeddings, are mapped to a single randomly initialized embedding, also updated."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-137",
"text": "**CNN**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-138",
"text": "We also compare against a vanilla CNN operating on word embeddings."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-139",
"text": "We describe the CNN only briefly, because it is very similar to that of Kim (2014); see also Goldberg (2016) for an introduction to CNNs, and Zhang and Wallace (2015)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-140",
"text": "For Wikipedia comments, we use a 'narrow' convolution layer, with kernels sliding (stride 1) over (entire) embeddings of word n-grams of sizes n = 1, . . . , 4."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-141",
"text": "We use 300 kernels for each n value, a total of 1,200 kernels."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-142",
"text": "The outputs of each kernel, obtained by applying the kernel to the different n-grams of a comment c, are then max-pooled, leading to a single output per kernel."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-143",
"text": "The resulting feature vector (1,200 max-pooled outputs) goes through a dropout layer (Hinton et al., 2012) (p = 0.5), and then to an LR layer, which provides P_CNN(reject|c)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-144",
"text": "For Gazzetta, the CNN is the same, except that n = 1, . . . , 5, leading to 1,500 features per comment."
},
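The narrow convolution with max-pooling described above can be sketched as follows; the kernel values and dimensions are toy assumptions, not trained parameters:

```python
def conv_max_pool(embeddings, kernels):
    # 'Narrow' convolution with stride 1: each kernel of width n is dotted
    # with every n-gram of word embeddings; the outputs are then max-pooled,
    # leading to a single feature per kernel.
    features = []
    for kernel in kernels:
        n = len(kernel)              # n-gram size covered by this kernel
        dim = len(kernel[0])         # embedding dimensionality
        outs = [sum(kernel[i][j] * embeddings[start + i][j]
                    for i in range(n) for j in range(dim))
                for start in range(len(embeddings) - n + 1)]
        features.append(max(outs) if outs else 0.0)
    return features

# Toy comment of 4 words with 2-dimensional embeddings; one unigram and one
# bigram kernel (illustrative values).
emb = [[0.1, 0.2], [0.5, -0.1], [0.0, 0.3], [0.4, 0.4]]
kernels = [[[1.0, -1.0]],                 # n = 1
           [[0.5, 0.5], [-0.5, 0.5]]]    # n = 2
feats = conv_max_pool(emb, kernels)
```

In the actual model there are 300 kernels per n value (1,200 or 1,500 features in total), followed by dropout and the LR layer.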
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-145",
"text": "All hyperparameters were tuned on the 2% held-out comments of W-ATT-TRAIN or G-TRAIN-S that were used to tune the other methods."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-146",
"text": "Again, we use 300-dimensional embeddings, which are now randomly initialized, since tuning indicated this was better than initializing to pre-trained embeddings."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-147",
"text": "OOV words are treated as in the RNN-based methods."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-148",
"text": "All embeddings are updated during backpropagation."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-149",
"text": "Early stopping evaluates on the held-out subsets."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-150",
"text": "Again, we use Glorot initialization, categorical cross-entropy loss, and Adam."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-151",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-152",
"text": "**LIST BASELINE**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-153",
"text": "A baseline, called LIST, collects every word w that occurs in more than 10 (for W-ATT-TRAIN, G-TRAIN-S) or 100 comments (for G-TRAIN-L) in the training set, along with the precision of w, i.e., the number of rejected training comments containing w divided by the total number of training comments containing w. The resulting lists contain 10,423, 16,864, and 21,940 word types, when using W-ATT-TRAIN, G-TRAIN-S, and G-TRAIN-L, respectively."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-154",
"text": "For a comment c, LIST returns as P_LIST(reject|c) the maximum precision of all the words in c."
},
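The LIST baseline's word-precision construction and scoring can be sketched directly from the description above (toy training data; all names are illustrative):

```python
from collections import Counter

def build_list(train, min_comments=10):
    # Precision list: for each word w occurring in more than `min_comments`
    # training comments, precision(w) = (# rejected comments containing w)
    # / (# comments containing w).
    total, rejected = Counter(), Counter()
    for words, is_rejected in train:
        for w in set(words):          # count comments, not occurrences
            total[w] += 1
            if is_rejected:
                rejected[w] += 1
    return {w: rejected[w] / total[w]
            for w, n in total.items() if n > min_comments}

def p_reject_list(comment_words, precision):
    # P_LIST(reject|c): the maximum precision over the comment's words
    # (0.0 if none of them is on the list).
    return max((precision[w] for w in comment_words if w in precision),
               default=0.0)

# Toy training set of (tokenized comment, was_rejected) pairs.
train = [(["bad", "dog"], True), (["bad", "ref"], True),
         (["good", "dog"], False), (["nice", "dog"], False)]
prec = build_list(train, min_comments=1)
```

Counting comments rather than token occurrences matters: a word repeated many times in one comment contributes only once to its precision estimate.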
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-156",
"text": "**TUNING THRESHOLDS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-157",
"text": "All methods produce a p = P(reject|c) per comment c. In semi-automatic moderation (Fig. 1), a comment is directly rejected if its p is above a rejection threshold t_r, it is directly accepted if p is below an acceptance threshold t_a, and it is shown to a moderator if t_a \u2264 p \u2264 t_r (gray zone of Fig. 4)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-158",
"text": "In our experience, moderators (or their employers) can easily specify the approximate percentage of comments they can afford to check manually (e.g., 20% daily) or, equivalently, the approximate percentage of comments the system should handle automatically. [Footnote 16: We implemented the CNN directly in TensorFlow.]"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-160",
"text": "We call coverage the latter percentage; hence, 1 \u2212 coverage is the approximate percentage of comments to be checked manually."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-161",
"text": "By contrast, moderators are baffled when asked to tune t r and t a directly."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-162",
"text": "Consequently, we ask them to specify the approximate desired coverage."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-163",
"text": "We then sort the comments of the development set (G-DEV or W-ATT-DEV) by p, and slide t_a from 0.0 to 1.0 (Fig. 4)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-164",
"text": "For each t_a value, we set t_r to the value that leaves a 1 \u2212 coverage percentage of development comments in the gray zone (t_a \u2264 p \u2264 t_r)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-165",
"text": "We then select the t_a (and t_r) that maximizes the weighted harmonic mean F_\u03b2(P_reject, P_accept) = (1 + \u03b2^2) P_reject P_accept / (\u03b2^2 P_reject + P_accept) on the development set,"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-166",
"text": "where P_reject is the rejection precision (correctly rejected comments divided by all rejected comments) and P_accept is the acceptance precision (correctly accepted divided by all accepted)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-167",
"text": "Intuitively, coverage sets the width of the gray zone, whereas P_reject and P_accept show how certain we can be that the red (reject) and green (accept) zones are free of misclassified comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-168",
"text": "We set \u03b2 = 2, emphasizing P_accept, because moderators are more worried about wrongly accepting abusive comments than wrongly rejecting non-abusive ones."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-169",
"text": "The selected t_a, t_r (tuned on development data) are then used in experiments on test data."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-170",
"text": "In fully automatic moderation, coverage = 100% and t_a = t_r; otherwise, threshold tuning is identical."
},
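The threshold-tuning procedure (slide t_a, derive t_r from the desired coverage, keep the pair maximizing F_beta) can be sketched as follows. This is a simplified reading of the procedure, without the per-batch macro-averaging of footnote 17, and the function names are assumptions:

```python
def f_beta(p_reject, p_accept, beta=2.0):
    # Weighted harmonic mean F_beta(P_reject, P_accept); beta = 2 weights
    # the acceptance precision more heavily.
    if p_reject == 0.0 or p_accept == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p_reject * p_accept / (b2 * p_reject + p_accept)

def tune_thresholds(probs, gold_reject, coverage):
    # Slide t_a over the sorted dev probabilities; for each t_a choose t_r
    # so that a 1 - coverage fraction of comments falls in the gray zone
    # (t_a <= p <= t_r), and keep the pair maximizing F_beta on the
    # automatically handled comments. coverage is a fraction in [0, 1].
    data = sorted(zip(probs, gold_reject))
    n = len(data)
    gray = round((1.0 - coverage) * n)   # comments left to the moderators
    best = (-1.0, 0.0, 1.0)              # (score, t_a, t_r)
    for i in range(n - gray + 1):
        t_a = data[i][0]                              # accept p < t_a
        t_r = data[i + gray - 1][0] if gray else t_a  # reject p > t_r
        acc = [g for p, g in data if p < t_a]
        rej = [g for p, g in data if p > t_r]
        p_acc = acc.count(False) / len(acc) if acc else 0.0
        p_rej = rej.count(True) / len(rej) if rej else 0.0
        score = f_beta(p_rej, p_acc)
        if score > best[0]:
            best = (score, t_a, t_r)
    return best[1], best[2]
```

With coverage = 1.0 the gray zone is empty and t_a = t_r, matching the fully automatic setting.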
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-172",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-174",
"text": "**COMMENT CLASSIFICATION EVALUATION**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-175",
"text": "Following Wulczyn et al. (2017), we report in Table 2 AUC scores (area under the ROC curve), along with Spearman correlations between system-generated probabilities P(accept|c) and human probabilistic gold labels (Section 2.2), when probabilistic gold labels are available."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-176",
"text": "Wulczyn et al. reported DETOX results only on W-ATT-DEV, shown in brackets."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-177",
"text": "[Footnote 17] More precisely, when computing F_\u03b2, we reorder the development comments by time posted, and split them into batches of 100."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-178",
"text": "For each t_a (and t_r) value, we compute F_\u03b2 per batch and macro-average across batches."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-179",
"text": "The resulting thresholds lead to F_\u03b2 scores that are more stable over time."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-180",
"text": "[Footnote 18] When computing AUC, the gold label is the majority label of the annotators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-181",
"text": "When computing Spearman, the gold label is probabilistic (% of annotators that accepted the comment)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-182",
"text": "The decisions of the systems are always probabilistic."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-183",
"text": "Table 2: Comment classification results."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-184",
"text": "Scores reported by Wulczyn et al. (2017) are shown in brackets."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-185",
"text": "Table 2 shows that RNN is always better than CNN and DETOX; there is no clear winner between CNN and DETOX."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-186",
"text": "Furthermore, a-RNN is always better than RNN on Gazzetta comments, but not on Wikipedia comments, where RNN is overall slightly better according to Table 2 ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-187",
"text": "Also, da-CENT is always worse than a-RNN and RNN, confirming that the hidden states (intuitively, context-aware word embeddings) of the RNN chain are important, even with the attention mechanism."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-188",
"text": "Increasing the size of the Gazzetta training set (G-TRAIN-S to G-TRAIN-L) significantly improves the performance of all methods."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-189",
"text": "The implementation of DETOX could not handle the size of G-TRAIN-L, which is why we do not report DETOX results for G-TRAIN-L. Notice also that the Wikipedia dataset is easier than the Gazzetta one: all methods perform better on Wikipedia comments than on Gazzetta comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-190",
"text": "Figure 5 shows F_2(P_reject, P_accept) on G-TEST-L and W-ATT-TEST, when t_a, t_r are tuned on G-DEV, W-ATT-DEV for varying coverage."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-191",
"text": "For G-TEST-L, we show results training on G-TRAIN-S (solid lines) and G-TRAIN-L (dotted)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-192",
"text": "The differences between RNN and a-RNN are again small, but it is now easier to see that a-RNN is overall better."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-193",
"text": "Again, a-RNN and RNN are better than CNN and DETOX."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-194",
"text": "All three deep learning methods benefit from the larger training set (dotted)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-195",
"text": "In Wikipedia, a-RNN obtains P_accept, P_reject \u2265 0.94 for all coverages (Fig. 5, call-outs)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-196",
"text": "On the more difficult Gazzetta dataset, a-RNN still obtains P_accept, P_reject \u2265 0.85 when tuned for 50% coverage."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-197",
"text": "When tuned for 100% coverage, comments for which the system is uncertain (gray zone) cannot be avoided, and there are inevitably more misclassifications. The use of F_2 during threshold tuning places more emphasis on avoiding wrongly accepted comments, leading to high P_accept (0.82) at the expense of wrongly rejected comments, i.e., sacrificing P_reject (0.59)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-198",
"text": "On the re-moderated G-TEST-S-R (similar diagrams, not shown), P_accept, P_reject become 0.96, 0.88 for coverage 50%, and 0.92, 0.48 for coverage 100%."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-199",
"text": "We also repeated the annotator ensemble experiment of Wulczyn et al. (2017) on 8K randomly chosen comments of W-ATT-TEST (4K comments from random users, 4K comments from banned users)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-200",
"text": "The decisions of 10 randomly chosen annotators (possibly different per comment) were used to construct the gold label of each comment."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-201",
"text": "The gold labels were then compared to the decisions of the systems and the decisions of an ensemble of k other annotators, k ranging from 1 to 10."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-202",
"text": "Table 3 shows the mean AUC and Spearman scores, averaged over 25 runs of the experiment, along with standard errors (in brackets)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-203",
"text": "We conclude that RNN and a-RNN are as good as an ensemble of 7 human annotators; CNN is as good as 4 annotators; DETOX is as good as 4 in AUC and 3 annotators in Spearman correlation, which is consistent with the results of Wulczyn et al. (2017) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-204",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-205",
"text": "**SNIPPET HIGHLIGHTING EVALUATION**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-206",
"text": "To investigate if the attention scores of a-RNN can highlight suspicious words, we focused on G-TEST-S-R, the only dataset with suspicious snippets annotated by humans."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-207",
"text": "We removed comments with no human-annotated snippets, leaving 841 comments (515 accepted, 326 rejected), a total of 40,572 tokens, of which 13,146 were inside a suspicious snippet of at least one annotator."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-208",
"text": "In each remaining comment, each token was assigned a gold suspiciousness score, defined as the percentage of annotators that included it in their snippets."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-209",
"text": "We evaluated three methods that score each token w t of a comment c for suspiciousness."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-210",
"text": "The first one assigns to each w_t the attention score a_t of a-RNN (trained on G-TRAIN-L). [Footnote 19: We used the protocol, code, and data of Wulczyn et al.]"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-211",
"text": "The second method assigns to each w t its precision, as computed by LIST (Section 3.4)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-212",
"text": "The third method (RAND) assigns to each w t a random (uniform distribution) score between 0 and 1."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-213",
"text": "In the latter two methods, a softmax is applied to the scores of all the tokens per comment, as in a-RNN."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-214",
"text": "Figure 6 shows three comments (from W-ATT-TEST) highlighted by a-RNN; heat corresponds to attention."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-215",
"text": "We computed Pearson and Spearman correlations between the gold suspiciousness scores and the scores of the three methods on the 40,572 tokens."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-216",
"text": "Figure 7 shows the correlations on comments that were accepted (left) and rejected (right) by the majority of moderators."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-217",
"text": "In both cases, a-RNN performs better than LIST and RAND by both Pearson and Spearman correlations."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-218",
"text": "The high Pearson correlations of a-RNN also show that its attention scores are to a large extent linearly related to the gold ones."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-219",
"text": "By contrast, LIST performs reasonably well in terms of Spearman correlation, but much worse in terms of Pearson, indicating that its precision scores rank reasonably well the tokens from most to least suspicious ones, but are not linearly related to the gold scores."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-220",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-221",
"text": "**RELATED WORK**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-222",
"text": "Djuric et al. (2015) experimented with 952K manually moderated comments from Yahoo Finance, but their dataset is not publicly available."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-223",
"text": "They convert each comment to a comment embedding using DOC2VEC (Le and Mikolov, 2014), which is then fed to an LR classifier."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-224",
"text": "Nobata et al. (2016) experimented with approximately 3.3M manually moderated comments from Yahoo Finance and News; their data are also not available. [Footnote 21: According to Nobata et al., their clean test dataset (2K comments) would be made available, but it is currently not.]"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-225",
"text": "They used Vowpal Wabbit [Footnote 22: see http://hunch.net/~vw/] with character n-grams (n = 3, . . . , 5) and word n-grams (n = 1, 2), hand-crafted features (e.g., number of capitalized or black-listed words), features based on dependency trees, averages of WORD2VEC embeddings, and DOC2VEC-like embeddings."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-226",
"text": "[Footnote 20] In innocent comments, a-RNN spreads its attention to all tokens, leading to quasi-uniform low color intensity."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-230",
"text": "Character n-grams were the best, on their own outperforming Djuric et al. (2015) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-231",
"text": "The best results, however, were obtained using all features."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-232",
"text": "We use no hand-crafted features and parsers, making our methods more easily portable to other domains and languages."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-233",
"text": "Mehdad and Tetreault (2016) train a (token- or character-based) RNN language model per class (accept, reject), and use the probability ratio of the two models to accept or reject user comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-234",
"text": "Experiments on the dataset of Djuric et al. (2015) , however, showed that their method (RNNLMs) performed worse than a combination of SVM and Naive Bayes classifiers (NBSVM) that used character and token n-grams."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-235",
"text": "An LR classifier operating on DOC2VEC-like comment embeddings (Le and Mikolov, 2014) also performed worse than NBSVM."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-236",
"text": "To surpass NBSVM, Mehdad et al. used an SVM to combine features from their three other methods (RNNLMs, LR with DOC2VEC, NBSVM)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-237",
"text": "Wulczyn et al. (2017) experimented with character and word n-grams."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-238",
"text": "We included their dataset and moderation system (DETOX) in our experiments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-239",
"text": "Waseem and Hovy (2016) used approximately 17K tweets annotated for hate speech."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-241",
"text": "Their best results were obtained using an LR classifier with character n-grams (n = 1, . . . , 4), plus gender."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-242",
"text": "Warner and Hirschberg (2012) aimed to detect anti-semitic speech, experimenting with 9K paragraphs and a linear SVM."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-243",
"text": "Their features consider windows of at most 5 tokens, examining the tokens of each window, their order, POS tags, Brown clusters etc., following Yarowsky (1994) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-244",
"text": "Cheng et al. (2015) aimed to predict which users would be banned from on-line communities."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-245",
"text": "Their best system used a random forest or LR classifier, with features examining readability, activity (e.g., number of posts daily), community and moderator reactions (e.g., up-votes, number of deleted posts)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-246",
"text": "Sood et al. (2012a; 2012b) experimented with 6.5K comments from Yahoo Buzz, moderated via crowdsourcing."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-247",
"text": "They showed that a linear SVM, representing each comment as a bag of word bigrams and stems, performs better than word lists."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-248",
"text": "Their best results were obtained by combining the SVM with a word list and edit distance."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-249",
"text": "Yin et al. (2009) used posts from chat rooms and discussion fora (<15K posts in total) to train an SVM to detect online harassment."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-250",
"text": "They used TF-IDF, sentiment, and context features (e.g., similarity to other posts in a thread)."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-251",
"text": "Our methods might also benefit by considering threads, rather than individual comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-252",
"text": "Yin et al. point out that, unlike other abusive content, spam in comments or discussion fora (Mishne et al., 2005; Niu et al., 2007) is off-topic and serves a commercial purpose."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-253",
"text": "Spam is unlikely in Wikipedia discussions and not an issue in the Gazzetta dataset (Fig. 2) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-254",
"text": "For a more extensive discussion of related work, consult Pavlopoulos et al. (2017) ."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-255",
"text": "----------------------------------"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-256",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-257",
"text": "We experimented with a new publicly available dataset of 1.6M moderated user comments from a Greek sports news portal and an existing dataset of 115K English Wikipedia talk page comments."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-258",
"text": "We showed that a GRU RNN operating on word embeddings outperforms the previous state of the art, which used an LR or MLP classifier with character or word n-gram features, also outperforming a vanilla CNN operating on word embeddings and a baseline that uses an automatically constructed word list with precision scores."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-259",
"text": "A novel, deep, classification-specific attention mechanism further improves the overall results of the RNN, and can also highlight suspicious words for free, i.e., without manually highlighted words in the training data."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-260",
"text": "We considered both fully automatic and semi-automatic moderation, along with threshold tuning and evaluation measures for both."
},
{
"sent_id": "920f2b94270c0711fcc19ad23dbb0d-C001-261",
"text": "We plan to consider user-specific information (e.g., ratio of comments rejected in the past) (Cheng et al., 2015; Waseem and Hovy, 2016) and to explore character-level RNNs or CNNs, e.g., as a first layer to produce embeddings of unknown words from characters (dos Santos and Zadrozny, 2014; Ling et al., 2015), which would then be passed on to our current methods that operate on word embeddings."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"920f2b94270c0711fcc19ad23dbb0d-C001-11",
"920f2b94270c0711fcc19ad23dbb0d-C001-12"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-53"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-70",
"920f2b94270c0711fcc19ad23dbb0d-C001-71",
"920f2b94270c0711fcc19ad23dbb0d-C001-72",
"920f2b94270c0711fcc19ad23dbb0d-C001-73",
"920f2b94270c0711fcc19ad23dbb0d-C001-74",
"920f2b94270c0711fcc19ad23dbb0d-C001-75",
"920f2b94270c0711fcc19ad23dbb0d-C001-76",
"920f2b94270c0711fcc19ad23dbb0d-C001-77",
"920f2b94270c0711fcc19ad23dbb0d-C001-78",
"920f2b94270c0711fcc19ad23dbb0d-C001-79"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-80",
"920f2b94270c0711fcc19ad23dbb0d-C001-81",
"920f2b94270c0711fcc19ad23dbb0d-C001-82",
"920f2b94270c0711fcc19ad23dbb0d-C001-83",
"920f2b94270c0711fcc19ad23dbb0d-C001-84"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-91",
"920f2b94270c0711fcc19ad23dbb0d-C001-92",
"920f2b94270c0711fcc19ad23dbb0d-C001-93"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-184"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-199",
"920f2b94270c0711fcc19ad23dbb0d-C001-200",
"920f2b94270c0711fcc19ad23dbb0d-C001-201"
]
],
"cite_sentences": [
"920f2b94270c0711fcc19ad23dbb0d-C001-12",
"920f2b94270c0711fcc19ad23dbb0d-C001-53",
"920f2b94270c0711fcc19ad23dbb0d-C001-70",
"920f2b94270c0711fcc19ad23dbb0d-C001-91",
"920f2b94270c0711fcc19ad23dbb0d-C001-184",
"920f2b94270c0711fcc19ad23dbb0d-C001-199"
]
},
"@USE@": {
"gold_contexts": [
[
"920f2b94270c0711fcc19ad23dbb0d-C001-22",
"920f2b94270c0711fcc19ad23dbb0d-C001-23"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-88"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-94"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-199",
"920f2b94270c0711fcc19ad23dbb0d-C001-200",
"920f2b94270c0711fcc19ad23dbb0d-C001-201"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-237",
"920f2b94270c0711fcc19ad23dbb0d-C001-238"
]
],
"cite_sentences": [
"920f2b94270c0711fcc19ad23dbb0d-C001-22",
"920f2b94270c0711fcc19ad23dbb0d-C001-88",
"920f2b94270c0711fcc19ad23dbb0d-C001-199"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"920f2b94270c0711fcc19ad23dbb0d-C001-30"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-97"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-175"
]
],
"cite_sentences": [
"920f2b94270c0711fcc19ad23dbb0d-C001-30",
"920f2b94270c0711fcc19ad23dbb0d-C001-97",
"920f2b94270c0711fcc19ad23dbb0d-C001-175"
]
},
"@DIF@": {
"gold_contexts": [
[
"920f2b94270c0711fcc19ad23dbb0d-C001-96"
]
],
"cite_sentences": []
},
"@SIM@": {
"gold_contexts": [
[
"920f2b94270c0711fcc19ad23dbb0d-C001-203"
],
[
"920f2b94270c0711fcc19ad23dbb0d-C001-237",
"920f2b94270c0711fcc19ad23dbb0d-C001-238"
]
],
"cite_sentences": [
"920f2b94270c0711fcc19ad23dbb0d-C001-203"
]
}
}
},
"ABC_754ceac25ff3a711ec3737e7eb860b_6": {
"x": [
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-2",
"text": "In this work, we present a novel neural network based architecture for inducing compositional crosslingual word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-3",
"text": "Unlike previously proposed methods, our method fulfills the following three criteria; it constrains the wordlevel representations to be compositional, it is capable of leveraging both bilingual and monolingual data, and it is scalable to large vocabularies and large quantities of data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-4",
"text": "The key component of our approach is what we refer to as a monolingual inclusion criterion, that exploits the observation that phrases are more closely semantically related to their sub-phrases than to other randomly sampled phrases."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-5",
"text": "We evaluate our method on a well-established crosslingual document classification task and achieve results that are either comparable, or greatly improve upon previous state-of-the-art methods."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-6",
"text": "Concretely, our method reaches a level of 92.7% and 84.4% accuracy for the English to German and German to English sub-tasks respectively."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-7",
"text": "The former advances the state of the art by 0.9% points of accuracy, the latter is an absolute improvement upon the previous state of the art by 7.7% points of accuracy and an improvement of 33.0% in error reduction."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-10",
"text": "Dense vector representations (embeddings) of words and phrases, as opposed to discrete feature templates, have recently allowed for notable advances in the state of the art of Natural Language Processing (NLP) (Socher et al., 2013; Baroni et al., 2014) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-11",
"text": "These representations are typically induced from large unannotated corpora by predicting a word given its context (Collobert & Weston, 2008) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-12",
"text": "Unlike discrete feature templates, these representations allow supervised methods to readily make use of unlabeled data, effectively making them semi-supervised (Turian et al., 2010) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-13",
"text": "A recent focus has been on crosslingual, rather than monolingual, representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-14",
"text": "Crosslingual representations are induced to represent words, phrases, or documents for more than one language, where the representations are constrained to preserve representational similarity or can be transformed between languages (Klementiev et al., 2012; Hermann & Blunsom, 2014) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-15",
"text": "In particular, crosslingual representations can be helpful for tasks such as translation or to leverage training data in a source language when little or no training data is available for a target language."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-16",
"text": "Examples of such transfer learning tasks are crosslingual sentiment analysis (Wan, 2009) and crosslingual document classification (Klementiev et al., 2012) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-17",
"text": "induced language-specific word representations, learned a linear mapping between the language-specific representations using bilingual word pairs and evaluated their approach for single word translation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-18",
"text": "Klementiev et al. (2012) used automatically aligned sentences and words to constrain word representations across languages based on the number of times a given word in one language was aligned to a word in another language."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-19",
"text": "They also introduced a dataset for crosslingual document classification and evaluated their work on this task."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-20",
"text": "Hermann & Blunsom (2014) introduced a method to induce compositional crosslingual word representations from sentence-aligned bilingual corpora."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-21",
"text": "Their method is trained to distinguish the sentence pairs given in a bilingual corpus from randomly generated pairs."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-22",
"text": "The model represents sentences as a function of their word representations, encouraging the word representations to be compositional."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-23",
"text": "Another approach has been to use auto-encoders and bag of words representations of sentences that can easily be applied to jointly leverage both bilingual and monolingual data (Chandar A P et al., 2014) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-24",
"text": "Most recently, Gouws et al. (2014) extended the Skip-Gram model of to be applicable to bilingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-25",
"text": "Just like the Skip-Gram model they predict a word in its context, but constrain the linear combinations of word representations from aligned sentences to be similar."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-26",
"text": "However, these previous methods all suffer from one or more of three short-comings."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-27",
"text": "Klementiev et al. (2012) ; ; Gouws et al. (2014) all learn their representations using a word-level monolingual objective."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-28",
"text": "This effectively means that compositionality is not encouraged by the monolingual objective, which may be problematic when composing word representations for a phrase or document-level task."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-29",
"text": "While the method of Hermann & Blunsom (2014) allows for arbitrary composition functions, they are limited to using sentence-aligned bilingual data and it is not immediately obvious how their method can be extended to make use of monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-30",
"text": "Lastly, while the method of Chandar A P et al. (2014) suffers from neither of the above issues, their method represents each sentence as a bag of words vector with the size of the whole vocabulary."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-31",
"text": "This leads to computational scaling issues and necessitates a vocabulary cut-off which may hamper performance for compounding languages such as German."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-32",
"text": "The question that we pose is thus, can a single method 1."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-33",
"text": "Constrain the word-level representations to be compositional."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-34",
"text": "2. Leverage both monolingual and bilingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-35",
"text": "3. Scale to large vocabulary sizes without greatly impacting training time."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-36",
"text": "In this work, we propose a neural network based architecture for creating crosslingual compositional word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-37",
"text": "The method is agnostic to the choice of composition function and combines a bilingual training objective with a novel way of training monolingual word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-38",
"text": "This enables us to draw from a plethora of unlabeled monolingual data, while our method is efficient enough to be trained using roughly seven million sentences in about six hours on a single-core desktop computer."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-39",
"text": "We evaluate our method on a well-established document classification task and achieve results for both sub-tasks that are either comparable or greatly improve upon the previous state of the art."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-40",
"text": "For the German to English sub-task our method achieves 84.4% in accuracy, an error reduction of 33.0% in comparison to the previous state of the art."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-42",
"text": "**MODEL**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-44",
"text": "**INDUCING CROSSLINGUAL WORD REPRESENTATIONS**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-45",
"text": "For any task involving crosslingual word representations we distinguish between two kinds of errors 1."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-46",
"text": "Transfer errors occur due to transferring representations between languages."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-47",
"text": "Ideally, expressions of the same meaning (words, phrases, or documents) should be represented by the same vectors, regardless of the language they are expressed in."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-48",
"text": "The more different these representations are from language 1 (l 1 ) to language 2 (l 2 ), the larger the transfer error."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-50",
"text": "2. Monolingual errors occur because the word, phrase or document representations within the same language are not expressive enough."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-51",
"text": "For example, in the case of classification this would mean that the representations do not possess enough discriminative power for a classifier to achieve high accuracy."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-52",
"text": "The way to attain high performance for any task that involves crosslingual word representations is to keep both transfer errors and monolingual errors to a minimum using representations that are both expressive and constrained crosslingually."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-54",
"text": "**CREATING REPRESENTATIONS FOR PHRASES AND DOCUMENTS**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-55",
"text": "Following the work of Klementiev et al. (2012) ; Hermann & Blunsom (2014) ; Gouws et al. (2014) we represent each word as a vector and use separate word representations for each language."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-56",
"text": "Like Hermann & Blunsom (2014) , we look up the vector representations for all words of a given sentence in the corresponding lookup table and apply a composition function to transform these word vectors into a sentence representation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-57",
"text": "To create document representations, we apply the same composition function again, this time to transform the representations of all sentences in a document to a document representation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-58",
"text": "For the majority of this work we will make use of the addition composition function, which can be written as the sum of all word representations w i in a given phrase"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-59",
"text": "To give an example of another possible candidate composition function, we also use the bigram based addition (Bi) composition function, formalized as"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-60",
"text": "where the hyperbolic tangent (tanh) is wrapped around every word bigram to produce intermediate results that are then summed up."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-61",
"text": "By introducing a non-linear function the Bi composition is no longer a bag-of-vectors function and takes word order into account."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-62",
"text": "Given that neither of the above composition functions involve any additional parameters, the only parameters of our model are in fact the word representations that are shared globally across all training samples."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-64",
"text": "**OBJECTIVE**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-65",
"text": "Following Klementiev et al. (2012) we split our objective into two sub-objectives, a bilingual objective minimizing the transfer errors and a monolingual objective minimizing the monolingual errors for l 1 and l 2 ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-66",
"text": "We formalize the loss over the whole training set as"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-67",
"text": "where L bi is the bilingual loss for two aligned sentences, v i is a sample from the set of N bi aligned sentences in language 1 and 2, L mono is the monolingual loss which we sum over N mono1 sentences x l1 i from corpora in language 1 and N mono2 sentences y l2 i from corpora in language 2."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-68",
"text": "We learn the parameters \u03b8, which represent the whole set of word representations for both l 1 and l 2 ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-69",
"text": "The parameters are used in a shared fashion to construct sentence representations for both the monolingual corpora and the parts of the bilingual corpus corresponding to each language."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-70",
"text": "We regularize \u03b8 using the squared euclidean norm and scale the contribution of the regularizer by \u03bb."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-71",
"text": "Both objectives operate on vectors that represent composed versions of phrases and are agnostic to how a phrase is transformed into a vector."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-72",
"text": "The objective can therefore be used with arbitrary composition functions."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-73",
"text": "An illustration of our proposed method can be found in Figure 1 ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-75",
"text": "**BILINGUAL OBJECTIVE**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-76",
"text": "Given a pair of aligned sentences, s"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-77",
"text": "Figure 1: An illustration of our method."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-78",
"text": "for any two vector representations v l1 and v l2 corresponding to the sentences of an aligned translation pair."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-79",
"text": "The bilingual objective on its own is degenerate, since setting the vector representations of all sentences to the same value poses a trivial solution."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-80",
"text": "We therefore combine this bilingual objective with a monolingual objective."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-82",
"text": "**MONOLINGUAL OBJECTIVE**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-83",
"text": "The choice of the monolingual objective greatly influences the generality of models for crosslingual word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-84",
"text": "Klementiev et al. (2012) use a neural language model to leverage monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-85",
"text": "However, this does not explicitly encourage compositionality of the word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-86",
"text": "Hermann & Blunsom (2014) achieve good results with a noise-contrastive objective, discriminating aligned translation pairs from randomly sampled pairs."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-87",
"text": "However, their approach can only be trained using sentence aligned data, which makes it difficult to extend to leverage unannotated monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-88",
"text": "Gouws et al. (2014) introduced BilBOWA combining a bilingual objective with the Skip-Gram model proposed by which predicts the context of a word given the word itself."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-89",
"text": "They achieve high accuracy on the German \u2192 English sub-task of the crosslingual document classification task introduced by Klementiev et al. (2012) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-90",
"text": "Chandar A P et al. (2014) presented a bag-of-words auto-encoder model which is the current state of the art for the English \u2192 German sub-task for the same task."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-91",
"text": "Both the auto-encoder based model and BilBOWA require a sentencealigned bilingual corpus, but in addition are capable of leveraging monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-92",
"text": "However, due to their bag-of-words based nature, their architectures implicitly restrict how sentence representations are composed from word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-93",
"text": "We extend the idea of the noise-contrastive objective given by Hermann & Blunsom (2014) to the monolingual setting and propose a framework that, like theirs, is agnostic to the choice of composition function and operates on the phrase level."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-94",
"text": "However, our framework, unlike theirs, is able to leverage monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-95",
"text": "Our key novel idea is based on the observation that phrases are typically more similar to their sub-phrases than to randomly sampled phrases."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-96",
"text": "We leverage this insight using the hinge loss as follows where m is a margin, a outer is a phrase sampled from a sentence, a inner is a sub-phrase of a outer and b noise is a phrase extracted from a sentence that was sampled uniformly from the corpus."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-97",
"text": "The start and end positions of both phrases and the sub-phrase were chosen uniformly at random within their context and constrained to guarantee a minimum length of 3 words."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-98",
"text": "Subscript c denotes that a phrase has been transformed into its vector representation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-99",
"text": "We add a outer c \u2212 a inner c 2 to the hinge loss to reduce the influence of the margin as a hyperparameter and to make sure that the we retain an error signal even after the hinge loss objective is satisfied."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-100",
"text": "To compensate for differences in phrase and sub-phrase length we scale the error by the ratio between the number of words in the outer phrase and the inner phrase."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-101",
"text": "Minimizing this objective captures the intuition stated above; a phrase should generally be closer to its sub-phrases, than to randomly sampled phrases."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-102",
"text": "The examples in Figure 2 seek to further clarify this observation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-103",
"text": "In both examples, the blue area represents the outer phrase (a outer ), the red area covers the inner sub-phrase (a inner ), and the gray area marks a randomly selected phrase in a randomly sampled noise sentence."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-104",
"text": "The inner workings of the monolingual inclusion objective only become clear when more than one example is considered."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-105",
"text": "In Example 1, a inner is embedded in the same context as in Example 2, while in both examples a outer is contrasted with the same noise phrase."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-106",
"text": "Minimizing the objective brings the representations of both likes to drink beer and likes to eat chips closer to the phrase they are embedded in and makes them less similar to the same noise sentence."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-107",
"text": "Since in both examples the outer phrases are very similar, this causes likes to drink beer and likes to eat chips to be similar."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-108",
"text": "While we picked idealized sentences for demonstration purposes, this relative notion still holds in practice to varying degrees depending on the choice of sentences."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-109",
"text": "In contrast to many recently introduced log-linear models, like the Skip-Gram model, where word vectors are similar if they appear as the center of similar word windows, our proposed objective, using addition for composition, encourages word vectors to be similar if they tend to be embedded in similar phrases."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-110",
"text": "The major difference between these two formulations manifests itself for words that appear close or next to each other very frequently."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-111",
"text": "These word pairs are not usually the center of the same word windows, but they are embedded together in the same phrases."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-112",
"text": "For example: the two word central context of \"eat\" is \"to\" and \"chips\", whereas the context of \"chips\" would be \"eat\" and \"when\"."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-113",
"text": "Using the Skip-Gram model this would cause \"chips\" and \"eat\" to be less similar, with \"chips\" probably being similar to other words related to food and \"eat\" being similar to other verbs."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-114",
"text": "Employing the inclusion objective, the representations for \"eat\" and \"chips\" will end up close to each other since they tend to be embedded in the same phrases."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-115",
"text": "This causes the word representations induced by the inclusion criterion to be more topical in nature."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-116",
"text": "We hypothesize that this property is particularly useful for document classification."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-118",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-120",
"text": "**CROSSLINGUAL DOCUMENT CLASSIFICATION**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-121",
"text": "Crosslingual document classification constitutes a task where a classifier is trained to classify documents in one language (l 1 ) and is later applied to documents in a different language (l 2 )."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-122",
"text": "This requires either transforming the classifier itself to fit the new language or transforming/sharing representations of the text for both languages."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-123",
"text": "The crosslingual word and document representations induced using the approach proposed in this work present an intuitive way to tackle crosslingual document classification."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-124",
"text": "Like previous work, we evaluate our method on the crosslingual document classification task introduced by Klementiev et al. (2012) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-125",
"text": "The goal is to correctly classify news articles taken from the English and German sections of the RCV1 and RCV2 corpus (Lewis et al., 2004) into one of four (Collins, 2002) for 10 iterations on representations of documents in one language (English/German) and evaluate its performance on representations of documents in the corresponding other language (German/English)."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-126",
"text": "We use the original data and the original implementation of the averaged perceptron used by Klementiev et al. (2012) to evaluate the document representations created by our method."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-127",
"text": "There are different versions of the training set of varying sizes, ranging from 100 to 10,000 documents, and the test sets for both languages contain 5,000 documents."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-128",
"text": "Most related work only reports results using the 1,000 documents sized training set."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-129",
"text": "Following previous work, we tune the hyperparameters of our model on held out documents in the same language that the model was trained on."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-131",
"text": "**INDUCING CROSSLINGUAL WORD REPRESENTATIONS**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-132",
"text": "To induce representations using the method proposed in this work, we require at least a bilingual corpus of aligned sentences."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-133",
"text": "In addition, our model allows the representations to draw upon monolingual data from either or both languages."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-134",
"text": "Like Klementiev et al. (2012) we choose EuroParl v7 (Koehn, 2005) as our bilingual corpus and leverage the English and German parts of the RCV1 and RCV2 corpora as monolingual resources."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-135",
"text": "To avoid a testing bias, we exclude all documents that are part of the crosslingual classification task."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-136",
"text": "We detect sentence boundaries using pre-trained models of the Punkt tokenizer (Kiss & Strunk, 2006) shipped with NLTK 1 and perform tokenization and lowercasing with the scripts deployed with the cdec decoder 2 ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-137",
"text": "Following Turian et al. (2010) we remove all English sentences (and their German correspondences in EuroParl) that have a lowercase nonlowercase ratio of less than 0.9."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-138",
"text": "This affects mainly headlines and reports with numbers."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-139",
"text": "In total it reduces the number of sentences in EuroParl by about 255, 000 and the English part of the Reuters corpus by about 8 million."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-140",
"text": "Since German features more upper case characters than English we set the cutoff ratio to 0.7, which reduces the number of sentences by around 620, 000."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-141",
"text": "Further, we replace words that occur less than a certain threshold by an UNK token."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-142",
"text": "Corpus statistics and thresholds are reported in Table 1 ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-143",
"text": "We initialize all word representations with noise samples from a Gaussian with \u00b5 = 0, \u03c3 = 0.1 and optimize them in a stochastic setting to minimize the objective defined in Equation 3."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-144",
"text": "To speed up the convergence of training we use AdaGrad (Duchi et al., 2011) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-145",
"text": "We tuned all hyperparameters of our model and explored learning rates around 0.2, mini-batch sizes around 40,000, hinge loss margins around 40 (since our vector dimensionality is 40) and \u03bb (regularization) around 1.0."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-146",
"text": "We trained all versions that use the full monolingual data for 25 iterations (= 25 \u00d7 4.5 million samples) and the versions only involving bilingual data for 100 iterations on their training sets."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-147",
"text": "Training our model, implemented in a high-level, dynamic programming language (Bezanson et al., 2012) , for the largest set of data takes roughly six hours on a single-core desktop computer."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-148",
"text": "This can be compared to for example Chandar A P et al. (2014) which train their auto-encoder model for 3.5 days."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-149",
"text": "Table 2 : Results for our proposed models, baselines, and related work."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-150",
"text": "All results are reported for a training set size of 1,000 documents for each language."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-151",
"text": "We refer to our proposed method as Binclusion."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-153",
"text": "**METHOD**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-154",
"text": "Training"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-156",
"text": "**CROSSLINGUAL DOCUMENT CLASSIFICATION**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-157",
"text": "We compare our method to various architectures introduced in previous work."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-158",
"text": "As these methods differ in their ability to handle monolingual data, we evaluate several versions of our model using different data sources and sizes for training."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-159",
"text": "Also, we follow the lines of previous work and use 40-dimensional word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-180",
"text": "This speaks strongly in favor of how our objectives complement each other, even though these words were only observed in the monolingual data they relate sensibly across languages."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-160",
"text": "We report results when using the first 500,000 sentence pairs of EuroParl (Euro500k), the full EuroParl corpus (EuroFull), the first 500,000 sentence pairs of EuroParl and the German and English text from the Reuters corpus as monolingual data (Euro500kReuters), and one version using the full EuroParl and Reuters corpus (EuroFullReuters)."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-161",
"text": "Table 2 shows results for all these configurations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-162",
"text": "The result table includes previous work as well as the Glossed, the machine translation and the majority class baselines from Klementiev et al. (2012) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-163",
"text": "Our method achieves results that are comparable or improve upon the previous state of the art for all dataset configurations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-164",
"text": "It advances the state of the art for the EN \u2192 DE sub-task by 0.9% points of accuracy and greatly outperforms the previous state of the art for the DE \u2192 EN sub-task, where it yields an absolute improvement of 7.7% points of accuracy."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-165",
"text": "The latter corresponds to an error reduction of 33.0% in comparison to the previous state of the art."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-166",
"text": "An important observation is that including monolingual data is strongly beneficial for the classification accuracy."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-167",
"text": "We found increases in performance to 80.6% for DE \u2192 EN and 88.6% accuracy for EN \u2192 DE, even when using as little as 5% of the monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-168",
"text": "We hypothesize that the key cause of this effect is domain adaptation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-169",
"text": "From this observation it is also worth pointing out that our method is on par with the previous state of the art for the DE \u2192 EN sub-task using no monolingual training data and would improve upon it using as little as 5% of the monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-170",
"text": "To show that our method achieves high accuracy even with a reduced vocabulary, we discard representations for infrequent terms and report results using our best setup with the same vocabulary size as Klementiev et al. (2012) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-172",
"text": "**INTERESTING PROPERTIES OF THE INDUCED CROSSLINGUAL WORD REPRESENTATIONS**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-173",
"text": "For a bilingual word representation model that makes use of monolingual data, the most difficult cases to resolve are words that appear in the monolingual data, but never in the bilingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-174",
"text": "When it comes to these words the model does not have any kind of direct signal regarding what translations they should correspond to."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-175",
"text": "Their location in the vector space is entirely determined by how the monolingual objective arranges them."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-176",
"text": "Therefore, looking specifically at these difficult examples presents a good way to get an impression of how well the monolingual and bilingual objective complement each other."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-177",
"text": "In Table 3 , we list some of the most frequently occurring words that are present in the monolingual data but not in the bilingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-178",
"text": "The nearest neighbors are topically strongly related to their corresponding queries."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-179",
"text": "For example, the credit-rating agency Standard & Poor's (s&p) is matched to rating-related words, soybeans is proximal to crop and food related terms, forex features a list of currency related terms, and the list for stockholders, includes aktion\u00e4re, its correct German translation."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-181",
"text": "To convey an impression of how the induced representations behave, not interlingually, but within the same language, we list some examples in Table 4 ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-182",
"text": "The semi-conductor chip maker intel, is very close to IT-related companies like ibm or netscape and also to microprocessor-related terms."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-183",
"text": "For the verb fly, the nearest neighbors not only include forms like flying, but also related nouns like airspace or air, underlining the topical nature of our proposed objective."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-185",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-186",
"text": "In this work we introduced a method that is capable of inducing compositional crosslingual word representations while scaling to large amounts of data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-187",
"text": "Our novel approach for learning monolingual word representations integrates naturally with our sentence based bilingual objective and allows us to make use of sentence-aligned bilingual corpora as well as monolingual data."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-188",
"text": "The method is agnostic to the choice of composition function, enabling more complex (e.g. preserving word order information) ways to compose phrase representations from word representations."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-189",
"text": "Depending on the amount of training data available the accuracy achieved with our models is comparable or greatly improves upon previously reported results for the crosslingual document classification task introduced by Klementiev et al. (2012) ."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-190",
"text": "To increase the expressiveness of our method we plan to investigate more complex composition functions, possibly based on convolution or other ways to preserve word order information."
},
{
"sent_id": "754ceac25ff3a711ec3737e7eb860b-C001-191",
"text": "We consider the monolingual inclusion objective to be worthy of further research on its own and will evaluate its performance in comparison to related methods when learning word representations from monolingual data."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"754ceac25ff3a711ec3737e7eb860b-C001-14",
"754ceac25ff3a711ec3737e7eb860b-C001-15"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-16"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-18",
"754ceac25ff3a711ec3737e7eb860b-C001-19"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-84",
"754ceac25ff3a711ec3737e7eb860b-C001-85"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-88",
"754ceac25ff3a711ec3737e7eb860b-C001-89"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-124",
"754ceac25ff3a711ec3737e7eb860b-C001-125"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-162",
"754ceac25ff3a711ec3737e7eb860b-C001-163"
]
],
"cite_sentences": [
"754ceac25ff3a711ec3737e7eb860b-C001-14",
"754ceac25ff3a711ec3737e7eb860b-C001-16",
"754ceac25ff3a711ec3737e7eb860b-C001-18",
"754ceac25ff3a711ec3737e7eb860b-C001-84",
"754ceac25ff3a711ec3737e7eb860b-C001-89",
"754ceac25ff3a711ec3737e7eb860b-C001-124",
"754ceac25ff3a711ec3737e7eb860b-C001-162"
]
},
"@MOT@": {
"gold_contexts": [
[
"754ceac25ff3a711ec3737e7eb860b-C001-26",
"754ceac25ff3a711ec3737e7eb860b-C001-27",
"754ceac25ff3a711ec3737e7eb860b-C001-28"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-84",
"754ceac25ff3a711ec3737e7eb860b-C001-85"
]
],
"cite_sentences": [
"754ceac25ff3a711ec3737e7eb860b-C001-27",
"754ceac25ff3a711ec3737e7eb860b-C001-84"
]
},
"@SIM@": {
"gold_contexts": [
[
"754ceac25ff3a711ec3737e7eb860b-C001-55"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-134",
"754ceac25ff3a711ec3737e7eb860b-C001-135"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-170"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-189"
]
],
"cite_sentences": [
"754ceac25ff3a711ec3737e7eb860b-C001-55",
"754ceac25ff3a711ec3737e7eb860b-C001-134",
"754ceac25ff3a711ec3737e7eb860b-C001-170",
"754ceac25ff3a711ec3737e7eb860b-C001-189"
]
},
"@USE@": {
"gold_contexts": [
[
"754ceac25ff3a711ec3737e7eb860b-C001-65"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-124",
"754ceac25ff3a711ec3737e7eb860b-C001-125"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-126",
"754ceac25ff3a711ec3737e7eb860b-C001-127",
"754ceac25ff3a711ec3737e7eb860b-C001-128",
"754ceac25ff3a711ec3737e7eb860b-C001-129"
],
[
"754ceac25ff3a711ec3737e7eb860b-C001-134",
"754ceac25ff3a711ec3737e7eb860b-C001-135"
]
],
"cite_sentences": [
"754ceac25ff3a711ec3737e7eb860b-C001-65",
"754ceac25ff3a711ec3737e7eb860b-C001-124",
"754ceac25ff3a711ec3737e7eb860b-C001-126",
"754ceac25ff3a711ec3737e7eb860b-C001-134"
]
}
}
},
"ABC_f1eae0918a246174b1866ba71d4efc_6": {
"x": [
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-2",
"text": "This paper gives an Abstract Categorial Grammar (ACG) account of (Kallmeyer and Kuhlmann, 2012)'s process of transformation of the derivation trees of Tree Adjoining Grammar (TAG) into dependency trees."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-3",
"text": "We make explicit how the requirement of keeping a direct interpretation of dependency trees into strings results into lexical ambiguity."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-4",
"text": "Since the ACG framework has already been used to provide a logical semantics from TAG derivation trees, we have a unified picture where derivation trees and dependency trees are related but independent equivalent ways to account for the same surface-meaning relation."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-7",
"text": "Tree Adjoining Grammars (TAG) (Joshi et al., 1975; Joshi and Schabes, 1997 ) is a tree grammar formalism relying on two operations between trees: substitution and adjunction."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-8",
"text": "In addition to the tree generated by a sequence of such operations, there is a derivation tree which records this sequence."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-9",
"text": "Derivation trees soon appeared as good candidates to encode semantic-like relations between the elementary trees they glue together."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-10",
"text": "However, some mismatch between these trees and the relative scoping of logical connectives and relational symbols, or between these trees and the dependency relations, have been observed."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-11",
"text": "Solving these problems often leads to modifications of derivation tree structures (Schabes and Shieber, 1994; Kallmeyer, 2002; Joshi et al., 2003; Rambow et al., 2001; Chen-Main and Joshi, To appear) ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-32",
"text": "It is a homomorphism 1 that maps types and terms built on \u03a3 to types and terms built on \u039e. We note t:= G u if L(t) = u and omit the G subscript if obvious from the context."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-12",
"text": "While alternative proposals have succeeded in linking derivation trees to semantic representations using unification (Kallmeyer and Romero, 2004; Kallmeyer and Romero, 2007) or using an encoding (Pogodalla, 2004; Pogodalla, 2009) of TAG into the ACG framework (de Groote, 2001) , only recently (Kallmeyer and Kuhlmann, 2012) has proposed a transformation from standard derivation trees to dependency trees."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-13",
"text": "This paper provides an ACG perspective on this transformation."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-14",
"text": "The goal is twofold."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-15",
"text": "First, it exhibits the underlying lexical blow up of the yield functions associated with the elementary trees in (Kallmeyer and Kuhlmann, 2012) ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-16",
"text": "Second, using the same framework as (Pogodalla, 2004; Pogodalla, 2009 ) allows us to have a shared perspective on a phrase-structure architecture and a dependency one and an equivalence on the surface-meaning relation they define."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-17",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-18",
"text": "**ABSTRACT CATEGORIAL GRAMMARS**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-19",
"text": "ACGs provide a framework in which several grammatical formalisms may be encoded (de Groote and Pogodalla, 2004) ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-20",
"text": "They generate languages of linear \u03bb-terms, which generalize both string and tree languages."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-21",
"text": "A key feature is to provide the user direct control over the parse structures of the grammar, the abstract language, which allows several grammatical formalisms to be defined in terms of ACG, in particular TAG (de Groote, 2002) ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-22",
"text": "We refer the reader to (de Groote, 2001; Pogodalla, 2009) for the details and introduce here only few relevant definitions and notations."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-23",
"text": "Definition."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-24",
"text": "A higher-order linear signature is defined to be a triple \u03a3 = A, C, \u03c4 , where:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-25",
"text": "\u2022 A is a finite set of atomic types (also noted A \u03a3 ), \u2022 C is a finite set of constants (also noted C \u03a3 ), \u2022 and \u03c4 is a mapping from C to T A the set of types built on A: T A ::= A|T A T A (also noted T \u03a3 )."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-26",
"text": "A higher-order linear signature will also be called a vocabulary."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-27",
"text": "\u039b(\u03a3) is the set of \u03bb-terms built on \u03a3, and for t \u2208 \u039b(\u03a3) and \u03b1 \u2208 T \u03a3 such that t has type \u03b1, we note t : \u03a3 \u03b1 (the \u03a3 subscript is omitted when it is obvious from the context)."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-28",
"text": "Definition."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-29",
"text": "An abstract categorial grammar is a quadruple G = \u03a3, \u039e, L, s where:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-30",
"text": "1. \u03a3 and \u039e are two higher-order linear signatures, which are called the abstract vocabulary and the object vocabulary, respectively; 2."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-31",
"text": "L : \u03a3 \u2212\u2192 \u039e is a lexicon from the abstract vocabulary to the object vocabulary."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-33",
"text": "3. s \u2208 T \u03a3 is a type of the abstract vocabulary, which is called the distinguished type of the grammar."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-34",
"text": "Since there is no structural difference between the abstract and the object vocabulary as they both are higher-order signatures, ACGs can be combined in different ways."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-35",
"text": "Either by having a same abstract vocabulary shared by several ACGs in order to make two object terms (for instance a string and a logical formula) share the same underlying structure as G d-ed trees and G Log in Fig. 1 . Or by making the abstract vocabulary of an ACG the object vocabulary of another ACG, allowing the latter to control the admissible structures of the former, as G yield and G d-ed trees in Fig. 1 ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-37",
"text": "**TAG AS ACG**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-38",
"text": "As Fig. 1 shows, the encoding of TAG into ACG uses two ACGs"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-39",
"text": "We exemplify the encoding 2 of a TAG analyzing (1) 3 1 In addition to defining L on the atomic types and on the constants of \u03a3, we have:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-40",
"text": ") with the proviso that for any constant c :"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-41",
"text": "2 We refer the reader to (Pogodalla, 2009 ) for the details."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-42",
"text": "3 The TAG literature typically uses this example, and (Kallmeyer and Kuhlmann, 2012) as well, to show the mismatch between the derivation trees and the expected se- This sentence is usually analyzed in TAG with a derivation tree where the to love component scopes over all the other arguments, and where claims and seems are unrelated, as Fig. 2(a) shows."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-43",
"text": "The three higher-order signatures are: \u03a3 der\u03b8 : Its atomic types include s, vp, np, s A , vp A . . . where the X types stand for the categories X of the nodes where a substitution can occur while the X A types stand for the categories X of the nodes where an adjunction can occur."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-44",
"text": "For each elementary tree \u03b3 lex."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-45",
"text": "entry it contains a constant C lex."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-46",
"text": "entry whose type is based on the adjunction and substitution sites as Table 1 shows."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-47",
"text": "It additionally contains constants I X : X A that are meant to provide a fake auxiliary tree on adjunction sites where no adjunction actually takes place in a TAG derivation."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-48",
"text": "\u03a3 trees : Its unique atomic type is \u03c4 the type of trees."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-49",
"text": "Then, for any X of arity n belonging to the ranked alphabet describing the elementary trees of the TAG, we have a constant"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-50",
"text": "Its unique atomic type is \u03c3 the type of strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-51",
"text": "The constants are the terminal symbols of the TAG (with type \u03c3), the concatenation + : \u03c3 \u03c3 \u03c3 and the empty string \u03b5 : \u03c3."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-52",
"text": "Table 1 illustrates L d-ed trees ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-53",
"text": "4 L yield is defined as follows:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-54",
"text": "\u2022 L yield (\u03c4 ) = \u03c3;"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-55",
"text": "\u2022 for n = 0, X 0 : \u03c4 represents a terminal symmantics and the relative scopes of the predicates."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-56",
"text": "4 With L d-ed trees (XA) = \u03c4 \u03c4 and for any other type"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-57",
"text": "bol and L yield (X 0 ) = X. Then, the derivation tree, the derived tree, and the yield of Fig. 2 are represented by: Trees (Kallmeyer and Kuhlmann, 2012) 's process to translate derivation trees into dependency trees is a two-step process."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-58",
"text": "The first one does the actual transformation, using macro-tree transduction, while the second one modifies the way to get the yield from the dependency trees rather than from the derivation ones."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-60",
"text": "**FROM DERIVATION TO DEPENDENCY TREES**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-61",
"text": "This transformation aims at modeling the differences in scope of the argument between the derivation tree for (1) shown in Fig. 2 (a) and the corresponding dependency tree shown in Fig. 2 (b)."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-62",
"text": "For instance, in the derivation trees, claims and seems are under the scope of to love while in the dependency tree this order is reversed."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-63",
"text": "According to (Kallmeyer and Kuhlmann, 2012) , such edge reversal is due to the fact that an edge between a complement taking adjunction (CTA) and an initial tree has to be reversed, while the other edges remain unchanged."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-64",
"text": "Moreover, in case an initial tree accepts several adjunction of CTAs, (Kallmeyer and Kuhlmann, 2012) hypothesizes that the farther from the head a CTA is, the higher it is in the dependency tree."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-65",
"text": "In the case of to love, the s node is farther from the head than the vp node."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-66",
"text": "Therefore any adjunction on the s node (e.g. claims) should be higher than the one on the vp node (e.g. seems) in the dependency tree."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-67",
"text": "We represent the dependency tree for (1) as"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-68",
"text": "In order to do such reversing operations, (Kallmeyer and Kuhlmann, 2012) uses Macro Tree Transducers (MTTs) (Engelfriet and Vogler, 1985) ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-69",
"text": "Note that the MTTs they use are linear, i.e. non-copying."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-70",
"text": "It means that any node of an input tree cannot be translated more than once."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-71",
"text": "(Yoshinaka, 2006) has shown how to encode such MTTs as the composition G \u2022 G \u22121 of two ACGs, and we will use a very similar construct."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-72",
"text": "(Kallmeyer and Kuhlmann, 2012) adds to the transformation from derivation trees to dependency trees the additional constraint that the string associated with a dependency structure is computed directly from the latter, without any reference to the derivation tree."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-73",
"text": "To achieve this, they use two distinct yield functions: yield TAG from derivation trees to strings, and yield dep from dependency trees to strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-75",
"text": "**THE YIELD FUNCTIONS**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-76",
"text": "Let us imagine an initial tree \u03b3 i and an auxiliary tree \u03b3 a with no substitution nodes."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-77",
"text": "The yield of the derived tree resulting from the operations of the derivation tree \u03b3 of Fig. 3 defined in (Kallmeyer and Kuhlmann, 2012)"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-78",
"text": ", w 2 where x, y denotes a tuple of strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-79",
"text": "Because of the adjunction, the corresponding dependency structure has a reverse order (\u03b3 = \u03b3 a (\u03b3 i )), the requirement on yield dep imposes that"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-80",
"text": "In the interpretation of derivation trees as strings, initial trees (with no substitution nodes) Abstract Indeed, an initial tree can have several adjunction sites."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-81",
"text": "In this case, to be ready for another adjunction after a first one, the first result itself should be a tuple of strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-82",
"text": "So an initial tree (with no substitution nodes) with n adjunction sites is interpreted as a (2n + 1)-tuple of strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-83",
"text": "Accordingly, depending on the location where it can adjoin, an auxiliary tree is interpreted as a function from (2k + 1)-tuple of strings to (2k \u2212 1)-tuple of strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-84",
"text": "Taking into account that to model trees having the substitution nodes is then just a matter of adding k string parameters where k is the number of substitution nodes in a tree."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-85",
"text": "Then using the interpretation:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-86",
"text": "yield dep (d to love ) = \u03bbx 11 x 21 ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-87",
"text": "x 11 , x 21 , to love, \u03b5, \u03b5 yield dep (d seems ) = \u03bb x 11 , x 12 , x 13 , x 14 , x 15 ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-88",
"text": "x 11 , x 12 + seems + x 13 x 14 , x 15 yield dep (d claims ) = \u03bbx 21 x 11 , x 13 , x 14 ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-89",
"text": "x 11 + x 21 + claims + x 14 + x 13 we can check that"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-90",
"text": "John + Bill + claims + Mary + seems + to love"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-91",
"text": "Remark."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-92",
"text": "The given interpretation of d to love is only valid for structures reflecting adjunctions both on the s node and on the vp node of \u03b3 to love ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-93",
"text": "So actually, an initial tree such as \u03b3 to love yields four interpretations: one with the two adjunctions (5-tuple), two with one adjunction either on the vp node or on the s node (3-tuple), and one with no adjunction (1-tuple)."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-94",
"text": "The two first cases correspond to the sentences (2a) and (2b)."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-95",
"text": "5 Accordingly, we need multiple interpretations for the auxiliary trees, for instance for the two occurrences of seems in (3) where the yield of the last one yield dep (d seems ) maps a 5-tuple to a 3-tuple, and the yield of the first one maps a 3-tuple to a 3-tuple. And yield dep (d claims ) maps a 3-tuple to a 1-tuple of strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-96",
"text": "We will mimic this behavior by introducing as many different non-terminal symbols for the dependency structures in our ACG setting."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-97",
"text": "(2) a. John Bill claims Mary seems to love b. John Mary seems to love (3) John Bill seems to claim Mary seems to love Remark."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-98",
"text": "Were we not interested in the yields but only in the dependency structures, we wouldn't have to manage this ambiguity."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-99",
"text": "This is true both for (Kallmeyer and Kuhlmann, 2012) 's approach and ours."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-100",
"text": "But as we have here a unified framework for the two-step process they propose, this lexical blow up will result in a multiplicity of types as Section 5 shows."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-102",
"text": "**DISAMBIGUATED DERIVATION TREES**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-103",
"text": "In order to encode the MTT acting on derivation trees, we introduce a new abstract vocabulary \u03a3 der\u03b8 for disambiguated derivation trees as in (Yoshinaka, 2006 to love is used to model sentences where both adjunctions are performed into \u03b3 to love ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-104",
"text": "C 10 to love and C 01 to love are used for sentences where only one adjunction at the s or at the vp node occurs respectively."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-105",
"text": "C 00 to love : np np s is used when no adjunction occurs."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-106",
"text": "6 This really mimics (Yoshinaka, 2006) 's encoding of (Kallmeyer and Kuhlmann, 2012) MTT rules: . . . are designed in order to indicate that a given adjunction has n adjunctions above it (i.e. which scope over it)."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-107",
"text": "The superscripts (2(n + 1))(2(n \u2212 1)) express that an adjunction that has n adjunctions above it is translated as a function that takes a 2(n + 1)-tuple as argument and returns a 2(n \u2212 1)-tuple."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-108",
"text": "To model auxiliary trees which are CTAs we need a different strategy."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-109",
"text": "For each such adjunction tree T we have two sets in \u03a3 der\u03b8 : S 1 T the set of constants which can be adjoined into initial trees and S 2 T the set of constants which can be adjoined into auxiliary trees."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-110",
"text": "For instance, \u03b3 seems would generate S 1 seems that includes C 11 seems31 , C 10 seems31 , C 01 seems31 , C 00 seems31 , C 11 seems53 etc."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-111",
"text": "C 00 seems31 is of type vp 31 A , which means that it can be adjoined into initial trees which contain vp 31 A as an argument type (e.g. C 01 to love )."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-112",
"text": "See note 5. When an auxiliary tree is adjoined into another auxiliary tree as in (3), we do not allow the former to modify the tupleness of the latter."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-113",
"text": "For instance \u03b3 seems would generate S 2 seems that includes C 11 seems3\u22123 , C 10 seems3\u22123 , C 01 seems3\u22123 , C 00 seems3\u22123 , C 11 seems5\u22125 etc."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-114",
"text": "C 00 seems3\u22123 has a subscript (k\u2212k) that corresponds to adjunctions into adjunction trees."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-115",
"text": "The type of C 00 seems3\u22123 is vp 3\u22123 A , meaning that it can directly adjoin into auxiliary trees which have arguments of type vp A , which means that it itself expects an adjunction and the result can be adjoined into another adjunction tree."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-120",
"text": "Now it is easy to define L der from \u03a3 der\u03b8 to \u03a3 der ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-121",
"text": "It maps every type X N \u2208 \u03a3 der\u03b8 to X \u2208 \u03a3 der and every X N A to X A ; types without numbers are mapped to themselves, i.e. s to s, np to np, etc."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-122",
"text": "Moreover, the different versions of a constant that were introduced in order to extract the yield are translated using only one constant and fake adjunctions."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-123",
"text": "For instance:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-125",
"text": "**ENCODING A DEPENDENCY GRAMMAR**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-126",
"text": "The ACG of (Pogodalla, 2009) mapping TAG derivation trees to logical formulas already encoded some reversal of the predicate-argument structure."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-127",
"text": "Here we map the disambiguated derivation trees to dependency structures."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-128",
"text": "The vocabulary that defines these dependency trees is \u03a3 dep ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-129",
"text": "It is also designed to allow us to build two lexicons from it to \u03a3 string (to provide a direct yield function) and to \u03a3 Log (to provide a logical semantic representation)."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-130",
"text": "In"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-131",
"text": "Furthermore, we describe \u03a3 Log and define two lexicons: L dep.yield : \u03a3 dep \u2212\u2192 \u03a3 string and L dep.log : \u03a3 dep \u2212\u2192 \u03a3 Log ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-134",
"text": "Table 2 provides examples of these two translations."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-135",
"text": "L dep.yield : It translates any atomic type \u03c4 n or \u03c4 n X with X \u2208 {n A , n d A . . .} as an n-tuple of strings; the types of non-complement-taking verbal or sentential adjunctions \u03c4 2 vp and \u03c4 2 s are translated as t t. Let us show for the sentence (1) how the ACGs defined above work with the data provided in Table 2 ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-137",
"text": "Its representation in \u03a3 der\u03b8 is:"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-138",
"text": "f (John + (Bill + (claims + ((Mary + ((seems + to love) + )) + )))) and L dep.log (t 0 ) = claim bill (seem (love john mary))."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-141",
"text": "**CONCLUSION**"
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-142",
"text": "In this paper, we have given an ACG perspective on the transformation of the derivation trees of TAG to the dependency trees proposed in (Kallmeyer and Kuhlmann, 2012) ."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-143",
"text": "Figure 4 illustrates the architecture we propose."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-144",
"text": "This transformation is a two-step process: first a macro tree transduction, then an interpretation of dependency trees as (tuples of) strings."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-145",
"text": "It was known from Yoshinaka (2006) how to encode a macro tree transducer into a G dep \u2022 G \u22121 der ACG composition."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-146",
"text": "Dealing with typed trees to represent derivation trees allows us to provide a meaningful (w.r.t. the TAG formalism) abstract vocabulary \u03a3 der\u03b8 encoding this macro tree transducer."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-148",
"text": "The encoding of the second step then made explicit the lexical blow-up in the interpretation of the functional symbols of the dependency trees in Kallmeyer and Kuhlmann (2012)'s construction."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-149",
"text": "It also provides a pushout (in the categorical sense) of the two morphisms from the disambiguated derivation trees to the derived trees and to the dependency trees."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-150",
"text": "The diagram is completed with the yield function from the derived trees and from the dependency trees to the string vocabulary."
},
{
"sent_id": "f1eae0918a246174b1866ba71d4efc-C001-151",
"text": "Finally, under the assumption of (Kallmeyer and Kuhlmann, 2012) of plausible dependency structures, we get two possible grammatical approaches to the surface-semantics relation that are related but independent: it can be equivalently modeled using either a phrase structure or a dependency model."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f1eae0918a246174b1866ba71d4efc-C001-2",
"f1eae0918a246174b1866ba71d4efc-C001-3",
"f1eae0918a246174b1866ba71d4efc-C001-4"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-12",
"f1eae0918a246174b1866ba71d4efc-C001-13"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-14",
"f1eae0918a246174b1866ba71d4efc-C001-15",
"f1eae0918a246174b1866ba71d4efc-C001-16"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-42"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-57",
"f1eae0918a246174b1866ba71d4efc-C001-58"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-61",
"f1eae0918a246174b1866ba71d4efc-C001-62",
"f1eae0918a246174b1866ba71d4efc-C001-63"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-64"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-68",
"f1eae0918a246174b1866ba71d4efc-C001-69",
"f1eae0918a246174b1866ba71d4efc-C001-70"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-77",
"f1eae0918a246174b1866ba71d4efc-C001-78"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-100",
"f1eae0918a246174b1866ba71d4efc-C001-98",
"f1eae0918a246174b1866ba71d4efc-C001-99"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-103",
"f1eae0918a246174b1866ba71d4efc-C001-104",
"f1eae0918a246174b1866ba71d4efc-C001-105",
"f1eae0918a246174b1866ba71d4efc-C001-106"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-142",
"f1eae0918a246174b1866ba71d4efc-C001-143",
"f1eae0918a246174b1866ba71d4efc-C001-144"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-148",
"f1eae0918a246174b1866ba71d4efc-C001-149",
"f1eae0918a246174b1866ba71d4efc-C001-150"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-151"
]
],
"cite_sentences": [
"f1eae0918a246174b1866ba71d4efc-C001-2",
"f1eae0918a246174b1866ba71d4efc-C001-12",
"f1eae0918a246174b1866ba71d4efc-C001-15",
"f1eae0918a246174b1866ba71d4efc-C001-42",
"f1eae0918a246174b1866ba71d4efc-C001-57",
"f1eae0918a246174b1866ba71d4efc-C001-63",
"f1eae0918a246174b1866ba71d4efc-C001-64",
"f1eae0918a246174b1866ba71d4efc-C001-68",
"f1eae0918a246174b1866ba71d4efc-C001-77",
"f1eae0918a246174b1866ba71d4efc-C001-99",
"f1eae0918a246174b1866ba71d4efc-C001-106",
"f1eae0918a246174b1866ba71d4efc-C001-142",
"f1eae0918a246174b1866ba71d4efc-C001-148",
"f1eae0918a246174b1866ba71d4efc-C001-151"
]
},
"@MOT@": {
"gold_contexts": [
[
"f1eae0918a246174b1866ba71d4efc-C001-14",
"f1eae0918a246174b1866ba71d4efc-C001-15",
"f1eae0918a246174b1866ba71d4efc-C001-16"
],
[
"f1eae0918a246174b1866ba71d4efc-C001-100",
"f1eae0918a246174b1866ba71d4efc-C001-98",
"f1eae0918a246174b1866ba71d4efc-C001-99"
]
],
"cite_sentences": [
"f1eae0918a246174b1866ba71d4efc-C001-15",
"f1eae0918a246174b1866ba71d4efc-C001-99"
]
},
"@SIM@": {
"gold_contexts": [
[
"f1eae0918a246174b1866ba71d4efc-C001-100",
"f1eae0918a246174b1866ba71d4efc-C001-98",
"f1eae0918a246174b1866ba71d4efc-C001-99"
]
],
"cite_sentences": [
"f1eae0918a246174b1866ba71d4efc-C001-99"
]
}
}
},
"ABC_6293d300ab46a6d6135ed256005403_6": {
"x": [
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-41",
"text": "**BACKGROUND**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-2",
"text": "Document enrichment focuses on retrieving relevant knowledge from external resources, which is essential because text is generally replete with gaps."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-3",
"text": "Since conventional work primarily relies on special resources, we instead use triples of Subject, Predicate, Object as knowledge and incorporate distributional semantics to rank them."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-4",
"text": "Our model first extracts these triples automatically from raw text and converts them into real-valued vectors based on the word semantics captured by Latent Dirichlet Allocation."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-5",
"text": "We then represent these triples, together with the source document that is to be enriched, as a graph of triples, and adopt a global iterative algorithm to propagate relevance weight from source document to these triples so as to select the most relevant ones."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-6",
"text": "Evaluated as a ranking problem, our model significantly outperforms multiple strong baselines."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-7",
"text": "Moreover, we conduct a task-based evaluation by incorporating these triples as additional features into document classification, which improves performance by 3.02%."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-10",
"text": "Document enrichment is the task of acquiring relevant background knowledge from external resources for a given document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-11",
"text": "This task is essential because, during the writing of text, some basic but well-known information is usually omitted by the author to make the document more concise."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-12",
"text": "For example, the fact that Baghdad is the capital of Iraq is omitted in Figure 1a ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-13",
"text": "A human will fill these gaps automatically with the background knowledge in his mind."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-14",
"text": "However, a machine lacks both the necessary background knowledge and the ability to select it."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-15",
"text": "The task of document enrichment is proposed to tackle this problem, and has been proved helpful in many NLP tasks such as web search (Pantel and Fuxman, 2011), coreference resolution (Bryl et al., 2010), document clustering (Hu et al., 2009) and entity disambiguation (Sen, 2012)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-16",
"text": "We can classify previous work into two classes according to the resources they rely on."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-17",
"text": "The first line of work uses Wikipedia, the largest on-line encyclopedia, as a resource and introduces the content of Wikipedia pages as external knowledge (Cucerzan, 2007; Kataria et al., 2011; He et al., 2013) ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-18",
"text": "Most research in this area relies on the text similarity (Zheng et al., 2010; Hoffart et al., 2011) and structure information (Kulkarni et al., 2009; Sen, 2012; He et al., 2013) between the mention and the Wikipedia page."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-19",
"text": "Despite the apparent success of these methods, most Wikipedia pages contain too much information, most of which is not relevant enough to the source document, and this causes a noise problem."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-20",
"text": "Another line of work tries to improve the accuracy by introducing ontologies (Fodeh et al., 2011; Kumar and Salim, 2012) and structured knowledge bases such as WordNet (Nastase et al., 2010) , which provide semantic information about words such as synonym and antonym (Sansonnet and Bouchet, 2010) ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-21",
"text": "However, these methods primarily rely on special resources constructed with supervision or even manually, which are difficult to expand and in turn limit their applications in practice."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-22",
"text": "In contrast, we wish to seek the benefits of both coverage and accuracy from a better representation of background knowledge: triples of Subject, Predicate, Object (SPO)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-23",
"text": "According to Hoffart et al. (2013) , these triples, such as LeonardCohen, wasBornIn, Montreal, can be extracted automatically from Wikipedia and other sources, which is compatible with the RDF data model (Staab and Studer, 2009). Figure 1 : A source document about a U.S. air strike omitting two important pieces of background knowledge which are acquired by our framework."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-24",
"text": "By extracting triples from multiple sources, we also get better coverage."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-25",
"text": "Therefore, one can expect that this representation is helpful for better document enrichment by incorporating both accuracy and coverage."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-26",
"text": "In fact, there is already evidence that this representation is helpful."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-27",
"text": "Zhang et al. (2014) proposed a triple-based document enrichment framework which uses triples of SPO as background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-28",
"text": "They first proposed a search engine-based method to evaluate the relatedness between every pair of triples, and then an iterative propagation algorithm was introduced to select the triples most relevant to a given source document (see Section 2), which achieved a good performance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-29",
"text": "However, to evaluate the semantic relatedness between two triples, Zhang et al. (2014) primarily relied on the text of triples and used search engines, which makes their method difficult to re-implement and in turn limits its application in practice."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-30",
"text": "Moreover, they did not carry out any task-based evaluation, which makes it uncertain whether their method will be helpful in real applications."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-31",
"text": "Therefore, we instead use topic models, especially Latent Dirichlet Allocation (LDA), to encode distributional semantics of words and convert every triple into a real-valued vector, which is then used to evaluate the relatedness between a pair of triples."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-32",
"text": "We then incorporate these triples into the given source document and represent them together as a graph of triples."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-33",
"text": "Then a modified iterative propagation is carried out over the entire graph to select the most relevant triples of background knowledge to the given source document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-34",
"text": "To evaluate our model, we conduct two series of experiments: (1) evaluation as a ranking problem, and (2) task-based evaluation."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-35",
"text": "We first treat this task as a ranking problem which inputs one document and outputs the top N most-relevant triples of background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-36",
"text": "Second, we carry out a task-based evaluation by incorporating these relevant triples acquired by our model into the original model of document classification as additional features."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-37",
"text": "We then perform a direct comparison between the classification models with and without these triples, to determine whether they are helpful or not."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-38",
"text": "On the first series of experiments, we achieve a MAP of 0.6494 and a P@N of 0.5597 in the best situation, which outperforms the strongest baseline by 5.87% and 17.21%."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-39",
"text": "In the task-based evaluation, the enriched model derived from the triples of background knowledge performs better by 3.02%, which demonstrates the effectiveness of our framework in real NLP applications."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-42",
"text": "The most closely related work in this area is our own (Zhang et al., 2014) , which used the triples of SPO as background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-43",
"text": "In that work, we first proposed a triple graph to represent the source document and then used a search engine-based iterative algorithm to rank all the triples."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-44",
"text": "We describe this work in detail below."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-45",
"text": "Triple graph: Zhang et al. (2014) proposed the triple graph as a document representation, where the triples of SPO serve as nodes, and the edges between nodes indicate their semantic relatedness."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-46",
"text": "There are two kinds of nodes in the triple graph:"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-47",
"text": "(1) source document nodes (sd-nodes), which are triples extracted from source documents, and (2) background knowledge nodes (bk-nodes), which are triples extracted from external sources."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-48",
"text": "Both of them are extracted automatically with Reverb, a well-known Open Information Extraction system (Etzioni et al., 2011) ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-49",
"text": "There are also two kinds of edges: (1) an edge between a pair of sd-nodes, and (2) an edge between an sd-node and a bk-node, both of which are unidirectional."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-50",
"text": "In the original representation, there are no edges between two bk-nodes because they treat the bk-nodes as recipients of relevance weight only."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-51",
"text": "In this paper, we modify this setup and connect every pair of bk-nodes with an edge, so that the bk-nodes serve as intermediate nodes during the iterative propagation process and contribute to the final performance, as shown in our experiments (see Section 5.1)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-52",
"text": "Relevance evaluation: To compute the weight of an edge, Zhang et al. (2014) evaluate the semantic relatedness between two nodes with a search engine-based method."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-53",
"text": "They first convert every node, which is a triple of SPO, into a query by combining the text of Subject and Object together."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-54",
"text": "Then for every pair of nodes t i and t j , they construct three queries: p, q, and p \u2229 q, which correspond to the queries of t i , t j , and t i \u2229 t j , the combination of t i and t j ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-55",
"text": "All these queries are put into a search engine to get H(p), H(q), and H(p \u2229 q), the numbers of returned pages for the queries p, q, and p \u2229 q. Then the WebJaccard coefficient (Bollegala et al., 2007) is used to evaluate r(i, j), the relatedness between t i and t j , according to Formula 1."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-56",
"text": "r(i, j) = H(p \u2229 q) / (H(p) + H(q) \u2212 H(p \u2229 q)) if H(p \u2229 q) is above a fixed threshold c, and 0 otherwise."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-57",
"text": "(1)"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-58",
"text": "Using r(i, j), Zhang et al. (2014) further define p(i, j), the probability of t i and t j propagating to each other, as shown in Formula 2."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-59",
"text": "Here N is the set of all nodes, and \u03b4 (i, j) denotes whether an edge exists between two nodes or not."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-60",
"text": "Iterative propagation: Since the source document D is represented as a graph of sd-nodes, the relevance of background knowledge t b to D is naturally converted into that of t b to the graph of sd-nodes."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-61",
"text": "Zhang et al. (2014) evaluate this relevance by propagating relevance weight from sd-nodes to t b iteratively."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-62",
"text": "After convergence, the relevance weight of t b is treated as its final relevance to D. There are in total n \u00d7 n pairs of nodes, and their p(i, j) are stored in a matrix P. Zhang et al. (2014) use W = (w 1 , w 2 , . . . , w n ) to denote the relevance weights of the nodes, where w i indicates the relevance of t i to D. At the beginning, the weight of each bk-node is initialized to 0, and that of each sd-node is initialized to its importance to D. Then W is updated after every iteration according to Formula 3. They keep updating the weights of both sd-nodes and bk-nodes until convergence and do not distinguish them explicitly."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-64",
"text": "**METHODOLOGY**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-65",
"text": "The key idea behind this work is that every document is composed of several units of information, which can be extracted into triples automatically."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-66",
"text": "For every unit of background knowledge b, the more units that are relevant to b and the more relevant they are, the more relevant b will be to the source document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-67",
"text": "Based on this intuition, we first present both source document information and background knowledge together as a document-level triple graph as illustrated in Section 2."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-68",
"text": "Then we use LDA to capture the distributional semantics of a triple by representing it as a vector of distributional probabilities over k topics and evaluate the relatedness between two triples with cosine-similarity."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-69",
"text": "Finally, we propose a modified iterative process to propagate the relevance score from the source document information to the background knowledge and select the top n relevant ones."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-71",
"text": "**ENCODING DISTRIBUTIONAL SEMANTICS**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-72",
"text": "LDA: LDA is a popular generative probabilistic model, which was first introduced by Blei et al. (2003) ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-73",
"text": "LDA views every document as a mixture over underlying topics, and each topic as a distribution over words."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-74",
"text": "Both the document-topic and the topic-word distributions are assumed to have a Dirichlet prior."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-75",
"text": "Given a set of documents and a number of topics, the model returns \u03b8 d , the topic distribution for each document d, and \u03c6 z , the word distribution for every topic z."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-76",
"text": "LDA assumes the following generative process for each document in a corpus D:"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-78",
"text": "Choose \u03b8 \u223c Dir(\u03b1); then for each of the N words w n :"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-79",
"text": "(a) Choose a topic z n \u223c Multinomial(\u03b8 )."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-80",
"text": "(b) Choose a word w n from p(w n |z n , \u03b2 ) conditioned on the topic z n ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-81",
"text": "Here the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic vari- able z) is assumed to be known and fixed; \u03b8 is a kdimensional Dirichlet random variable, where the parameter \u03b1 is a k-vector with components \u03b1 i > 0; and the \u03b2 indicates the word probabilities over topics, which is a matrix with \u03b2 i j = p(w j = 1|z i = 1)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-82",
"text": "Figure 2 shows the representation of LDA as a probabilistic graphical model with three levels."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-83",
"text": "There are two corpus-level parameters \u03b1 and \u03b2 , which are assumed to be sampled once in the process of generating a corpus; one document-level variable \u03b8 d , which is sampled once per document; and two word-level variables z dn and w dn , which are sampled once for each word in each document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-84",
"text": "We employ the publicly available implementation of LDA, JGibbLDA (Phan et al., 2008) , which has two main execution methods: parameter estimation (model building) and inference for new data (classification of a new document)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-85",
"text": "Relevance evaluation: Given a set of documents and the number of topics k, LDA returns \u03c6 z , the word distribution for each topic z. So for every word w n , we get k distributional probabilities over the k topics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-86",
"text": "We use p w n z i to denote the probability that w n appears in the i th topic z i , where i \u2264 k, z i \u2208 Z, the set of k topics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-87",
"text": "Then we combine these k probabilities into a real-valued vector v w n representing w n , as shown in Formula 4."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-88",
"text": "After getting the vectors of words, we employ an intuitive method to compute the vector of a triple t, by accumulating all the corresponding vectors of words appearing in t according to Formula 5. Considering that the elements of this newly generated vector indicate the distributional probabilities of t over k topics, we then normalize it according to Formula 6 so that its elements sum to 1."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-89",
"text": "This gives us v t , the real-valued vector of triple t, which captures its distributional probabilities over k topics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-90",
"text": "Here t corresponds to a triple of background knowledge or of the source document, p tz i indicates the probability of t appearing in the i th topic z i , and w n \u2208 t means that w n appears in t."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-91",
"text": "Using the vectors of triples, we can easily compute the semantic relatedness between a pair of triples as their cosine-similarity according to Formula 7."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-92",
"text": "Here A, B correspond to the real-valued vectors of two triples, r(A, B) denotes their semantic relatedness, and k is the number of topics, which is also the length of A (or B)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-93",
"text": "A high value of r(A, B) usually indicates a close relatedness between A and B, and thus a higher probability of propagating to each other in the following modified iterative propagation illustrated in Section 3.2."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-95",
"text": "**MODIFIED ITERATIVE PROPAGATION**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-96",
"text": "In this part, we propose a modified iterative propagation based ranking model to select the most relevant triples of background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-97",
"text": "There are three primary modifications to the original model of Zhang et al. (2014) , all of which are shown to be more effective in our experiments."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-98",
"text": "First of all, the original model (Zhang et al., 2014) does not reset the relevance weight of sd-nodes after every iteration."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-99",
"text": "This results in a continued decrease of the relevance weight of sd-nodes, which weakens the effect of sd-nodes during the iterative propagation and in turn affects the final performance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-100",
"text": "To tackle this problem, we decrease the relevance weight of bk-nodes and increase that of sd-nodes according to a fixed ratio after every iteration, so as to ensure that the total weight of sd-nodes is always higher than that of bk-nodes."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-120",
"text": "Then a SVM classifier takes this vector as input and outputs the class of the document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-101",
"text": "Note that although the relevance weights of bk-nodes are changed after the redistribution, the corresponding ranking of them is not changed because the redistribution is carried out over all nodes accordingly."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-102",
"text": "In our experiments, we tried different ratios and finally chose 10:1, with sd-nodes corresponding to 10 and bk-nodes to 1, which achieved the best performance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-103",
"text": "In addition, we also modify the triple graph, the representation of a document illustrated in Section 2, by connecting every pair of bk-nodes with an edge, which is not allowed in the original model."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-104",
"text": "This modification was motivated by the intuition that the relatedness between bk-nodes also contributes to the better evaluation of relevance to the source document, because the bk-nodes can serve as the intermediate nodes during the iterative propagation over the entire graph."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-105",
"text": "Figure 3 shows an example, where the bk-node John Lennon is close to both the sd-node Beatles and to another bknode Yoko Ono, so the relatedness between two bk-nodes John Lennon and Yoko Ono helps in better evaluation of the relatedness between the bknode Yoko Ono and the sd-node Beatles."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-106",
"text": "We also modify the definition of p(i, j), the probability of two nodes t i and t j propagating to each other."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-107",
"text": "Zhang et al. (2014) compute this probability according to Formula 2, which highlights the number of neighbors, but weakens the relatedness between nodes, due to the normalization."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-108",
"text": "For instance, if a node t x has only one neighbor t y , no matter how low their relatedness is, their p(x, y) will still be equal to 1 in the original model, while another node with two equally but closely related neighbors will only get a probability of 0.5 for each neighbor."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-109",
"text": "We modify this setup by removing the normalization process and computing p(i, j) as the relatedness between t i and t j directly, which is evaluated according to Formula 1 ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-111",
"text": "**ENCODING BACKGROUND KNOWLEDGE INTO DOCUMENT CLASSIFICATION**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-112",
"text": "In this part, we demonstrate that the introduction of relevant knowledge could be helpful to real NLP applications."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-113",
"text": "In particular, we choose the document classification task as a demonstration, which aims to classify documents into predefined categories automatically (Sebastiani, 2002 )."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-114",
"text": "We choose this task for two reasons: (1) This task has witnessed a booming interest in the last 20 years, due to the increased availability of documents in digital form and the ensuing need to organize them, so it is important in both research and application."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-115",
"text": "(2) The state-of-the-art performance of this task is achieved by a series of topic modelbased methods, which rely on the same model as we do, but make use of source document information only."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-116",
"text": "However, there is always some omitted information and relevant knowledge, which cannot be captured from the source document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-117",
"text": "Intuitively, the recovery of this information will be helpful."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-118",
"text": "If we can improve the performance by introducing extra background knowledge into existing framework of document classification, we can inference naturally that the improvement benefits from the introduction of this knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-119",
"text": "Traditional methods primarily use topic models to represent a document as a topic vector."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-121",
"text": "In this work, we propose a new framework for document classification to incorporate extra knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-122",
"text": "Given a document to be classified, we select the top N mostrelevant triples of background knowledge with our model introduced in Section 3, all of which are represented as vectors of v t = (p tz 1 , p tz 2 , . . . , p tz k )."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-123",
"text": "Then we combine these N triples as a new vector v t , which is then incorporated into the original framework of document classification."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-124",
"text": "Another SVM classifier takes v t , together with the original features extracted from the source document, as input and outputs the category of the source document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-125",
"text": "To combine N triples as one, we employ an intuitive method by computing the average of N corresponding vectors in every dimension."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-126",
"text": "One possible problem is how to decide N, the number of triples to be introduced."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-127",
"text": "We first introduce a fixed amount of triples for every document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-128",
"text": "Moreover, we also select the triples according to their relevance weight to the source document (see Section 3.2) by setting a threshold of relevance weight first and selecting the triples whose weights are higher than the threshold."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-129",
"text": "We further discuss the impact of different thresholds in Section 5.2."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-132",
"text": "To evaluate our model, we conduct two series of experiments: (1) We first treat this task as a ranking problem, which takes a document as input and outputs the ranked triples of background knowledge, and evaluate the ranking performance by computing the scores of MAP and P@N. (2) We also conduct a task-based evaluation, where document classification (see Section 4) is chosen as a demonstration, by enriching the background knowledge to the original framework as additional features and performing a direct comparison."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-134",
"text": "**EVALUATION AS A RANKING PROBLEM**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-135",
"text": "Data preparation The data is composed of two parts: source documents and background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-136",
"text": "For source documents, we use a publicly available Chinese corpus which consists of 17,199 documents and 13,719,428 tokens extracted from Internet news 2 including 9 topics: Finance, IT, Health, Sports, Travel, Education, Jobs, Art, Military."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-137",
"text": "We then randomly but equally select 600 articles as the set of source documents from 9 topics without data bias."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-138",
"text": "We use all the other 16,599 documents of the same corpus as the source of background knowledge, and then introduce a wellknown Chinese open source tool (Che et al., 2010) to extract the triples of background knowledge from the raw text automatically."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-139",
"text": "So the background knowledge also distributes evenly across the same 9 topics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-140",
"text": "We use the same tool to extract the triples of source documents too."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-141",
"text": "Baseline systems As Zhang et al. (2014) argued, it is difficult to use the methods in traditional ranking tasks, such as information retrieval (Manning et al., 2008) and entity linking (Han et al., 2011; Sen, 2012) , as baselines in this task, because our model takes triples as basic input and thus lacks some crucial information such as link structure."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-142",
"text": "For better comparison, we implement three methods as baselines, which have been proved effective in relevance evaluation: (1) Vector Space Model (VSM), (2) Word Embedding (WE), and (3) Latent Dirichlet Allocation (LDA)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-143",
"text": "Note that our model captures the distributional semantics of triples with LDA, while WE serves as a baseline only, where the word embeddings are acquired over the same corpus mentioned previously with 2 http://www.sogou.com/labs/dl/c.html the publicly available tool word2vec 3 ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-144",
"text": "Here we use t i , D, and w i to denote a triple of background knowledge, a source document, and the relevance of t i to D. For VSM, we represent both t i and D with a tf-idf scheme first (Salton and McGill, 1986) and compute w i as their cosinesimilarity."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-145",
"text": "For WE, we first convert both t i and the triples extracted from D into real-valued vectors with WE and then compute w i by accumulating all the cosine-similarities between t i and every triple from D. For LDA, we represent t i as a vector with our model introduced in Section 3.1 and get the vector of D directly with LDA."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-146",
"text": "Then we evaluate their relevance of t i to D by computing the cosinesimilarity of two corresponding vectors."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-147",
"text": "Moreover, to determine whether our modified iterative propagation is helpful or not, we also compare our full model (Ours) against a simplified version without iterative propagation (Ours-S)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-148",
"text": "In Ours-S, we represent both t i and the triples extracted from D as real-valued vectors with our model introduced in Section 3.1."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-149",
"text": "Then we compute w i by accumulating all the cosine-similarities between t i and the triples extracted from D. For all the baselines, we rank the triples of background knowledge according to w i , their relevance to D."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-150",
"text": "Experimental setup Previous research relies on manual annotation to evaluate the ranking performance (Zhang et al., 2014) , which costs a lot, and in which it is difficult to get high consistency."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-151",
"text": "In this paper, we carry out an automatic evaluation."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-152",
"text": "The corpus we used consists of 9 different classes, from which we extract triples of background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-153",
"text": "So correspondingly, there will be 9 sets of triples too."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-154",
"text": "Then we randomly select 200 triples from every class and mix 200 \u00d7 9 = 1800 triples together as S, the set of triples of background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-155",
"text": "For every document D to be enriched, our model selects the top N mostrelevant triples from S and returns them to D as enrichments."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-156",
"text": "We treat a triple t i selected by our model as positive only if t i is extracted from the same class as D. We evaluate the performance of our model with two well-known criteria in ranking problem: MAP and P@N (Voorhees et al., 2005) ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-157",
"text": "Statistically significant differences of performance are determined using the two-tailed paired t-test computed at a 95% confidence level based on the average performance per source document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-158",
"text": "The performance evaluated as a ranking task."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-159",
"text": "Here Ours corresponds to our full model, while Ours-S is a simplified version of our model without iterative propagation (see Section 3.2)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-160",
"text": "Results The performance of multiple models is shown in Table 1 ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-161",
"text": "Overall, our full model Ours outperforms all the baseline systems significantly in every metric."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-162",
"text": "When evaluating the top 10 triples with the highest relevance weight, our framework outperforms the best baseline LDA by 4.4% in MAP and by 3.91% in P@N. When evaluating the top 5 triples, our framework performs even better and significantly outperforms the best baseline by 5.87% in MAP and by 17.21% in P@N."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-163",
"text": "To analyze the results further, Ours-S, the simplified version of our model without iterative propagation, outperforms two strong baselines VSM and WE, which indicates the effectiveness of encoding distributional semantics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-164",
"text": "However, the performance of this simplified model is not as good as that of LDA, because Ours-S evaluates the relevance with simple accumulation, which fails to capture the relatedness between multiple triples from the source document."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-165",
"text": "We tackle this problem by incorporating the modified iterative propagation over the entire triple graph into Ours, which achieves the best performance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-166",
"text": "One possible problem is why WE has a poor performance, the reason of which lies in the setup of our evaluation, where we label positive and negative instances according to the class information of triples and documents."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-167",
"text": "This is better fit for topic model-based methods."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-168",
"text": "Discussion We further analyze the impact of the three modifications we made to the original model (see Section 3.2)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-169",
"text": "We first focus on the impact of decreasing the relevance weight of bk-nodes and increasing that of sd-nodes after every iteration."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-170",
"text": "As mentioned previously, we change their relevance weight according to a fixed ratio, which is important to the performance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-171",
"text": "Figure 4 shows the performance of models with different ratios."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-172",
"text": "With any increase of the ratio, our model improves its performance in every metric, which shows the effectiveness of this setup."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-173",
"text": "The performance remains stable from the value of 10:1, which is thus chosen as the final value in our experiments."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-174",
"text": "We then turn to the other two modifications about the edges between bk-nodes and the setup of propagation probability."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-175",
"text": "Table 2 shows the performance of our full model and the simplified models without these two modifications."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-176",
"text": "With the edges between bk-nodes, our model improves the performance by 1.48% in MAP 5 and by 1.82% in P@5."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-177",
"text": "With the modified iterative propagation, we achieve a even greater improvement of 13.99% in MAP 5 and 24.27% in P@5."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-178",
"text": "All these improvements are statistically significant, which indicates the effectiveness of these modifications to the original model."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-179",
"text": "Table 2 : The performance of our full model (Full) and two simplified models without modifications:"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-180",
"text": "(1) without edges between bk-nodes (Full\u2212bb), (2) without the newly proposed definition of propagation probability between nodes (Full\u2212p)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-182",
"text": "**TASK-BASED EVALUATION**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-183",
"text": "Data preparation To carry out the task-based evaluation, we use the same Chinese corpus as that in previous experiments, which consists of 17,199 documents extracted from Internet news in 9 topics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-184",
"text": "We also use the same tool (Che et al., 2010) We also evaluate the impact of knowledge quality by proposing two different models to introduce background knowledge: our full model introduced in Section 3 (Ours), and a simplified version of our model without iterative propagation (Ours-S)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-185",
"text": "They have different performances on introducing background knowledge as shown in previous experiments (see Section 5.1)."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-186",
"text": "We then conduct a direct comparison between the document classification models with these conditions, whose differing performances demonstrates the impact of different qualities of background knowledge on this task."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-187",
"text": "Results Table 3 shows the results."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-188",
"text": "We use P, R, F to evaluate the performance, which are computed as the micro-average over 9 topics."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-189",
"text": "Both models with background knowledge (LDA+SVM+Ours-S, LDA+SVM+Ours) outperform systems without knowledge, which shows that the introduction of background knowledge helps in better classification of documents."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-190",
"text": "The performance of document classification models with different thresholds. The knowledge whose relevance weight to the source document exceeds the threshold is introduced as background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-192",
"text": "The system with the simplified version of our model without iterative propagation (LDA+SVM+Ours-S) achieves a F-value of 0.8501, which outperforms the other baselines without knowledge too."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-193",
"text": "Moreover, the system with our full model (LDA+SVM+Ours) achieves the best performance, a F-value of 0.8691, and outperforms the best baseline LDA+SVM significantly."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-194",
"text": "This shows that introducing better quality of background knowledge is helpful to the better classification of documents."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-195",
"text": "Statistical significance is also verified using the two-tailed paired t-test computed at a 95% confidence level based on the results of classification over the test set."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-196",
"text": "Discussion One important question here is how much background knowledge to include."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-197",
"text": "As mentioned in Section 4, we have tried two different solutions: (1) introducing a fixed amount of background knowledge for every document, and (2) setting a threshold and selecting knowledge whose relevance weight exceeds the threshold."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-198",
"text": "The results are shown in Table 4 , where the systems with threshold outperform that with fixed amount, which shows that the threshold helps in better introduction of background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-199",
"text": "Table 4 : The performance of document classification with the full model (Ours) and the simplified model (Ours-S) to introduce knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-200",
"text": "We also evaluate the impact of different thresholds as shown in Figure 5 ."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-201",
"text": "The performance keeps improving as the threshold increases up to 6.4 and becomes steady from 6.4 to 6.7, while it begins to decline sharply from 6.7."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-202",
"text": "This is reasonable because at the beginning, as the threshold increases, we recall more background knowledge and provide more information."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-203",
"text": "However, with the further increase of the threshold, we introduce more noise, which decreases the performance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-204",
"text": "In our experiments, we choose 6.4 as the final threshold."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-206",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-207",
"text": "This study encodes distributional semantics into the triple-based background knowledge ranking model (Zhang et al., 2014) for better document enrichment."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-208",
"text": "We first use LDA to represent every triple as a real-valued vector, which is used to evaluate the relatedness between triples, and then propose a modified iterative propagation model to rank all the triples of background knowledge."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-209",
"text": "For evaluation, we conduct two series of experiments: (1) evaluation as ranking problem, and (2) taskbased evaluation, especially for document classification."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-210",
"text": "In the first set of experiments, our model outperforms multiple strong baselines based on VSM, LDA, and WE."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-211",
"text": "In the second set of experiments, our full model with background knowledge outperforms the state-of-the-art systems significantly."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-212",
"text": "Moreover, we also explore the impact of knowledge quality and show its importance."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-213",
"text": "In our future work, we wish to explore a better way to encode distributional semantics by proposing a modified LDA for better triples representation."
},
{
"sent_id": "6293d300ab46a6d6135ed256005403-C001-214",
"text": "In addition, we also want to explore the effect of introducing background knowledge in conjunction with other NLP tasks, especially with discourse parsing (Marcu, 2000; Pitler et al., 2009 )."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"6293d300ab46a6d6135ed256005403-C001-27",
"6293d300ab46a6d6135ed256005403-C001-28"
],
[
"6293d300ab46a6d6135ed256005403-C001-29",
"6293d300ab46a6d6135ed256005403-C001-30",
"6293d300ab46a6d6135ed256005403-C001-31",
"6293d300ab46a6d6135ed256005403-C001-32",
"6293d300ab46a6d6135ed256005403-C001-33"
],
[
"6293d300ab46a6d6135ed256005403-C001-42",
"6293d300ab46a6d6135ed256005403-C001-43"
],
[
"6293d300ab46a6d6135ed256005403-C001-45",
"6293d300ab46a6d6135ed256005403-C001-46",
"6293d300ab46a6d6135ed256005403-C001-47",
"6293d300ab46a6d6135ed256005403-C001-48",
"6293d300ab46a6d6135ed256005403-C001-49",
"6293d300ab46a6d6135ed256005403-C001-50",
"6293d300ab46a6d6135ed256005403-C001-51"
],
[
"6293d300ab46a6d6135ed256005403-C001-52",
"6293d300ab46a6d6135ed256005403-C001-53",
"6293d300ab46a6d6135ed256005403-C001-54",
"6293d300ab46a6d6135ed256005403-C001-55",
"6293d300ab46a6d6135ed256005403-C001-56"
],
[
"6293d300ab46a6d6135ed256005403-C001-58",
"6293d300ab46a6d6135ed256005403-C001-59"
],
[
"6293d300ab46a6d6135ed256005403-C001-61",
"6293d300ab46a6d6135ed256005403-C001-62"
],
[
"6293d300ab46a6d6135ed256005403-C001-96",
"6293d300ab46a6d6135ed256005403-C001-97"
],
[
"6293d300ab46a6d6135ed256005403-C001-100",
"6293d300ab46a6d6135ed256005403-C001-98",
"6293d300ab46a6d6135ed256005403-C001-99"
],
[
"6293d300ab46a6d6135ed256005403-C001-150",
"6293d300ab46a6d6135ed256005403-C001-151"
],
[
"6293d300ab46a6d6135ed256005403-C001-207",
"6293d300ab46a6d6135ed256005403-C001-208",
"6293d300ab46a6d6135ed256005403-C001-209",
"6293d300ab46a6d6135ed256005403-C001-210",
"6293d300ab46a6d6135ed256005403-C001-211",
"6293d300ab46a6d6135ed256005403-C001-212"
]
],
"cite_sentences": [
"6293d300ab46a6d6135ed256005403-C001-29",
"6293d300ab46a6d6135ed256005403-C001-42",
"6293d300ab46a6d6135ed256005403-C001-45",
"6293d300ab46a6d6135ed256005403-C001-52",
"6293d300ab46a6d6135ed256005403-C001-58",
"6293d300ab46a6d6135ed256005403-C001-62",
"6293d300ab46a6d6135ed256005403-C001-97",
"6293d300ab46a6d6135ed256005403-C001-98",
"6293d300ab46a6d6135ed256005403-C001-150",
"6293d300ab46a6d6135ed256005403-C001-207"
]
},
"@MOT@": {
"gold_contexts": [
[
"6293d300ab46a6d6135ed256005403-C001-29",
"6293d300ab46a6d6135ed256005403-C001-30",
"6293d300ab46a6d6135ed256005403-C001-31",
"6293d300ab46a6d6135ed256005403-C001-32",
"6293d300ab46a6d6135ed256005403-C001-33"
],
[
"6293d300ab46a6d6135ed256005403-C001-100",
"6293d300ab46a6d6135ed256005403-C001-98",
"6293d300ab46a6d6135ed256005403-C001-99"
],
[
"6293d300ab46a6d6135ed256005403-C001-106",
"6293d300ab46a6d6135ed256005403-C001-107",
"6293d300ab46a6d6135ed256005403-C001-109"
],
[
"6293d300ab46a6d6135ed256005403-C001-141",
"6293d300ab46a6d6135ed256005403-C001-142",
"6293d300ab46a6d6135ed256005403-C001-143"
],
[
"6293d300ab46a6d6135ed256005403-C001-207",
"6293d300ab46a6d6135ed256005403-C001-208",
"6293d300ab46a6d6135ed256005403-C001-209",
"6293d300ab46a6d6135ed256005403-C001-210",
"6293d300ab46a6d6135ed256005403-C001-211",
"6293d300ab46a6d6135ed256005403-C001-212"
]
],
"cite_sentences": [
"6293d300ab46a6d6135ed256005403-C001-29",
"6293d300ab46a6d6135ed256005403-C001-98",
"6293d300ab46a6d6135ed256005403-C001-141",
"6293d300ab46a6d6135ed256005403-C001-207"
]
},
"@DIF@": {
"gold_contexts": [
[
"6293d300ab46a6d6135ed256005403-C001-29",
"6293d300ab46a6d6135ed256005403-C001-30",
"6293d300ab46a6d6135ed256005403-C001-31",
"6293d300ab46a6d6135ed256005403-C001-32",
"6293d300ab46a6d6135ed256005403-C001-33"
],
[
"6293d300ab46a6d6135ed256005403-C001-45",
"6293d300ab46a6d6135ed256005403-C001-46",
"6293d300ab46a6d6135ed256005403-C001-47",
"6293d300ab46a6d6135ed256005403-C001-48",
"6293d300ab46a6d6135ed256005403-C001-49",
"6293d300ab46a6d6135ed256005403-C001-50",
"6293d300ab46a6d6135ed256005403-C001-51"
],
[
"6293d300ab46a6d6135ed256005403-C001-100",
"6293d300ab46a6d6135ed256005403-C001-98",
"6293d300ab46a6d6135ed256005403-C001-99"
],
[
"6293d300ab46a6d6135ed256005403-C001-106",
"6293d300ab46a6d6135ed256005403-C001-107",
"6293d300ab46a6d6135ed256005403-C001-109"
],
[
"6293d300ab46a6d6135ed256005403-C001-141",
"6293d300ab46a6d6135ed256005403-C001-142",
"6293d300ab46a6d6135ed256005403-C001-143"
],
[
"6293d300ab46a6d6135ed256005403-C001-150",
"6293d300ab46a6d6135ed256005403-C001-151"
]
],
"cite_sentences": [
"6293d300ab46a6d6135ed256005403-C001-29",
"6293d300ab46a6d6135ed256005403-C001-45",
"6293d300ab46a6d6135ed256005403-C001-98",
"6293d300ab46a6d6135ed256005403-C001-141",
"6293d300ab46a6d6135ed256005403-C001-150"
]
}
}
},
"ABC_9684063f991f9a4688d6530fe5a16c_6": {
"x": [
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-194",
"text": "In contrast, lower performance was achieved without the difference."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-220",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-221",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-2",
"text": "Considering the importance of public speaking skills, a system that can predict where audiences might laugh during a talk can be helpful to a person preparing for a presentation."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-3",
"text": "We investigated the possibility that a state-of-the-art humor recognition system could be used to detect sentences that induce laughters."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-4",
"text": "In this study, we used TED talks and audience laughters during those talks as data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-5",
"text": "Our results showed that the state-of-the-art system needs to be improved in order to be used in a practical application."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-6",
"text": "In addition, our analysis showed that classifying humorous sentences in talks is very challenging due to the close similarity between humorous and nonhumorous sentences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-9",
"text": "Public speaking is an important skill for delivering knowledge or opinions to public audiences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-10",
"text": "In order to develop a successful talk, it is common to practice presentations, with colleagues acting as simulated audiences who then offer their feedback."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-11",
"text": "A recent focus on the importance of public speaking led various studies (Batrinca et al., 2013; Kurihara et al., 2007; Nguyen et al., 2012) to develop systems for automatically evaluating public speaking skills."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-12",
"text": "These studies used audio and video cues in order to evaluate the overall aspects of public speaking."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-13",
"text": "However, the collection of human evaluation data for such systems is time-consuming and challenging (Chen et al., 2014) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-14",
"text": "If it is shown to be easier to collect audiences' reactions, it may also make sense to explore building an automated system which provides expected audience reactions."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-15",
"text": "For example, speakers sometimes try to add sentences that make audiences laugh or applaud in order to make a successful talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-16",
"text": "As Gruner (1985) said, humor in public speakings will \"produce a more favorable reaction toward a speaker\" and \"enhance speaker image.\" However, there is no guarantee that the expected reactions would occur in an actual talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-17",
"text": "If an automatic system can provide audience reactions which are likely to occur in actual talks, it will be helpful in the process of preparing a talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-18",
"text": "In this study, we investigated the feasibility of current NLP technologies in building a system which provides expected audience reactions to public speaking."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-19",
"text": "Studies on automatic humor recognition (Mihalcea and Strapparava, 2005; Yang et al., 2015; Zhang and Liu, 2014; Purandare and Litman, 2006 ) have defined the recognition task as a binary classification task."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-20",
"text": "So, their classification models categorized a given sentence as a humorous or non-humorous sentence."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-21",
"text": "Among the studies on humor classification, Mihalcea and Strapparava (2005) and Yang et al. (2015) reported high performance on the task."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-22",
"text": "Considering the performance of their systems, it is reasonable to test the applicability of their models to a real application."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-23",
"text": "In this study, we specifically applied a state-of-the-art automatic humor recognition model to talks and investigated if the model could be used to provide simulated laughters."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-24",
"text": "In our application of the state-of-art system to talks, we could not achieve a comparable performance to the reported performance of the system."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-25",
"text": "We investigated the potential reasons for the performance difference through further analysis."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-26",
"text": "Some humor classification studies (Mihalcea and Strapparava, 2005; Yang et al., 2015; Barbieri and Saggion, 2014 ) have used negative instances from different domains or topics, because non-humorous sentences could not be found or are very challenging to collect in target domains or topics."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-27",
"text": "Their studies showed that it was possible to achieve promising performance using data from heterogeneous domains."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-28",
"text": "However, our study showed that humorous sentences which were semantically close to non-humorous sentences were very challenging to distinguish."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-29",
"text": "We first describe previous studies related to our study."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-30",
"text": "Then, the data we used is described."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-31",
"text": "The descriptions of our experiments and results follow."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-32",
"text": "Our first experiment was to apply a state-of-the-art humor classification system to talks."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-33",
"text": "We then conducted additional experiments and analysis in order to see the impact of domain differences on humor classification tasks."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-35",
"text": "**BACKGROUND**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-36",
"text": "Previous studies (Mihalcea and Strapparava, 2005; Yang et al., 2015; Zhang and Liu, 2014; Purandare and Litman, 2006; Bertero and Fung, 2016) dealt with the humor recognition task as a binary classification task, which was to categorize a given text as humorous or non-humorous."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-37",
"text": "These studies collected textual data which consisted of humorous texts and non-humorous texts and built a classification model using textual features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-38",
"text": "Humorous and non-humorous texts were from different domains across the studies."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-39",
"text": "Pun websites, daily joke websites, or tweets were used as sources of humorous texts."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-40",
"text": "Resources such as news websites, proverb websites, etc. were used as sources of non-humorous texts."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-94",
"text": "The numbers of humorous and nonhumorous sentences were 5,801 (3%) and 168,974 (97%), respectively."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-41",
"text": "Yang et al. (2015) tried to minimize genre differences between humorous and non-humorous texts in order to avoid a chance that a trained model was optimized to distinguish genre differences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-42",
"text": "Barbieri and Saggion (2014) examined cross-domain application of humor detection systems using Twitter data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-43",
"text": "For example, they trained a model using tweets with '#humor' and '#education' hashtags and evaluated the performance of the model on evaluation data containing tweets with '#humor' and '#politics' hashtags."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-44",
"text": "They also reported promising performance in the cross-domain application."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-45",
"text": "These studies which used data from different domains or topics reported very high performance -around 80% accuracy."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-46",
"text": "Distinct from the other studies, Purandare and Litman (2006) used data from a single domain, the famous TV series, Friends."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-47",
"text": "In their study, the target task was to categorize a speaker's turn as humorous or non-humorous."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-48",
"text": "Speakers' turns which occurred right before simulated laughters were defined as humorous ones and the other turns as non-humorous ones."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-49",
"text": "Another difference from other studies was that their study used speakers' acoustic characteristics as features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-50",
"text": "Their study reported low performance of around 0.600 accuracy for the classification task."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-51",
"text": "Bertero and Fung (2016) pursued similar hypothesis to Purandare and Litman (2006) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-52",
"text": "In their study, the target task was to categorize an utterance in a sitcom, The Big Bang Theory, into those followed by laughters or not."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-53",
"text": "Their study was the first study where a deep learning algorithm was used for humor classification."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-54",
"text": "In this study, our target task was to categorize sentences in talk data into humorous and non-humorous sentences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-55",
"text": "We only examined textual features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-56",
"text": "Compared to previous studies, one innovation of this study was that a trained model was evaluated using humorous and non-humorous sentences from the same genre and same topic."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-57",
"text": "Mihalcea and Strapparava (2005) and Yang et al. (2015) borrowed negative instances from different genres such as news websites or proverbs."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-58",
"text": "Barbieri and Saggion (2014) borrowed negative instances from different topics among tweets, though both their positive and negative instances came from the same genre, tweets."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-59",
"text": "Our talk data, on the other hand, was distinct from Barbieri and Saggion (2014) in that negative instances were selected from the same talks as positive instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-60",
"text": "As a result, negative instances were inherently from the same topic as corresponding positive instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-61",
"text": "In addition, we used real audience reactions (audience laughters) in building our data set."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-62",
"text": "So, the task of this study was to categorize sentences into sentences which made audiences laugh or not, in a talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-64",
"text": "**DATA AND FEATURES**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-66",
"text": "**PUN OF DAY DATA**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-67",
"text": "Yang et al. (2015) collected a corpus of Pun of Day data 1 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-68",
"text": "The data consisted of 2,423 humorous (positive) texts and 2,403 non-humorous (negative) texts."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-69",
"text": "The humorous texts were from the Pun of the Day website, and the negative texts from AP News2, New York Times, Yahoo! Answers and Proverb websites."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-70",
"text": "Examples of humorous and non-humorous sentences are given below."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-71",
"text": "Humorous The one who invented the door knocker got a No-bell prize."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-72",
"text": "Non-Humorous The one who discovered/invented it had the last name of fahrenheit."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-73",
"text": "In order to reduce the differences between positive and negative instances in the data, Yang et al. (2015) used two constraints when collecting negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-74",
"text": "Non-humorous texts were required to have lengths between the minimum and maximum lengths of positive instances, in order to be selected as negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-75",
"text": "In addition, only non-humorous texts which consisted of words found in positive instances were collected."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-77",
"text": "**TED TALK DATA**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-78",
"text": "TED Talks 2 are recordings from TED conferences, and other special TED programs."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-79",
"text": "Corresponding transcripts of most TED Talks are available online."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-80",
"text": "We used the transcripts of the talks as data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-81",
"text": "Most transcripts of the talks contain the markup '(Laughter)', which represents where audiences laughed aloud during the talks."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-82",
"text": "In addition, time stamps are available in the transcripts."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-83",
"text": "An example transcription is given below 3 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-84",
"text": "1:14 ...My mother said that she thought I'd really rather have a blue balloon. But I said that I definitely wanted the pink one. And she reminded me that my favorite color was blue."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-85",
"text": "The fact that my favorite color now is blue, but I'm still gay --(Laughter) --is evidence of both my mother's influence and its limits."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-86",
"text": "1:57 (Laughter) 2:06 When I was little, my mother used to say, ... After collecting TED Talk transcripts 4 , we manually cleaned up the data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-87",
"text": "First, we removed transcripts of talks which contained performance like dance or music (e.g. http://www.ted.com/ talks/a_choir_as_big_as_the_internet)."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-88",
"text": "Then, transcripts without '(Laughter)' markups were removed."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-89",
"text": "Other transcripts which we excluded were talks in languages other than English."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-90",
"text": "After the cleaning, the final remaining data set contained 1,192 transcripts."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-91",
"text": "Following the manual cleaning, we split the transcripts into sentences using the Stanford CoreNLP tool (Manning et al., 2014) , then categorized the sentences into humorous and non-humorous sentences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-92",
"text": "Humorous sentences were sentences which contained or were immediately followed by '(Laughter)'."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-93",
"text": "The other sentences were categorized as non-humorous sentences."
},
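As a minimal sketch of this labeling step (the function and variable names here are illustrative, not the authors'; we assume the transcripts have already been split into sentences):

```python
LAUGHTER = "(Laughter)"

def label_sentences(sentences):
    """Label a sentence 1 (humorous) if it contains '(Laughter)' or is
    immediately followed by a sentence starting with it, else 0 (non-humorous)."""
    labels = []
    for i, sent in enumerate(sentences):
        nxt = sentences[i + 1] if i + 1 < len(sentences) else ""
        humorous = LAUGHTER in sent or nxt.strip().startswith(LAUGHTER)
        labels.append(1 if humorous else 0)
    return labels

sents = [
    "And she reminded me that my favorite color was blue.",
    "But I'm still gay -- (Laughter) -- evidence of my mother's influence and its limits.",
    "When I was little, my mother used to say things.",
]
print(label_sentences(sents))  # [0, 1, 0]
```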
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-95",
"text": "When giving a talk, a speaker can induce laughters using means other than language, such as silly gestures."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-96",
"text": "For example, audiences laughed after the sentence 'But, check this out.' in a TED Talk video because the speaker showed a funny picture."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-97",
"text": "We tried to include only humorous sentences where the language alone induced laughters, because we only used textual features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-98",
"text": "In selecting humorous sentences, we used a simple heuristic."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-99",
"text": "When laughters occurred after a very short sentence which consisted of fewer than seven words, it was likely that the laughters were due to something other than the sentence itself."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-100",
"text": "'Pun of the Day' data can provide indirect support for our threshold because the humorous content of 'Pun of the Day' data is solely textual."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-101",
"text": "The average length of 'Pun of the Day' data was 14 words, with a standard deviation of 5."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-102",
"text": "The number of humorous sentences left after removing sentences with fewer than seven words was 4,726."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-103",
"text": "Utilizing the same experimental setup as Mihalcea and Strapparava (2005) and Yang et al. (2015) (50% positive and 50% negative instances), we selected 4,726 sentences from among all collected nonhumorous sentences as negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-104",
"text": "During selection, we minimized differences between positive and negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-105",
"text": "A negative instance was selected from among sentences located close to a positive instance in a talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-106",
"text": "We made a candidate set of non-humorous sentences using sentences within a window size of seven (e.g. from sent-7 to sent-1 and from sent+1 to sent+7 in the following): sent-7 . . . . . ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-107",
"text": "sent-1 And she reminded me that my favorite color was blue."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-108",
"text": "Humorous The fact that my favorite color now is blue, but I'm still gay is evidence of both my mother's influence and its limits."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-109",
"text": "sent+1 When I was little, my mother used to say, ... . . ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-110",
"text": "sent+7 . . ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-111",
"text": "Among the candidates, sentences which consisted of less than seven words were removed and a negative instance was randomly selected among the remaining ones."
},
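The windowed negative-sampling procedure might be sketched as follows (names and the random-selection details are our own assumptions):

```python
import random

def sample_negative(sentences, labels, pos_idx, window=7, min_words=7, rng=None):
    """Pick one non-humorous sentence with at least `min_words` words from
    within `window` sentences before or after the humorous sentence at pos_idx."""
    rng = rng or random.Random(0)
    lo, hi = max(0, pos_idx - window), min(len(sentences), pos_idx + window + 1)
    candidates = [
        sentences[i] for i in range(lo, hi)
        if i != pos_idx and labels[i] == 0 and len(sentences[i].split()) >= min_words
    ]
    return rng.choice(candidates) if candidates else None

sents = ["Too short.",
         "And she reminded me that my favorite color was blue.",
         "THE HUMOROUS SENTENCE", "Tiny.", "Also short."]
neg = sample_negative(sents, [0, 0, 1, 0, 0], pos_idx=2)
```

Here only the second sentence passes the seven-word filter, so it is the one selected.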
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-113",
"text": "**IMPLEMENTATION OF FEATURES**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-114",
"text": "Features from Yang et al. (2015) , which we implemented, consisted of (1) two incongruity features, (2) six ambiguity features, (3) four interpersonal effect features, (4) four phonetic features, (5) five k-Nearest Neighbor features, and (6) 300 Word2Vec features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-115",
"text": "The total number of features used in this study was 321."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-116",
"text": "We describe our implementation of the features in this section."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-117",
"text": "The justifications for the features can be found in the original paper."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-118",
"text": "Incongruity Features: the existence of incongruous or incompatible words in a text can cause laughters (e.g. A clean desk is a sign of a cluttered desk drawer."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-119",
"text": "(Mihalcea and Strapparava, 2005) )."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-120",
"text": "We calculated meaning distances of all word pairs in a sentence using a Word2Vec implementation in Python 5 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-121",
"text": "The maximum and minimum meaning distances among the calculated distances in a sentence were used as two incongruity features."
},
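A rough sketch of the incongruity features follows; the tiny embedding table here merely stands in for a trained Word2Vec model, and cosine distance is assumed as the meaning-distance measure:

```python
import itertools
import numpy as np

# Toy vectors standing in for pretrained Word2Vec embeddings (assumption:
# the real system looks words up in a trained model instead).
emb = {
    "clean": np.array([1.0, 0.1]),
    "desk": np.array([0.9, 0.2]),
    "cluttered": np.array([-1.0, 0.3]),
}

def cosine_distance(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def incongruity_features(words):
    """Max and min pairwise meaning distance over in-vocabulary words."""
    vocab = sorted({w for w in words if w in emb})
    dists = [cosine_distance(emb[a], emb[b])
             for a, b in itertools.combinations(vocab, 2)]
    return max(dists), min(dists)

mx, mn = incongruity_features(
    "a clean desk is a sign of a cluttered desk drawer".split())
```

The incongruous pair (clean, cluttered) yields the maximum distance, while the compatible pair (clean, desk) yields the minimum.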
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-122",
"text": "Ambiguity Features: the use of ambiguous words in a sentence can also trigger humorous effects (i.e. A political prisoner is one who stands behind her convictions."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-123",
"text": "(Miller and Gurevych, 2015) )."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-124",
"text": "We calculated sense combinations of nouns, verbs, adjectives and adverbs."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-125",
"text": "We made four groups, composed of the nouns, verbs, adjectives and adverbs in a sentence, respectively."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-126",
"text": "Then, we collected counts of possible meanings of each word in each group from WordNet (Fellbaum, 1998) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-127",
"text": "For example, when two nouns in a sentence have two and three different meanings in WordNet, the sense combination of the noun group was 1.792 (log(2 \u00d7 3))."
},
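The sense-combination computation can be sketched as below; the hard-coded sense counts are only a stand-in for the counts the paper obtains by querying WordNet:

```python
import math

# Toy sense counts standing in for WordNet synset counts (assumption:
# the real feature queries WordNet for each word's number of senses).
sense_counts = {"prisoner": 2, "conviction": 3}

def sense_combination(words, counts):
    """Log of the product of per-word sense counts, e.g. log(2 x 3) = 1.792."""
    ns = [counts[w] for w in words if counts.get(w, 0) > 0]
    return math.log(math.prod(ns)) if ns else 0.0

score = sense_combination(["prisoner", "conviction"], sense_counts)
```

With two and three senses, this reproduces the paper's worked value of log(2 x 3) = 1.792.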
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-128",
"text": "We also calculated the largest and smallest WordNet Path Similarity values of pairs of words in a sentence using a Python interface for WordNet 6 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-129",
"text": "Interpersonal Effect Features: sentences can be humorous when sentences contain strong sentiment or subjectivity words (Zhang and Liu, 2014) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-130",
"text": "In TED Talk data, some positive instances also contain strong sentiment words (i.e. Then, just staying above the Earth for one more second, people are acting like idiots all across the country.) We extracted the number of occurrences of all negative (positive) polarity words and the number of weak (strong) subjectivity words using the word association resource from Wilson et al. (2005) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-131",
"text": "Phonetic Style: phonetic properties such as alliteration and rhyme can make people laugh (i.e. Infants don't enjoy infancy like adults do adultery."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-132",
"text": "(Mihalcea and Strapparava, 2005) Pronuncing Dictionary, we extracted the number of alliteration chains in a sentence, the maximum length of alliteration chains, the number of rhyme chains, and the maximum length of rhyme chains."
},
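A crude sketch of the chain extraction is given below. It simplifies heavily: a word's first letter stands in for its first phoneme from the pronouncing dictionary, and chains are taken to be consecutive runs, which may differ from the original implementation:

```python
from itertools import groupby

def alliteration_chains(words):
    """Count maximal runs of two or more consecutive words sharing an
    initial sound (first letter used here as a crude phoneme stand-in)."""
    runs = [len(list(g)) for _, g in groupby(w[0].lower() for w in words if w)]
    chains = [r for r in runs if r >= 2]
    return len(chains), max(chains, default=0)

num_chains, longest = alliteration_chains(
    "peter piper picked a peck of pickled peppers".split())
```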
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-133",
"text": "k-Nearest Neighbors Features: We used unigram feature vectors with a k-nearest neighbor algorithm in calculating these features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-134",
"text": "When a sentence is given, we retrieved labels of the five nearest neighbors in a k-nearest neighbor model using euclidean distance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-135",
"text": "The five labels were used as features."
},
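These features can be sketched with a small self-contained nearest-neighbor routine over unigram count vectors (a real system might instead use scikit-learn; the vocabulary and data here are illustrative):

```python
import numpy as np

def unigram_vectors(sentences, vocab):
    # Bag-of-words count vector for each sentence over a fixed vocabulary.
    return np.array([[s.split().count(w) for w in vocab] for s in sentences],
                    dtype=float)

def knn_label_features(train_vecs, train_labels, query_vec, k=5):
    """Return the labels of the k nearest training sentences (Euclidean)."""
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)
    return [train_labels[i] for i in np.argsort(dists)[:k]]

vocab = ["funny", "news"]
train = unigram_vectors(
    ["funny funny", "funny", "news", "news news", "news news news"], vocab)
query = unigram_vectors(["funny"], vocab)[0]
features = knn_label_features(train, [1, 1, 0, 0, 0], query, k=5)
```

The five retrieved labels, ordered by distance, are used directly as binary features.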
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-136",
"text": "Word2Vec Features: we collected Word2Vec embeddings of words in a sentence, then used the average of the embeddings as a representation of the sentence."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-137",
"text": "We used the Google Word2Vec model 7 and the Gensim Python package (\u0158eh\u016f\u0159ek and Sojka, 2010)."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-139",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-141",
"text": "**APPLICATION OF STATE-OF-ART TECHNOLOGY TO TALK DATA**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-142",
"text": "In this section, we present expeirments that we ran to determine 1) how effective a model trained using 'Pun of Day' data (Pun) is when applied to TED Talk data (Talk), and 2) whether the performance of a model trained using Talk data would be similar to the performance reported in Yang et al. (2015) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-143",
"text": "We reimplemented features developed by Yang et al. (2015) and evaluated those features on Talk data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-144",
"text": "Considering the different characteristics of Talk data versus Pun data, we sought to investigate whether Yang's model could achieve the reported performance (over 85% accuracy) on our Talk data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-145",
"text": "The differences were 1) humorous sentences in Talk data were sentences which induced audience laughters, compared to Pun data which used canned textual humor, 2) all non-humorous sentences in Talk data were also from TED talks, and 3) each pair of humorous and non-humorous sentences were semantically close because they were closely placed."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-146",
"text": "These differences made the humor classification task more challenging."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-147",
"text": "We first validated the performance of the reimplemented features."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-148",
"text": "We followed the experimental setup of Yang et al. (2015) in order to see if the performance of our duplicated features was comparable to their reported performance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-149",
"text": "Their best performance was 85.4% accuracy (Yang in Table 1 ) when they used Random Forest as a classifier and 10-fold cross validation (CV) as an evaluation method."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-150",
"text": "Replicating this experiment setup, we were able to achieve 86.0% accuracy (Pun-to-Pun in Table 1 ), which is slightly better than the performance reported in their paper."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-151",
"text": "The performance difference could be due to the difference in partitions in CV."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-152",
"text": "After verifying the feature implementation, we built a humor recognition model using the entirety of the Pun data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-153",
"text": "The model was evaluated on Talk data in order to see how effective a state-of-art model was in spite of differences between the two data sets."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-154",
"text": "The accuracy was only 50.5% (Pun-to-Talk in Table 1 ) which is 0.5% higher than a majority class classifier."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-155",
"text": "The poor performance observed in this second experiment could be due to the differences between Pun and Talk data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-156",
"text": "Based on these experimental results, it can be said that a humor classification model trained using Pun data can't be directly used in categorizing humor sentences from talks."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-157",
"text": "The third experiment was designed to observe the performance of a model (Talk-to-Talk) built using Talk data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-158",
"text": "The Talk-to-Talk model was evaluated on Talk data using 10-fold CV."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-159",
"text": "When we split Talk data into train and test data in a CV fold, sources of sentences were used as a criteria in the split."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-160",
"text": "All humorous and non-humorous sentences from one talk belonged to either the train data or the test data, not both."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-161",
"text": "Table 2: The performance using combined data. 'Pos' and 'Neg' mean 'Positives' and 'Negatives'."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-163",
"text": "This criterion was adopted because sentences from a talk could share contexts and the shared contexts could boost performance."
},
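This talk-level split can be sketched as follows (a small hand-rolled equivalent of a grouped K-fold split; function names and the fold-assignment scheme are our own assumptions):

```python
import random

def talk_level_folds(talk_ids, n_folds=10, seed=0):
    """Assign whole talks to CV folds so that no talk contributes sentences
    to both the train and test side of any fold."""
    talks = sorted(set(talk_ids))
    random.Random(seed).shuffle(talks)
    fold_of = {t: i % n_folds for i, t in enumerate(talks)}
    return [fold_of[t] for t in talk_ids]

# One fold id per sentence; sentences from the same talk share a fold.
folds = talk_level_folds(["talkA", "talkA", "talkB", "talkC", "talkB"], n_folds=2)
```

Library implementations such as scikit-learn's GroupKFold provide the same guarantee.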
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-164",
"text": "Using the model, we got 53.2% accuracy (Talk-to-Talk in Table 1 )."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-165",
"text": "Thus, we observed a 3% increase in accuracy and 10% increase in F1 score, when compared with the Pun-to-Talk model."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-166",
"text": "But, the performance was still poorer than Yang's reported performance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-167",
"text": "The model trained on Talk data showed a preference for categorizing instances in evaluation data into humorous instances, according to the precision and recall values of Talk-to-Talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-169",
"text": "**CROSS DOMAIN DATA COMBINATIONS**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-170",
"text": "In the experiments described in the preceding section, we weren't able to get results comparable to Yang et al. (2015) when Talk data was used in both train and evaluation data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-171",
"text": "The results of our experiments raised questions about why two different results were observed for two different data sets."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-172",
"text": "A major difference in the two data sets was the source of negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-173",
"text": "Yang et al. (2015) borrowed negative instances from different genres such as news websites and proverbs. But, in Talk-to-Talk, both positive and negative instances were from the same genre."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-174",
"text": "Furthermore, each humorous instance had a corresponding non-humorous instance from the same talk."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-175",
"text": "In this section, we investigate the impact of genre differences in the humor classification task, using Pun and Talk data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-176",
"text": "The positive instances (humorous sentences) in the Talk data may be substantially different from the ones found in Pun 8 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-177",
"text": "Humorous sentences in the Pun data set are 'self-contained'."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-178",
"text": "It means that the point of humor can be understood within a single sentence."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-179",
"text": "On the other hand, the humorous sentences in the Talk data set may be 'discourse-based', which means that the source of humor in target sentences might be understood in the wider context of the speaker's performance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-180",
"text": "In addition, negative instances of Talk data may also be 'discourse-based', which means that the wider context can be required to understand the sentences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-181",
"text": "However, the negatives in the Pun data are not 'discourse-based'."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-182",
"text": "It is worth investigating whether the 'discourse-based' characteristics of the Talk data made it impossible to achieve high performance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-183",
"text": "So, we combined 'discourse-based' instances with 'self-contained' instances and checked if we could achieve high performance using the combined data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-184",
"text": "We built two data sets combining positives of Talk and negatives of Pun ('Talk Pos + Pun Neg'), and positives of Pun and negatives of Talk ('Pun Pos + Talk Neg') in order to make data sets containing positives and negatives from different genres."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-185",
"text": "When we trained and evaluated 'Talk Pos + Pun Neg' and 'Pun Pos + Talk Neg' models using 10-fold CV, we could achieve 82.5% and 83.6% accuracies which were similar to Pun-to-Pun performance as observed in Table 2 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-186",
"text": "In both cases of 'Pun Pos + Talk Neg' and 'Talk Pos + Pun Neg', we didn't observe a significant drop in performance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-187",
"text": "We assumed that the 'discourse-based' characteristics of Talk data were difficult to learn, based on the low performance of 'Talk-to-Talk' in Table 1. When we looked through the humorous instances of Talk data, we observed 'discourse-based' humorous cases which could be difficult to capture using Yang's features (e.g. \"this was the worst month of my life\", \"and I said well that would be great\", and \"so I wanted to follow that rule\")."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-188",
"text": "Of particular interest, we still observed precision and recall as high as 82.7% and 85.8%, respectively."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-189",
"text": "The high performance without a significant drop was counter-intuitive."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-190",
"text": "This observation raised the question of what exactly classifiers learned using the data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-192",
"text": "**DISCUSSION**"
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-193",
"text": "Through our experiments, we observed higher performances when genre difference existed between positive and negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-195",
"text": "Our hypothesis for the cause of this phenomenon was the semantic distance between positive and negative data points."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-196",
"text": "Negative instances from Talk data were selected from among sentences within seven preceding and following sentences of positive instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-197",
"text": "So, the meaning of a negative instance would be close to the meaning of a corresponding positive instance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-198",
"text": "By contrast, the meaning of Pun positives would be quite different from that of Pun negatives, because they came from different genres, even though words were shared between the positives and negatives of Pun."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-199",
"text": "Recently, Li et al. (2016) and Arras et al. (2016) showed that it is possible to understand predictions of NLP models by visualizing word embeddings."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-200",
"text": "Following those studies, we also tried to gauge the plausibility of our hypothesis by visualizing the Word2Vec embedding features that we used in our experiments."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-201",
"text": "We used the average of Word2Vec embeddings of words in a sentence as a representation of the sentence."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-202",
"text": "We visualized sentence representations using t-SNE (van der Maaten and Hinton, 2008) ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-203",
"text": "As shown in Figure 1a , meanings of Pun positives and negatives were grouped in distinct areas."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-204",
"text": "Pun positives and negatives were positioned at the right bottom area and left upper area, respectively."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-205",
"text": "The combination of Talk positives and Pun negatives was another case with a clearer meaning distinction between positive and negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-206",
"text": "In the case of the combination of Pun positives and Talk negatives, the distinction was weaker but one can still identify a small group of negatives at the upper left and the somewhat more dispersed group of positives at the bottom right."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-207",
"text": "However, Talk positives and negatives were completely mixed throughout."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-208",
"text": "So, it was impossible to distinguish the groups of positives and negatives."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-209",
"text": "This analysis provided clues to the high performances of 'Pun-to-Pun' in Table 1 , and 'Talk Pos + Pun Neg' and 'Pun Pos + Talk Neg' in Table 2 , as well as the low performance of 'Talk-to-Talk' in Table 1 ."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-210",
"text": "The high-performance data sets were much more learnable than 'Talk-to-Talk', based on the above observations about the separability of each data set's representations."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-211",
"text": "Another analysis we conducted concerned the impact of the closeness of the negatives in the Talk data."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-212",
"text": "We selected a negative instance within seven preceding and following sentences of a positive instance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-213",
"text": "Positive instances of Talk data could be punchlines which brought about the audience's laughter after a laughable mood was built up through the preceding sentences."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-214",
"text": "In other words, the preceding sentences could also be humorous, but not humorous enough to cause laughter."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-215",
"text": "When slightly humorous sentences are included in negative instances, the poor performance of 'Talk-to-Talk' is reasonable because it is very challenging to distinguish humorous sentences from less humorous sentences, even for humans."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-216",
"text": "So, we conducted another experiment after randomly choosing a negative instance from among all sentences that did not cause laughter within the talk of a positive instance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-217",
"text": "Then, we trained and evaluated models using 10-fold CV."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-218",
"text": "In this experiment, we obtained 55.4% accuracy, which was only 2% higher than 'Talk-to-Talk' in Table 1."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-219",
"text": "This further analysis provides supporting evidence that humor detection in a talk is a challenging task, irrespective of the textual distance between positive and negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-222",
"text": "In this study, we investigated whether a state-of-the-art humor recognition model could be used to simulate audience laughter in talks."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-223",
"text": "Our results showed that substantial improvement in the humor recognition task would be needed before it could be used in real applications."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-224",
"text": "In addition, we showed through visualization of the features that the Talk data is much more difficult for a machine to learn, due to the featural closeness of positive and negative instances."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-225",
"text": "We plan to develop discourse-level features in order to improve performance."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-226",
"text": "Humorous sentences in TED talks are parts of talks."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-227",
"text": "The sentences preceding a humorous sentence construct its context."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-228",
"text": "The combination of a humorous sentence's content and the established context can lead to laughter."
},
{
"sent_id": "9684063f991f9a4688d6530fe5a16c-C001-229",
"text": "We will investigate this conceptual possibility in future work."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"9684063f991f9a4688d6530fe5a16c-C001-19",
"9684063f991f9a4688d6530fe5a16c-C001-20"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-21",
"9684063f991f9a4688d6530fe5a16c-C001-22",
"9684063f991f9a4688d6530fe5a16c-C001-23"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-24",
"9684063f991f9a4688d6530fe5a16c-C001-25",
"9684063f991f9a4688d6530fe5a16c-C001-26",
"9684063f991f9a4688d6530fe5a16c-C001-27",
"9684063f991f9a4688d6530fe5a16c-C001-28"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-36",
"9684063f991f9a4688d6530fe5a16c-C001-37",
"9684063f991f9a4688d6530fe5a16c-C001-38",
"9684063f991f9a4688d6530fe5a16c-C001-39",
"9684063f991f9a4688d6530fe5a16c-C001-40"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-41"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-67",
"9684063f991f9a4688d6530fe5a16c-C001-68",
"9684063f991f9a4688d6530fe5a16c-C001-69",
"9684063f991f9a4688d6530fe5a16c-C001-70",
"9684063f991f9a4688d6530fe5a16c-C001-71",
"9684063f991f9a4688d6530fe5a16c-C001-72"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-73",
"9684063f991f9a4688d6530fe5a16c-C001-74",
"9684063f991f9a4688d6530fe5a16c-C001-75"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-142"
]
],
"cite_sentences": [
"9684063f991f9a4688d6530fe5a16c-C001-19",
"9684063f991f9a4688d6530fe5a16c-C001-21",
"9684063f991f9a4688d6530fe5a16c-C001-26",
"9684063f991f9a4688d6530fe5a16c-C001-36",
"9684063f991f9a4688d6530fe5a16c-C001-41",
"9684063f991f9a4688d6530fe5a16c-C001-67",
"9684063f991f9a4688d6530fe5a16c-C001-73",
"9684063f991f9a4688d6530fe5a16c-C001-142"
]
},
"@MOT@": {
"gold_contexts": [
[
"9684063f991f9a4688d6530fe5a16c-C001-21",
"9684063f991f9a4688d6530fe5a16c-C001-22",
"9684063f991f9a4688d6530fe5a16c-C001-23"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-24",
"9684063f991f9a4688d6530fe5a16c-C001-25",
"9684063f991f9a4688d6530fe5a16c-C001-26",
"9684063f991f9a4688d6530fe5a16c-C001-27",
"9684063f991f9a4688d6530fe5a16c-C001-28"
]
],
"cite_sentences": [
"9684063f991f9a4688d6530fe5a16c-C001-21",
"9684063f991f9a4688d6530fe5a16c-C001-26"
]
},
"@DIF@": {
"gold_contexts": [
[
"9684063f991f9a4688d6530fe5a16c-C001-24",
"9684063f991f9a4688d6530fe5a16c-C001-25",
"9684063f991f9a4688d6530fe5a16c-C001-26",
"9684063f991f9a4688d6530fe5a16c-C001-27",
"9684063f991f9a4688d6530fe5a16c-C001-28"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-56",
"9684063f991f9a4688d6530fe5a16c-C001-57"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-143",
"9684063f991f9a4688d6530fe5a16c-C001-144",
"9684063f991f9a4688d6530fe5a16c-C001-145",
"9684063f991f9a4688d6530fe5a16c-C001-146"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-148",
"9684063f991f9a4688d6530fe5a16c-C001-149",
"9684063f991f9a4688d6530fe5a16c-C001-150",
"9684063f991f9a4688d6530fe5a16c-C001-151"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-172",
"9684063f991f9a4688d6530fe5a16c-C001-173",
"9684063f991f9a4688d6530fe5a16c-C001-174",
"9684063f991f9a4688d6530fe5a16c-C001-175"
]
],
"cite_sentences": [
"9684063f991f9a4688d6530fe5a16c-C001-26",
"9684063f991f9a4688d6530fe5a16c-C001-57",
"9684063f991f9a4688d6530fe5a16c-C001-143",
"9684063f991f9a4688d6530fe5a16c-C001-148",
"9684063f991f9a4688d6530fe5a16c-C001-173"
]
},
"@USE@": {
"gold_contexts": [
[
"9684063f991f9a4688d6530fe5a16c-C001-103",
"9684063f991f9a4688d6530fe5a16c-C001-104",
"9684063f991f9a4688d6530fe5a16c-C001-105"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-114"
],
[
"9684063f991f9a4688d6530fe5a16c-C001-148",
"9684063f991f9a4688d6530fe5a16c-C001-149",
"9684063f991f9a4688d6530fe5a16c-C001-150",
"9684063f991f9a4688d6530fe5a16c-C001-151"
]
],
"cite_sentences": [
"9684063f991f9a4688d6530fe5a16c-C001-103",
"9684063f991f9a4688d6530fe5a16c-C001-114",
"9684063f991f9a4688d6530fe5a16c-C001-148"
]
},
"@EXT@": {
"gold_contexts": [
[
"9684063f991f9a4688d6530fe5a16c-C001-143",
"9684063f991f9a4688d6530fe5a16c-C001-144",
"9684063f991f9a4688d6530fe5a16c-C001-145",
"9684063f991f9a4688d6530fe5a16c-C001-146"
]
],
"cite_sentences": [
"9684063f991f9a4688d6530fe5a16c-C001-143"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"9684063f991f9a4688d6530fe5a16c-C001-170",
"9684063f991f9a4688d6530fe5a16c-C001-171"
]
],
"cite_sentences": [
"9684063f991f9a4688d6530fe5a16c-C001-170"
]
}
}
},
"ABC_74cd12a801d1f8a95f8898a8cef9c0_6": {
"x": [
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-2",
"text": "Sentiment Analysis and other semantic tasks are commonly used for social media textual analysis, to gauge public opinion and make sense of the noise on social media."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-3",
"text": "The language used on social media not only commonly diverges from formal language, but is further compounded by codemixing between languages, especially in large multilingual societies like India."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-4",
"text": "Traditional methods for learning semantic NLP tasks have long relied on end-to-end task-specific training, requiring an expensive data creation process, even more so for deep learning methods."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-5",
"text": "This challenge is even more severe for resource-scarce texts like codemixed language pairs, which lack well-learnt representations as model priors, and whose task-specific datasets can be too few and too small to efficiently exploit recent deep learning approaches."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-6",
"text": "To address above challenges, we introduce curriculum learning strategies for semantic tasks in codemixed Hindi-English (Hi-En) texts, and investigate various training strategies for enhancing model performance."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-7",
"text": "Our method outperforms the state-of-the-art methods for Hi-En codemixed sentiment analysis by 3.31% accuracy, and also shows better model robustness in terms of convergence and variance in test performance."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-10",
"text": "Codemixing is the phenomenon of intermixing linguistic units from two or more languages in a single utterance, and is especially widespread in multilingual societies across the world [Muysken et al., 2000] ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-11",
"text": "With increasing internet access for such large populations of multilingual speakers, there is active ongoing research on processing codemixed texts in online social media communities such as Twitter and Facebook [Singh et al., 2018a; Prabhu et al., 2016]."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-12",
"text": "Not only do these texts contain a diverse variety of language use spanning the formal and colloquial spectra, such texts also pose challenging problems such as out-of-vocabulary words, slang, grammatical switching and structural inconsistencies."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-13",
"text": "(* This work was presented at the 2nd Workshop on Humanizing AI (HAI) at IJCAI'19 in Macao, China. \u2020 Contact Author)"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-14",
"text": "Previously, various approaches [Prabhu et al., 2016; Jhanwar and Das, 2018; Singh et al., 2018a; Singh et al., 2018b] have focused on task specific datasets and learning architectures for syntactic and semantic processing for codemixed texts."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-15",
"text": "This has facilitated the development of various syntactic and semantic task-specific datasets and neural architectures, but has been limited by the expensive annotation effort."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-16",
"text": "As a result, while these efforts have enabled processing of codemixed texts, they still suffer from data scarcity and poor representation learning, and the small individual dataset sizes usually limiting the model performance."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-17",
"text": "Curriculum Learning, as introduced by [Bengio et al., 2009], is \"to start small, learn easier aspects of the task or easier subtasks, and then gradually increase the difficulty level\"."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-18",
"text": "They also draw parallels with the human learning curriculum and education system, where different concepts are introduced in an order at different times, and this perspective has also advanced research on animal training [Krueger and Dayan, 2009]."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-19",
"text": "Previous experiments with tasks like language modelling [Bengio et al., 2009] , Dependency Parsing, and entailment [Hashimoto et al., 2016] have shown faster convergence and performance gains by following a curriculum training regimen in the order of increasingly complicated syntactic and semantic tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-20",
"text": "[Weinshall and Cohen, 2018 ] also find theoretical and experimental evidence for curriculum learning by pretraining on another task leading to faster convergence."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-21",
"text": "With this purview, we propose a syntactico-semantic curriculum training strategy for Hi-En codemixed twitter sentiment analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-22",
"text": "We explore various pretraining strategies encompassing Language Identification, Part of Speech Tagging, and Language Modelling in different configurations."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-23",
"text": "We investigate the role of different transfer learning strategies by changing learning rates and gradient freezing to prevent catastrophic forgetting and interference between source and target tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-24",
"text": "We also propose a new model for codemixed sentiment analysis based on character trigram sequences and pooling over time for representation learning."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-25",
"text": "We investigate the convergence rate and model performance across various learning strategies, and find faster model convergence and performance gains on the test set."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-27",
"text": "**ARXIV:1906.07382V1 [CS.CL] 18 JUN 2019**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-28",
"text": "Research on semantic and syntactic processing of codemixed texts has increasingly gained attention, and various approaches have been proposed to this end."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-29",
"text": "[Prabhu et al., 2016] released a dataset comprising user comments on Facebook pages, and proposed an approach using convolutions over character embeddings for sentiment analysis of Hi-En codemixed texts."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-30",
"text": "[Jhanwar and Das, 2018] propose a character trigram approach coupled with an ensemble of an RNN and a Naive Bayes classifier towards sentiment analysis for codemixed data."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-31",
"text": "More generally for monolingual sentiment analysis, RNNs and other sequential deep learning models have been shown to be successful."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-32",
"text": "[Socher et al., 2012] obtained significant performance improvement by incorporating compositional vector representations over single vector representations."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-33",
"text": "[Zheng and Xia, 2018] take a different approach by capturing the most important words on either side to perform targeted sentiment analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-34",
"text": "Their LSTM based model uses context2target attention to achieve better benchmark performance on three datasets."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-35",
"text": "[Singh et al., 2018a] developed a dataset for Hi-En codemixed Part of Speech tagging, and proposed a CRF based approach."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-36",
"text": "[Singh et al., 2018b] developed a dataset for Hindi English Codemixed Language Identification and NER, and propose a CRF based approach with handcrafted features for Named Entity Recognition."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-37",
"text": "Bengio et al. [2009] introduced curriculum learning approaches towards both vision and language related tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-38",
"text": "They show significant convergence and performance gains for the language modelling task."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-39",
"text": "[Hashimoto et al., 2016] propose a hierarchical multitask neural architecture with the lower layers performing syntactic tasks, and the higher layers performing the more involved semantic tasks while using the lower layer predictions."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-40",
"text": "[Swayamdipta et al., 2018] also propose a syntactico-semantic curriculum with chunking, semantic role labelling and coreference resolution, and show performance gains over strong baselines."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-41",
"text": "Like [Hashimoto et al., 2016] , they hypothesize the incorporation of simpler syntactic information into semantic tasks, and provide empirical evidence for the same."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-42",
"text": "**DATASETS** [Prabhu et al., 2016] released a Hi-En codemixed dataset for Sentiment Analysis, comprising 3879 Facebook comments on the public pages of Salman Khan and Narendra Modi."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-43",
"text": "Comments are annotated as positive, negative and neutral based on their sentiment polarity, and are distributed across the 3 classes as 15% negative, 50% neutral and 35% positive comments."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-44",
"text": "[Singh et al., 2018a] released a Twitter corpus for Part of Speech tagging of Hindi-English codemixed tweets about 5 incidents, and annotated 1489 tweets with the POS tag for each token."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-45",
"text": "[Singh et al., 2018b]"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-47",
"text": "**APPROACH**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-48",
"text": "In the following subsections, we introduce our proposed model architecture for processing the above tasks in a hierarchical manner, discuss the various curriculum strategies we experiment with, and finally discuss the transfer learning techniques we explore."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-50",
"text": "**MODEL**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-51",
"text": "We case normalise the texts and mask user mentions and URLs with special characters."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-52",
"text": "After tokenizing the texts, we append a token terminal \"*\" symbol to each token and further split each token into its constituent character trigrams."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-53",
"text": "Thus, a token \"girl\" is split into \"gir\" + \"l*#\", where \"#\" is the padding symbol for character trigrams."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-54",
"text": "Due to the high class imbalance in the Sentiment Analysis data, we perform a mixture of oversampling and undersampling between classes, rather than simply pruning the samples of the larger classes."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-55",
"text": "As shown in Figure 1, the model comprises an Embedding layer followed by two layers of bidirectional LSTMs."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-56",
"text": "The embedding layer serves as a lookup table for our character trigram dense representations, and a sequence of these representations are passed on to the LSTM stack for each input sample."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-57",
"text": "Layer 1 of the LSTM stack takes the sequence of character trigram embeddings as input, and is used to predict the corresponding POS tag and the Language tag at each time step."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-58",
"text": "The concatenation of the left and right directional hidden states at each time step i is passed to a standard softmax classifier to output the probability distribution over the POS tags at that timestep."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-59",
"text": "Similarly, another softmax classifier outputs the language tags at each timestep."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-60",
"text": "The LSTM Layer 2 takes the concatenated bidirectional hidden states of Layer 1 (H^(1)_i) as input to learn the sequence representation as an abstraction over the Layer 1 representations."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-61",
"text": "This architecture allows the semantic task to consider both the character trigram dense representation as well as \"POS\" and \"Language\" aware bidirectional representations to perform more complex semantic tasks like language modelling and sentiment analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-62",
"text": "For language modelling, the concatenated terminal hidden states from the right and left directional LSTM Layer 2 are passed to a standard softmax classifier, which outputs the probability distribution over the character trigram vocabulary for the next trigram in the input sequence."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-63",
"text": "For sentiment analysis, we concatenate the max pooling over time, avg pooling over time, and the terminal hidden states of the Layer 2 BiLSTM to form the representation."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-64",
"text": "The max-pooling and avg-pooling over time representations circumvent the information loss in sequence terminal representations."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-65",
"text": "This representation is passed to a standard softmax classifier to predict the sentiment polarity over the 3 classes."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-66",
"text": "For each of the tasks described above, we train our model to optimize the cross-entropy loss for the given prediction, formulated as L(y, p) = -log(p),"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-67",
"text": "where y is the true label, and p is the predicted probability of that label by the model."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-69",
"text": "**CURRICULUM TRAINING**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-70",
"text": "While our proposed model enables efficient transfer learning by progressive abstraction of representations for more complicated tasks, the highlight of the approach lies in the training regimen followed."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-71",
"text": "Curriculum learning can be seen as a sequence of training criteria [Bengio et al., 2009] , with increasing task or sample difficulty as the training progresses."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-72",
"text": "It is also closely related to transfer learning by pretraining, especially when the tasks form a logical hierarchy and contribute to the downstream tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-73",
"text": "With this purview, we propose a linguistic hierarchy of training tasks for codemixed languages, with further layers abstracting over the previous ones to achieve increasingly complicated tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-74",
"text": "Considering the codemixed nature of the texts and the linguistic hierarchy of information, we propose the tasks in the order: Language Identification, Part of Speech Tagging, Language Modelling, and further semantic tasks like sentiment analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-75",
"text": "Since tokens in codemixed texts have distinct semantic spaces based on their source language, Language Identification can incorporate this disparity among the learnt trigram representations."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-76",
"text": "Following this, the Part of Speech Tagging groups the words based on their logical semantic categories, and encodes simpler word category information in a sequence."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-77",
"text": "Also, as in [Singh et al., 2018a; Sharma et al., 2016] , Language Tag and Part of Speech Tag have previously been provided as manual handcrafted features for a range of downstream syntactic and semantic tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-78",
"text": "In addition to the above tasks, Language Model pretraining has shown significant performance gains as reported by [Howard and Ruder, 2018] ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-79",
"text": "It captures various aspects of language such as long range dependencies [Linzen et al., 2016] , word categories, and sentiment [Radford et al., 2017] ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-80",
"text": "Conforming with the linguistic hierarchical information, we first train our model to predict the language labels for each character trigram as per its token."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-81",
"text": "This is followed by further training the model to predict the Part of Speech tag for each of its character trigram as per its token."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-82",
"text": "Subsequently, the model is trained on the Language Modelling task, in the process training LSTM Layer 2 to build over the LSTM Layer 1 inputs and learn a meaningful sequence representation."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-83",
"text": "Lastly, the model is trained to predict the sentiment of the input text based on the LSTM Layer 2 representation."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-85",
"text": "**TRANSFER LEARNING**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-86",
"text": "As noted in earlier efforts [Howard and Ruder, 2018 ] towards finetuning pretrained models for NLP tasks, aggressive finetuning can cause catastrophic forgetting, thus causing the model to simply fit over the target task and forget any capabilities gained during the pretraining stage."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-87",
"text": "On the other hand, too cautious finetuning can cause slow convergence and overfitting."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-88",
"text": "To this end, we experiment with different strategies which can be broadly categorized as:"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-89",
"text": "Discriminative Finetuning : As also noted by [Yosinski et al., 2014] , different layers capture different types of information, and thus need to be optimised differently."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-90",
"text": "In the context of our model, the embedding layer captures the individual character trigram information, the LSTM layer 1 is trained towards capturing the token level information such as Part of Speech and Language Tag, and the final LSTM Layer 2 is trained to capture the overall textual representation to perform Language Modelling and Sentiment Analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-91",
"text": "With this purview, similar to [Howard and Ruder, 2018] , we propose optimizing different layers in our model to different extents, and keep lower step sizes for the deeper pretrained layers while finetuning on a downstream task."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-92",
"text": "We thus split the parameters as {\u03b8 1 , ..., \u03b8 l } , where \u03b8 i corresponds to the parameters of layer i, and optimize them with separate learning rates {\u03b7 1 , ..., \u03b7 l } ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-93",
"text": "Also, when finetuning a pretrained layer for a downstream task, we keep \u03b7 i < \u03b7 j ; \u2200i < j. Thus, while finetuning the POS + Lang Id pretrained model for Language Modelling, we keep the learning rates for the Embedding Layer and LSTM Layer 1 lower than that of LSTM Layer 2."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-94",
"text": "Similarly, when finetuning the Language Model for Sentiment Analysis, we keep the learning rates of the deeper layers lower than that of the shallower ones."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-95",
"text": "Gradual Unfreezing: Similar to [Howard and Ruder, 2018] , rather than updating all the layers together for finetuning, we explore gradual ordered unfreezing of layers."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-96",
"text": "Thus, initially we freeze all the layers."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-97",
"text": "Then starting from the last layer, we train the model for a certain number of epochs before unfreezing the layer below it."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-98",
"text": "Thus for Sentiment Analysis finetuning, for the first epoch, only \u03b8 sentiment receives the gradient updates, after which we unfreeze the \u03b8 lstm2 , and subsequently unfreeze the lower layers in a similar manner."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-100",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-101",
"text": "The input to the LSTM stack is the sequence of character trigram dense representations, which we keep as 64 dimensional vectors."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-102",
"text": "We also explore other token representations such as sequence of unigrams, convolution over unigrams [Prabhu et al., 2016] , and Byte Pair Encoding (BPE) [Sennrich et al., 2015] ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-103",
"text": "BPE is an unsupervised approach towards subword decomposition, and has shown improvements in MT systems and summarization."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-104",
"text": "We train our model from scratch for Sentiment Analysis using the above-mentioned encodings, and report the results in Table 3 ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-105",
"text": "Our LSTM stack consists of two layers of bidirectional LSTMs, with 64 hidden state dimensions."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-106",
"text": "We add a dropout layer with the dropout rate set to 0.2 between the LSTM layers to prevent overfitting."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-107",
"text": "We experiment with concatenating average pooling and max pooling over hidden states for semantic prediction, similar to [Howard and Ruder, 2018] , and observe an increase in model accuracy of 2.2% on sentiment analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-108",
"text": "To evaluate our baseline for curriculum training experiments, we initially train the model from scratch on the single target task (Sentiment Analysis) for 25 epochs."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-109",
"text": "We approach the evaluation of our curriculum by training the model sequentially on four subtasks: Language Identification, POS Tagging, Language Modelling, and Sentiment Analysis."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-110",
"text": "We evaluate the strategy of pretraining with only POS Tagging and Language Identification, and observe performance similar to no curriculum training."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-111",
"text": "We hypothesize the potential reasons for this result and find a significant divergence in character trigram occurrence between the Source Tasks (POS + Lang Id) and the Target Task (Sentiment Analysis)."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-112",
"text": "This experiment highlights the importance of including language model pretraining for better token-level representation learning, which provides a better model prior for the sequence representation (LSTM Layer 2 output)."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-113",
"text": "We experiment with only Language Modelling as pretraining task, and observe significant gains over no curriculum strategy."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-114",
"text": "We note the convergence of our model with and without curriculum training, and observe that the curriculum training regimen causes faster convergence, as has been observed in previous works [Bengio et al., 2009; Howard and Ruder, 2018] ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-115",
"text": "This is expected, as a model pretrained on prior tasks has already learned a general-purpose representation, and only needs to adapt to the idiosyncrasies of the target task, i.e. Sentiment Analysis in this case."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-116",
"text": "As discussed in Section 4.3, for our transfer learning optimization experiments, we segment the optimization of different parameters of our model with different learning rates, in order to limit catastrophic forgetting and interference among the tasks, as proposed by [Howard and Ruder, 2018] ."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-117",
"text": "We segregate our model parameters in the following 4 groups:"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-118",
"text": "\u2022 Emb Layer"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-119",
"text": "\u2022 LSTM Layer 1"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-120",
"text": "\u2022 LSTM Layer 2"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-122",
"text": "\u2022 Sentiment Linear Map"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-123",
"text": "We set the learning rate of the previous layer \u03b7 l\u22121 = \u03b7 l /2.0."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-124",
"text": "For our gradual unfreezing experiments, we unfreeze the next lower layer after training the model for 1 epoch with it frozen."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-126",
"text": "**FUTURE WORK**"
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-127",
"text": "In future work, we would like to explore word normalization and multilingual embeddings in conjunction with learning representations from scratch."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-128",
"text": "Another line of potential study is investigating why BPE leads to performance gains in monolingual domains but fails on the codemixed multilingual tasks in our experiments."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-129",
"text": "We would also like to explore convolution over character embeddings as a method to further circumvent the out-of-vocabulary problem with codemixed social media data."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-130",
"text": "We also plan to explore better representation learning for the semantic tasks."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-131",
"text": "One particular direction we plan to explore is attention over LSTM Layer 2, as a weighted peek into the intermediate hidden states for the semantic classification task."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-132",
"text": "In the future, we would like to experiment with more syntactic tasks such as dependency label prediction, and also study more semantic tasks such as aggression detection."
},
{
"sent_id": "74cd12a801d1f8a95f8898a8cef9c0-C001-133",
"text": "Codemixed domains suffer from severe resource scarcity, and thus vocabulary divergence between various datasets proves a roadblock to generalizable models, as observed in our POS + Language Id pretraining experiments."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-19"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-39"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-41"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-71",
"74cd12a801d1f8a95f8898a8cef9c0-C001-72",
"74cd12a801d1f8a95f8898a8cef9c0-C001-73"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-74",
"74cd12a801d1f8a95f8898a8cef9c0-C001-75",
"74cd12a801d1f8a95f8898a8cef9c0-C001-76",
"74cd12a801d1f8a95f8898a8cef9c0-C001-77",
"74cd12a801d1f8a95f8898a8cef9c0-C001-78"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-86",
"74cd12a801d1f8a95f8898a8cef9c0-C001-87"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-107"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-114"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-116"
]
],
"cite_sentences": [
"74cd12a801d1f8a95f8898a8cef9c0-C001-19",
"74cd12a801d1f8a95f8898a8cef9c0-C001-39",
"74cd12a801d1f8a95f8898a8cef9c0-C001-41",
"74cd12a801d1f8a95f8898a8cef9c0-C001-71",
"74cd12a801d1f8a95f8898a8cef9c0-C001-78",
"74cd12a801d1f8a95f8898a8cef9c0-C001-86",
"74cd12a801d1f8a95f8898a8cef9c0-C001-107",
"74cd12a801d1f8a95f8898a8cef9c0-C001-114",
"74cd12a801d1f8a95f8898a8cef9c0-C001-116"
]
},
"@MOT@": {
"gold_contexts": [
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-71",
"74cd12a801d1f8a95f8898a8cef9c0-C001-72",
"74cd12a801d1f8a95f8898a8cef9c0-C001-73"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-86",
"74cd12a801d1f8a95f8898a8cef9c0-C001-87"
]
],
"cite_sentences": [
"74cd12a801d1f8a95f8898a8cef9c0-C001-71",
"74cd12a801d1f8a95f8898a8cef9c0-C001-86"
]
},
"@SIM@": {
"gold_contexts": [
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-91",
"74cd12a801d1f8a95f8898a8cef9c0-C001-92",
"74cd12a801d1f8a95f8898a8cef9c0-C001-93",
"74cd12a801d1f8a95f8898a8cef9c0-C001-94"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-95",
"74cd12a801d1f8a95f8898a8cef9c0-C001-96",
"74cd12a801d1f8a95f8898a8cef9c0-C001-97",
"74cd12a801d1f8a95f8898a8cef9c0-C001-98"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-107"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-114"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-116"
]
],
"cite_sentences": [
"74cd12a801d1f8a95f8898a8cef9c0-C001-91",
"74cd12a801d1f8a95f8898a8cef9c0-C001-95",
"74cd12a801d1f8a95f8898a8cef9c0-C001-107",
"74cd12a801d1f8a95f8898a8cef9c0-C001-114",
"74cd12a801d1f8a95f8898a8cef9c0-C001-116"
]
},
"@USE@": {
"gold_contexts": [
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-91",
"74cd12a801d1f8a95f8898a8cef9c0-C001-92",
"74cd12a801d1f8a95f8898a8cef9c0-C001-93",
"74cd12a801d1f8a95f8898a8cef9c0-C001-94"
],
[
"74cd12a801d1f8a95f8898a8cef9c0-C001-95",
"74cd12a801d1f8a95f8898a8cef9c0-C001-96",
"74cd12a801d1f8a95f8898a8cef9c0-C001-97",
"74cd12a801d1f8a95f8898a8cef9c0-C001-98"
]
],
"cite_sentences": [
"74cd12a801d1f8a95f8898a8cef9c0-C001-91",
"74cd12a801d1f8a95f8898a8cef9c0-C001-95"
]
}
}
},
"ABC_4588d13c734d1ca0f348e056b1d39e_6": {
"x": [
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-101",
"text": "The penalization term on attention matrices is"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-2",
"text": "Pooling is an essential component of a wide variety of sentence representation and embedding models."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-3",
"text": "This paper explores generalized pooling methods to enhance sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-4",
"text": "We propose vector-based multi-head attention that includes the widely used max pooling, mean pooling, and scalar self-attention as special cases."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-5",
"text": "The model benefits from properly designed penalization terms to reduce redundancy in multi-head attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-6",
"text": "We evaluate the proposed model on three different tasks: natural language inference (NLI), author profiling, and sentiment classification."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-7",
"text": "The experiments show that the proposed model achieves significant improvement over strong sentence-encoding-based methods, resulting in state-of-the-art performances on four datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-8",
"text": "The proposed approach can be easily implemented for more problems than we discuss in this paper."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-11",
"text": "Distributed representation learned with neural networks has shown to be effective in modeling natural language at different granularities."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-12",
"text": "Learning representation for words (Bengio et al., 2000; Mikolov et al., 2013; Pennington et al., 2014) , for example, has achieved notable success."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-13",
"text": "Much remains to be done to model larger spans of text such as sentences or documents."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-14",
"text": "The approaches to computing sentence embedding generally fall into two categories."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-15",
"text": "The first consists of learning sentence embedding with unsupervised learning, e.g., auto-encoder-based models (Socher et al., 2011), Paragraph Vector (Le and Mikolov, 2014) , SkipThought vectors , FastSent (Hill et al., 2016) , among others."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-16",
"text": "The second category consists of models trained with supervised learning, such as convolutional neural networks (CNN) (Kim, 2014; Kalchbrenner et al., 2014) , recurrent neural networks (RNN) (Conneau et al., 2017; Bowman et al., 2015) , and tree-structure recursive networks (Socher et al., 2013; Tai et al., 2015) , to name a few."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-17",
"text": "Pooling is an essential component of a wide variety of sentence representation and embedding models."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-18",
"text": "For example, in recurrent-neural-network-based models, pooling is often used to aggregate hidden states at different time steps (i.e., words in a sentence) to obtain sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-19",
"text": "Convolutional neural networks (CNN) also often use max or mean pooling to obtain a fixed-size sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-20",
"text": "In this paper we explore generalized pooling methods to enhance sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-21",
"text": "Specifically, by extending scalar self-attention models such as those proposed in Lin et al. (2017) , we propose vector-based multi-head attention, which includes the widely used max pooling, mean pooling, and scalar self-attention itself as special cases."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-22",
"text": "On one hand, the proposed method allows for extracting different aspects of the sentence into multiple vector representations through the multi-head mechanism."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-23",
"text": "On the other, it allows the models to focus on one of many possible interpretations of the words encoded in the context vector through the vector-based attention mechanism."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-24",
"text": "In the proposed model we design penalization terms to reduce redundancy in multi-head attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-25",
"text": "We evaluate the proposed model on three different tasks: natural language inference, author profiling, and sentiment classification."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-26",
"text": "The experiments show that the proposed model achieves significant improvement over strong sentence-encoding-based methods, resulting in state-of-the-art performances on four datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-27",
"text": "The proposed approach can be easily implemented for more problems than we discuss in this paper."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-29",
"text": "**RELATED WORK**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-30",
"text": "There exist in the literature much previous work for sentence embedding with supervised learning, which mostly use RNN and CNN as building blocks."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-31",
"text": "For example, Bowman et al. (2015) used BiLSTMs as sentence embedding for the natural language inference task."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-32",
"text": "Kim (2014) used CNN with max pooling for sentence classification."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-33",
"text": "More complicated neural networks were also proposed for sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-34",
"text": "For example, Socher et al. (2013) introduced the Recursive Neural Tensor Network (RNTN) over parse trees to compute sentence embedding for sentiment analysis, and Tai et al. (2015) proposed tree-LSTM."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-35",
"text": "Yu and Munkhdalai (2017a) proposed a memory-augmented neural network, called the Neural Semantic Encoder (NSE), as sentence embedding for natural language understanding tasks."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-36",
"text": "Some recent research began to explore inner/self-sentence attention mechanism for sentence embedding, which can be classified into two categories: self-attention network and self-attention pooling."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-37",
"text": "Cheng et al. (2016) proposed an intra-sentence level attention mechanism on the base of LSTM, called LSTMN."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-38",
"text": "For each step in LSTMN, it calculated the attention between a certain word and its previous words."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-39",
"text": "Vaswani et al. (2017) proposed a self-attention network for the neural machine translation task."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-40",
"text": "The self-attention network uses multi-head scaled dot-product attention to represent each word by weighted summation of all word in the sentence."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-41",
"text": "Shen et al. (2017) proposed DiSAN, which is composed of a directional self-attention with temporal order encoded."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-42",
"text": "Shen et al. (2018) proposed the reinforced self-attention network (ReSAN), which integrates both soft and hard attention into one context fusion module with reinforcement learning."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-43",
"text": "Self-attention pooling has also been studied in previous work."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-44",
"text": "Liu et al. (2016) proposed inner-sentence attention based pooling methods for sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-45",
"text": "They calculate scalar attention between the LSTM states and the mean pooling using multi-layer perceptron (MLP) to obtain the vector representation for a sentence."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-46",
"text": "Lin et al. (2017) proposed a scalar structure/multi-head self-attention method for sentence embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-47",
"text": "The multi-head self-attention is calculated by a MLP with only LSTM states as input."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-48",
"text": "There are two main differences from our proposed method: (1) they used scalar attention instead of vectorial attention; (2) we propose different penalization terms which are suitable for vector-based multi-head self-attention, while their penalization term on the attention matrix is only designed for scalar multi-head self-attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-49",
"text": "Choi et al. (2018) proposed a fine-grained attention mechanism for neural machine translation, which also extends scalar attention to vectorial attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-50",
"text": "Shen et al. (2017) proposed multi-dimensional/vectorial self-attention pooling on top of a self-attention network instead of BiLSTM."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-51",
"text": "However, neither of them considered multi-head self-attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-53",
"text": "**THE MODEL**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-54",
"text": "In this section we describe the proposed models that enhance sentence embedding with generalized pooling approaches."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-55",
"text": "The pooling layer is built on a state-of-the-art sequence encoder layer."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-56",
"text": "Below, we first discuss the sequence encoder, which, when enhanced with the proposed generalized pooling, achieves state-of-the-art performance on three different tasks on four datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-58",
"text": "**SEQUENCE ENCODER**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-59",
"text": "The sequence encoder in our model takes as input the T word tokens of a sentence S = (w 1 , w 2 , . . . , w T )."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-60",
"text": "Each word w i is from the vocabulary V ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-61",
"text": "For each word we concatenate pre-trained word embedding and embedding learned from characters."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-62",
"text": "The character composition model feeds all characters of the word into a convolution neural network (CNN) with max pooling (Kim, 2014) ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-63",
"text": "The detailed experiment setup will be discussed in Section 4."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-64",
"text": "The sentence S is represented as a word embedding sequence: X = (e 1 , e 2 , . . . , e T ) \u2208 R T \u00d7de , where d e is the dimension of word embedding which concatenates embedding obtained from character composition and pretrained word embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-65",
"text": "To represent words and their context in sentences, the sentences are fed into stacked bidirectional LSTMs (BiLSTMs)."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-66",
"text": "Shortcut connections are applied, which concatenate word embeddings and input hidden states at each layer in the stacked BiLSTM except for the first (bottom) layer."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-67",
"text": "The formulae are as follows:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-68",
"text": "where hidden states h l t in layer l concatenate two directional hidden states of LSTM at time t. Then the sequence is represented as the hidden states in the top layer L:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-69",
"text": "For simplicity, we ignore the superscript L in the remainder of the paper."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-71",
"text": "**GENERALIZED POOLING**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-73",
"text": "**VECTOR-BASED MULTI-HEAD ATTENTION**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-74",
"text": "To transform a variable length sentence into a fixed size vector representation, we propose a generalized pooling method."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-75",
"text": "We achieve that by using a weighted summation of the T LSTM hidden vectors, and the weights are vectors rather than scalars, which can control every element in all hidden vectors:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-76",
"text": "where W 1 \u2208 R da\u00d72d and W 2 \u2208 R 2d\u00d7da are weight matrices; b 1 \u2208 R da and b 2 \u2208 R 2d are bias, where d a is the dimension of attention network and d is the dimension of LSTMs."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-77",
"text": "H \u2208 R T \u00d72d and A \u2208 R T \u00d72d are the hidden vectors at the top layer and weight matrices, respectively."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-78",
"text": "The softmax ensures that (A 1 , A 2 , . . . , A T ) are non-negative and sum to 1 at every element position in the vectors."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-79",
"text": "Then we sum up the LSTM hidden states H according to the weight vectors provided by A to get a vector representation v of the input sentence."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-80",
"text": "However, the vector representation usually focuses on a specific component of the sentence, like a special set of related words or phrases."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-81",
"text": "We extend the pooling method in a multi-head way:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-82",
"text": "where a i t indicates the vectorial attention from A i for the t-th token in the i-th head and \u2299 is the element-wise product (also called the Hadamard product)."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-83",
"text": "Thus the final representation is a concatenated vector v = [v 1 ; v 2 ; . . . ; v I ], where each v i captures different aspects of the sentence."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-84",
"text": "For example, some heads may represent the predicate of the sentence while other heads represent its arguments, which enhances the representation of sentences obtained with single-head attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-86",
"text": "**PENALIZATION TERMS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-87",
"text": "To reduce the redundancy of multi-head attention, we design penalization terms for vector-based multihead attention in order to encourage the diversity of summation weight across different heads of attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-88",
"text": "We propose three types of penalization terms."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-90",
"text": "**PENALIZATION TERM ON PARAMETER MATRICES**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-91",
"text": "The first penalization term is applied to parameter matrix W i 1 in Equation 5, as shown in the following formula:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-92",
"text": "Intuitively, we encourage different heads to have different parameters."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-93",
"text": "We maximize the Frobenius norm of the difference between two parameter matrices, thereby encouraging the diversity of different heads."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-94",
"text": "There is no further bonus once the Frobenius norm of the difference of two matrices exceeds a threshold \u03bb."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-95",
"text": "Similar to adding an L2 regularization term on neural networks, the penalization term P will be added to the original loss with a weight of \u00b5. Hyper-parameters \u03bb and \u00b5 need to be tuned on a development set."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-96",
"text": "We can also add constrains on W i 2 in a similar way, but we did not observe further improvement in our experiments."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-97",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-98",
"text": "**PENALIZATION TERM ON ATTENTION MATRICES**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-99",
"text": "The second penalization term is added on attention matrices."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-100",
"text": "Instead of using \u2016AA^T \u2212 I\u2016_F^2 to encourage the diversity of the scalar attention matrix as in Lin et al. (2017) , we propose the following formula to encourage the diversity of vectorial attention matrices."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-172",
"text": "Table 3 shows the results of different models for the Yelp and the Age dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-102",
"text": "where \u03bb and \u00b5 are hyper-parameters which need to be tuned based on a development set."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-103",
"text": "Intuitively, we try to encourage the diversity of any two different A i under the threshold \u03bb."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-105",
"text": "**PENALIZATION TERM ON SENTENCE EMBEDDINGS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-106",
"text": "In addition, we propose to add a penalization term on multi-head sentence embedding v i directly as follows:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-107",
"text": "where \u03bb and \u00b5 are hyper-parameters."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-108",
"text": "Here we try to maximize the l 2 -norm of the difference between any two different heads of sentence embeddings, under the threshold \u03bb."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-110",
"text": "**TOP-LAYER CLASSIFIERS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-111",
"text": "The output of pooling is fed to a top-layer classifier to solve different problems."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-112",
"text": "In this paper we evaluate our sentence embedding models on three different tasks: natural language inference (NLI), author profiling, and sentiment classification, on four datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-113",
"text": "The evaluation covers two typical types of problems."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-114",
"text": "The author profiling and sentiment tasks classify individual sentences into different categories and the two NLI tasks classify sentence pairs."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-115",
"text": "For the NLI tasks, to enhance the relationship between sentence pairs, we concatenate the embeddings of two sentences with their absolute difference and element-wise product (Mou et al., 2016) as the input to the multilayer perceptron (MLP) classifier:"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-116",
"text": "where is the element-wise product."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-117",
"text": "The MLP has two hidden layers with ReLU activation with shortcut connections and a softmax output layer."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-118",
"text": "The entire model is trained end-to-end through minimizing the cross-entropy loss."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-119",
"text": "Note that for the two classification tasks on individual sentences (i.e., the author profiling and sentiment classification task), we use the same MLP classifiers described above for sentence pair classification."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-120",
"text": "But instead of concatenating two sentences, we directly feed a sentence embedding into MLP."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-121",
"text": "4 Experimental Setup 4.1 Data SNLI The SNLI (Bowman et al., 2015) is a large dataset for natural language inference."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-122",
"text": "The task detects three relationships between a premise and a hypothesis sentence: the premise entails the hypothesis (entailment), they contradict each other (contradiction), or they have a neutral relation (neutral)."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-123",
"text": "We use the same data split as in Bowman et al. (2015) , i.e., 549.367 samples for training, 9,842 samples for development and 9,824 samples for testing."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-124",
"text": "MultiNLI MultiNLI (Williams et al., 2017 ) is another natural language inference dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-125",
"text": "The data are collected from a broader range of genres such as fiction, letters, telephone speech, and 9/11 reports."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-126",
"text": "Half of these 10 genres are used in training while the rest are not, resulting in-domain and cross-domain development and test sets used to test NLI systems."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-127",
"text": "We use the same data split as in Williams et al. (2017) , i.e., 392,702 samples for training, 9,815/9,832 samples for in-domain/cross-domain development, and 9,796/9,847 samples for in-domain/cross-domain testing."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-128",
"text": "Note that, we do not use SNLI as an additional training/development set in our experiments."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-129",
"text": "Age Dataset To compare our models with that of Lin et al. (2017) , we use the same Age dataset in our experiment here, which is an Author Profiling dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-130",
"text": "The dataset are extracted from the Author Profiling dataset 1 , which consists of tweets from English Twitter."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-131",
"text": "The task is to predict the age range of authors of input tweets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-132",
"text": "The age range are split into 5 classes: 18-24, 25-34, 35-49, 50-64, 65+."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-133",
"text": "We use the same data split as in Lin et al. (2017) , i.e., 68,485 samples for training, 4,000 for development, and 4,000 for testing."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-135",
"text": "**YELP DATASET**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-136",
"text": "The Yelp dataset 2 is a sentiment analysis task, which takes reviews as input and predicts the level of sentiment in terms of the number of stars, from 1 to 5 stars, where 5-star means the most positive."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-137",
"text": "We use the same data split as in Lin et al. (2017) , i.e., 500,000 samples for training, 2,000 for development, and 2,000 for testing."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-139",
"text": "**TRAINING DETAILS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-140",
"text": "We implement our algorithm with Theano (Theano Development Team, 2016) framework."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-141",
"text": "We use the development set (in-domain development set for MultiNLI) to select models for testing."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-142",
"text": "To help replicate our results, we publish our code 3 , which is developed from our codebase for multiple tasks (Chen et al., 2018; Chen et al., 2017a; Chen et al., 2016; ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-143",
"text": "Specifically, we use Adam (Kingma and Ba, 2014) for optimization."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-144",
"text": "The initial learning rate is 4e-4 for SNLI and MultiNLI, 2e-3 for Age dataset, 1e-3 for Yelp dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-145",
"text": "For SNLI and MultiNLI dataset, stacked BiLSTMs have 3 layers."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-146",
"text": "For Age and Yelp dataset, stacked BiLSTMs have 1 layer."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-147",
"text": "The hidden states of BiLSTMs for each direction and MLP are 300 dimension, except for SNLI whose dimensions are 600."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-148",
"text": "We clip the norm of gradients to make it smaller than 10 for SNLI and MultiNLI, and 0.5 for Age and Yelp dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-149",
"text": "The character embedding has 15 dimensions, and 1D-CNN filters lengths are 1, 3 and 5, respectively."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-150",
"text": "Each filter has 100 feature maps, resulting in 300 dimensions for character-composition embedding."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-151",
"text": "We initialize wordlevel embedding with pre-trained GloVe-840B-300D embeddings (Pennington et al., 2014) and initialize out-of-vocabulary words randomly with a Gaussian distribution."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-152",
"text": "The word-level embedding is fixed during training for SNLI and MultiNLI dataset, but updated during training for Age and Yelp dataset, which is determined by the performance on development sets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-153",
"text": "The mini-batch size is 128 for SNLI and 32 for the rest."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-154",
"text": "We use 5 heads generalized pooling for all tasks. And d a is 600 for SNLI and 300 for the other datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-155",
"text": "For the penalization term, we choose \u03bb = 1; the penalization weight \u00b5 is selected from [1,1e-1,1e-2,1e-3,1e-4] based on performances on the development sets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-157",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-158",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-159",
"text": "**OVERALL PERFORMANCE**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-160",
"text": "For the NLI tasks, there are many ways to add cross-sentence (Rockt\u00e4schel et al., 2015; Parikh et al., 2016; Chen et al., 2017a) level attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-161",
"text": "To ensure the comparison is fair, we only compare methods that use sentence-encoding-based models; i.e., cross-sentence attention is not allowed."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-162",
"text": "Note that this Model Test 100D LSTM (Bowman et al., 2015) 77.6 300D LSTM (Bowman et al., 2016) 80.6 1024D GRU (Vendrov et al., 2015) 81.4 300D Tree CNN (Mou et al., 2016) 82.1 600D SPINN-PI (Bowman et al., 2016) 83.3 600D BiLSTM (Liu et al., 2016) 83.3 300D NTI-SLSTM-LSTM (Yu and Munkhdalai, 2017b) 83.4 600D BiLSTM intra-attention (Liu et al., 2016) 84.2 600D BiLSTM self-attention (Lin et al., 2017) 84.4 4096D BiLSTM max pooling (Conneau et al., 2017) 84.5 300D NSE (Yu and Munkhdalai, 2017a) 84.6 600D BiLSTM gated-pooling (Chen et al., 2017b) 85.5 300D DiSAN (Shen et al., 2017) 85.6 300D Gumbel TreeLSTM (Choi et al., 2017) 85.6 600D Residual stacked BiLSTM (Nie and Bansal, 2017) 85.7 300D CAFE (Tay et al., 2018) 85.9 600D Gumbel TreeLSTM (Choi et al., 2017) 86.0 1200D Residual stacked BiLSTM (Nie and Bansal, 2017) 86.0 300D ReSAN (Shen et al., 2018) 86.3 1200D BiLSTM max pooling 85.3 1200D BiLSTM mean pooling 85.0 1200D BiLSTM last pooling 84.9 1200D BiLSTM generalized pooling 86.6 Table 1 : Accuracies of the models on the SNLI dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-163",
"text": "follows the setup in the RepEval-2017 Shared Task."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-164",
"text": "Table 1 shows the results of different models for NLI, consisting of results of previous work on sentence-encoding-based models, plus the performance of our baselines and that of the model proposed in this paper."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-165",
"text": "We have three additional baseline models: the first uses max pooling on top of BiLSTM, which achieves an accuracy of 85.3%; the second uses mean pooling on top of BiLSTM, which achieves an accuracy of 85.0%; the third uses last pooling, i.e., concatenating the last hidden states of forward and backward LSTMs, which achieves an accuracy of 84.9%."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-166",
"text": "Instead of using heuristic pooling methods, the proposed sentence-encoding-based model with generalized pooling achieves a new state-of-the-art accuracy of 86.6% on the SNLI dataset; the improvement over the baseline with max pooling is statistically significant under the one-tailed paired t-test at the 99.999% significance level."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-167",
"text": "The previous state-of-the-art model ReSAN (Shen et al., 2018) used a hybrid of hard and soft attention model with reinforced learning achieved an accuracy of 86.3%."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-168",
"text": "Table 2 shows the results of different models on the MultiNLI dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-169",
"text": "The first group is the results of previous sentence-encoding-based models."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-170",
"text": "The proposed model with generalized pooling achieves an accuracy of 73.8% on the in-domain test set and 74.0% on the cross-domain test set; both improve over the baselines using max pooling, mean pooling and last pooling."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-171",
"text": "In addition, the results on cross-domain test set yield a new state of the art at an accuracy of 74.0%, which is better than 73.6% of shortcut-stacked BiLSTM (Nie and Bansal, 2017) ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-173",
"text": "The BiLSTM with self-attention proposed by Lin et al. (2017) achieves better result than CNN and BiLSTM with max pooling."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-174",
"text": "One of our baseline models using max pooling on BiLSTM achieves accuracies of 65.00% and 82.30% on the Yelp and the Age dataset respectively, which is already better than the self-attention model proposed by Lin et al. (2017) ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-175",
"text": "We also show that the results of baseline with mean pooling and last pooling, in which mean pooling achieves the best result on the Yelp dataset among three baseline models and max pooling achieves the best on the Age dataset among three baselines."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-176",
"text": "Our proposed generalized pooling method obtains further improvement on these already strong baselines, achieving 66.55% on the Yelp dataset and 82.63% on the Age dataset (statistically significant p < 0.00001 against best baselines),"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-178",
"text": "**MODEL**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-179",
"text": "In Cross CBOW (Williams et al., 2017) 64.8 64.5 BiLSTM (Williams et al., 2017) 66.9 66.9 BiLSTM gated-pooling (Chen et al., 2017b) 73.5 73.6 Shortcut stacked BiLSTM (Nie and Bansal, 2017) (Lin et al., 2017) 61.99 77.30 CNN max pooling (Lin et al., 2017) 62.05 78.15 BiLSTM self-attention (Lin et al., 2017) which are also new state of the art performances on these two datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-181",
"text": "**DETAILED ANALYSIS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-182",
"text": "Effect of Multiple Vectors/Scalars To compare the difference between vector-based attention and scalar attention, we draw the learning curves of different models using different heads on the SNLI development dataset without penalization terms as in Figure 1 ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-183",
"text": "The green lines indicate scalar selfattention pooling added on top of the BiLSTMs, same as in Lin et al. (2017) , and the blue lines indicate vector-based attention used in our generalized pooling methods."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-184",
"text": "It is obvious that the vector-based attention achieves improvement over scalar attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-185",
"text": "Different line styles are used to indicate selfattention using different numbers of multi-head, ranging from 1, 3, 5, 7 to 9."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-186",
"text": "For vector-based attention, the 9-head model achieves the best accuracy of 86.8% on the development set."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-187",
"text": "For scalar attention, the 7-head model achieves the best accuracy of 86.4% on the development set."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-188",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-189",
"text": "**EFFECT OF PENALIZATION TERMS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-190",
"text": "To analyze the effect of penalization terms, we show the results with/without penalization terms on the four datasets in Table 4 ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-191",
"text": "Without using any penalization terms, the proposed generalized pooling achieves an accuracy of 86.4% on the SNLI dataset, which is already slightly better than previous models (compared to accuracy 86.3% in Shen et al. (2018) )."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-192",
"text": "When we use penalization on parameter matrices, the proposed model achieves a further improvement with an accuracy of 86.6%."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-193",
"text": "In addition, we also observe a significant improvement on MultiNLI, Yelp and Age dataset after using the penalization terms."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-194",
"text": "For the MultiNLI dataset, the proposed model with penalization on parameter matrices achieves an accuracy of 73.8% and 74.0% on the in-domain and the cross-domain test set, respectively, which outperform the accuracy of 73.7% and 73.4% of the model without penalization, respectively."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-195",
"text": "For the Yelp dataset, the proposed model with penalization on parameter matrices achieves the best results among the three penalization methods, which also improve the accuracy of 65.25% to 66.55% compared to the models without penalization."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-196",
"text": "For the Age dataset, the proposed model with penalization on attention matrices achieves the best accuracy of 82.63%, compared to the 82.18% accuracy of the model without penalization."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-197",
"text": "In general, the penalization on parameter matrices achieves the most effective improvement among most of these tasks, except for the Age dataset."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-198",
"text": "To verify whether the penalization term P discourages the redundancy in the sentence embedding, we Table 4 : Performance with/without the penalization term."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-199",
"text": "The penalization weight is selected from [1,1e-1,1e-2,1e-3,1e-4] on the development sets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-200",
"text": "visualize the vectorial multi-head attention according."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-201",
"text": "We compare two models with the same hyperparameters except that one is with penalization on attention matrices and the other without penalization."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-202",
"text": "We pick a sentence from the development set of the Age data: Martin Luther King \"I was not afraid of the words of the violent, but of the silence of the honest\" , with the gold label being the category of 65+."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-203",
"text": "We plot all 5 heads of attention matrices as in Figure 2 ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-204",
"text": "From the figure we can tell that the model trained without the penalization term has much more redundancy between different heads of attention ( Figure 3b ), resulting in putting significant focus on the word \"Martin\" in the 1st, 3rd and 5th head, and on the word \"violent\" in the 2nd and 4th head."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-205",
"text": "However in Figure 3a , the model with penalization shows much more variation between different heads."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-206",
"text": "----------------------------------"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-207",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-208",
"text": "In this paper, we propose a generalized pooling method for sentence embedding through vector-based multi-head attention, which includes the widely used max pooling, mean pooling, and scalar selfattention as its special cases."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-209",
"text": "Specifically the proposed model aims to use vectors to enrich the expressiveness of attention mechanism and leverage proper penalty terms to reduce redundancy in multi-head attention."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-210",
"text": "We evaluate the proposed approach on three different tasks: natural language inference, author profiling, and sentiment classification."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-211",
"text": "The experiments show that the proposed model achieves significant improvement over strong sentence-encoding-based methods, resulting in state-of-the-art performances on four datasets."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-212",
"text": "The proposed approach can be easily implemented for more problems than we discuss in this paper."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-213",
"text": "Our future work includes exploring more effective MLP to use the structures of multi-head vectors, inspired by the idea from Lin et al. (2017) ."
},
{
"sent_id": "4588d13c734d1ca0f348e056b1d39e-C001-214",
"text": "Leveraging structure information from syntactic and semantic parses is another direction interesting to us."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"4588d13c734d1ca0f348e056b1d39e-C001-20",
"4588d13c734d1ca0f348e056b1d39e-C001-21",
"4588d13c734d1ca0f348e056b1d39e-C001-22",
"4588d13c734d1ca0f348e056b1d39e-C001-23"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-46",
"4588d13c734d1ca0f348e056b1d39e-C001-47",
"4588d13c734d1ca0f348e056b1d39e-C001-48"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-173"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-183",
"4588d13c734d1ca0f348e056b1d39e-C001-184",
"4588d13c734d1ca0f348e056b1d39e-C001-185",
"4588d13c734d1ca0f348e056b1d39e-C001-186",
"4588d13c734d1ca0f348e056b1d39e-C001-187"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-213"
]
],
"cite_sentences": [
"4588d13c734d1ca0f348e056b1d39e-C001-21",
"4588d13c734d1ca0f348e056b1d39e-C001-173",
"4588d13c734d1ca0f348e056b1d39e-C001-183",
"4588d13c734d1ca0f348e056b1d39e-C001-213"
]
},
"@SIM@": {
"gold_contexts": [
[
"4588d13c734d1ca0f348e056b1d39e-C001-20",
"4588d13c734d1ca0f348e056b1d39e-C001-21",
"4588d13c734d1ca0f348e056b1d39e-C001-22",
"4588d13c734d1ca0f348e056b1d39e-C001-23"
]
],
"cite_sentences": [
"4588d13c734d1ca0f348e056b1d39e-C001-21"
]
},
"@MOT@": {
"gold_contexts": [
[
"4588d13c734d1ca0f348e056b1d39e-C001-20",
"4588d13c734d1ca0f348e056b1d39e-C001-21",
"4588d13c734d1ca0f348e056b1d39e-C001-22",
"4588d13c734d1ca0f348e056b1d39e-C001-23"
]
],
"cite_sentences": [
"4588d13c734d1ca0f348e056b1d39e-C001-21"
]
},
"@DIF@": {
"gold_contexts": [
[
"4588d13c734d1ca0f348e056b1d39e-C001-46",
"4588d13c734d1ca0f348e056b1d39e-C001-47",
"4588d13c734d1ca0f348e056b1d39e-C001-48"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-100",
"4588d13c734d1ca0f348e056b1d39e-C001-101",
"4588d13c734d1ca0f348e056b1d39e-C001-102",
"4588d13c734d1ca0f348e056b1d39e-C001-103"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-174"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-183",
"4588d13c734d1ca0f348e056b1d39e-C001-184",
"4588d13c734d1ca0f348e056b1d39e-C001-185",
"4588d13c734d1ca0f348e056b1d39e-C001-186",
"4588d13c734d1ca0f348e056b1d39e-C001-187"
]
],
"cite_sentences": [
"4588d13c734d1ca0f348e056b1d39e-C001-100",
"4588d13c734d1ca0f348e056b1d39e-C001-174",
"4588d13c734d1ca0f348e056b1d39e-C001-183"
]
},
"@USE@": {
"gold_contexts": [
[
"4588d13c734d1ca0f348e056b1d39e-C001-129"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-133"
],
[
"4588d13c734d1ca0f348e056b1d39e-C001-137"
]
],
"cite_sentences": [
"4588d13c734d1ca0f348e056b1d39e-C001-129",
"4588d13c734d1ca0f348e056b1d39e-C001-133",
"4588d13c734d1ca0f348e056b1d39e-C001-137"
]
},
"@FUT@": {
"gold_contexts": [
[
"4588d13c734d1ca0f348e056b1d39e-C001-213"
]
],
"cite_sentences": [
"4588d13c734d1ca0f348e056b1d39e-C001-213"
]
}
}
},
"ABC_310272015a781b05c42015c0559b18_6": {
"x": [
{
"sent_id": "310272015a781b05c42015c0559b18-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-2",
"text": "Several computational simulations of how children solve the word segmentation problem have been proposed, but most have been applied only to a limited number of languages."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-3",
"text": "One model with some experimental support uses distributional statistics of sound sequence predictability (Saffran et al. 1996) ."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-4",
"text": "However, the experimental design does not fully specify how predictability is best measured or modeled in a simulation."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-5",
"text": "Saffran et al. (1996) assume transitional probability, but Brent (1999a) claims mutual information (MI) is more appropriate."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-6",
"text": "Both assume predictability is measured locally, relative to neighboring segment-pairs."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-7",
"text": "This paper replicates Brent's (1999a) mutualinformation model on a corpus of childdirected speech in Modern Greek, and introduces a variant model using a global threshold."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-8",
"text": "Brent's finding regarding the superiority of MI is confirmed; the relative performance of local comparisons and global thresholds depends on the evaluation metric."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-11",
"text": "A substantial portion of research in child language acquisition focuses on the word segmentation problem-how children learn to extract words (or word candidates) from a continuous speech signal prior to having acquired a substantial vocabulary."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-12",
"text": "While a number of robust strategies have been proposed and tested for infants learning English and a few other languages (discussed in Section 1.1), it is not clear whether or how these apply to all or most languages."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-13",
"text": "In addition, experiments on infants often leave undetermined many details of how particular cues are actually used."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-14",
"text": "Computational simulations of word segmentation have also focused mainly on data from English corpora, and should also be extended to cover a broader range of the corpora available."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-15",
"text": "The line of research proposed here is twofold: on the one hand we wish to understand the nature of the cues present in Modern Greek, on the other we wish to establish a framework for orderly comparison of word segmentation algorithms across the desired broad range of languages."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-16",
"text": "Finite-state techniques, used by e.g., Belz (1998) in modeling phonotactic constraints and syllable within various languages, provide one straightforward way to formulate some of these comparisons, and may be useful in future testing of multiple cues."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-17",
"text": "Previous research (Rytting, 2004) examined the role of utterance-boundary information in Modern Greek, implementing a variant of Aslin and colleagues' (1996) model within a finite-state framework."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-18",
"text": "The present paper examines more closely the proposed cue of segment predictability."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-19",
"text": "These two studies lay the groundwork for examining the relative worth of various cues, separately and as an ensemble."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-21",
"text": "**INFANT STUDIES**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-22",
"text": "Studies of English-learning infants find the earliest evidence for word segmentation and acquisition between 6 and 7.5 months (Jusczyk and Aslin, 1995) although many of the relevant cues and strategies seem not to be learned until much later."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-23",
"text": "Several types of information in the speech signal have been identified as likely cues for infants, including lexical stress, co-articulation, and phonotactic constraints (see e.g., Johnson & Jusczyk, 2001 for a review)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-24",
"text": "In addition, certain heuristics using statistical patterns over (strings of) segments have also been shown to be helpful in the absence of other cues."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-25",
"text": "One of these (mentioned above) is extrapolation from the segmental context near utterance boundaries to predict word boundaries ."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-26",
"text": "Another proposed heuristic utilizes the relative predictability of the following segment or syllable."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-27",
"text": "For example, Saffran et al. (1996) have confirmed the usefulness of distributional cues for 8-month-olds on artificially designed micro-languages-albeit with English-learning infants only."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-28",
"text": "The exact details of how infants use these cues are unknown, since the patterns in their stimuli fit several distinct models (see Section 1.2)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-29",
"text": "Only further research will tell how and to what degree these strategies are actually useful in the context of natural language-learning settings-particularly for a broad range of languages."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-30",
"text": "However, what is not in doubt is that infants are sensitive to the cues in question, and that this sensitivity begins well before the infant has acquired a large vocabulary."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-32",
"text": "**IMPLEMENTATIONS AND AMBIGUITIES**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-33",
"text": "While the infant studies discussed above focus primarily on the properties of particular cues, computational studies of word-segmentation must also choose between various implementations, which further complicates comparisons."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-34",
"text": "Several models (e.g., Batchelder, 2002; Brent's (1999a) MBDP-1 model; Davis, 2000; de Marcken, 1996; Olivier, 1968) simultaneously address the question of vocabulary acquisition, using previously learned word-candidates to bootstrap later segmentations."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-35",
"text": "(It is beyond the scope of this paper to discuss these in detail; see Brent 1999a,b for a review.)"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-36",
"text": "Other models do not accumulate a stored vocabulary, but instead rely on the degree of predictability of the next syllable (e.g., Saffran et al., 1996) or segment (e.g., Christiansen et al., 1998) ."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-37",
"text": "The intuition here, first articulated by Harris (1954) , is that word boundaries are marked by a spike in unpredictability of the following phoneme."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-38",
"text": "The results from Saffran et al. (1996) show that English-learning infants do respond to areas of unpredictability; however, it is not clear from the experiment how this unpredictability is best measured."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-39",
"text": "Two specific ambiguities in measuring (un)predictability are examined here."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-40",
"text": "Brent (1999a) points out one type of ambiguity, namely that Saffran and colleagues' (1996) results can be modeled as favoring word-breaks at points of either low transitional probability or low mutual information."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-41",
"text": "Brent reports results for models relying on each of these measures."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-42",
"text": "It should be noted that these models are not the main focus of his paper, but are provided for illustrative purposes; nevertheless, they provide the best comparison to Saffran and colleagues' experiment, and may be regarded as an implementation of it."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-43",
"text": "Brent (1999a) compares these two models in terms of word tokens correctly segmented (see Section 3 for exact criteria), reporting approximately 40% precision and 45% recall for transitional probability (TP) and 50% precision and 53% recall for mutual information (MI) on the first 1000 utterances of his corpus (with improvements given larger corpora)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-44",
"text": "Indeed, their performance on word tokens is surpassed only by Brent's main model (MBDP-1), which seems to have about 73% precision and 67% recall for the same range."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-47",
"text": "Another question which Saffran et al. (1996) leave unanswered is whether the segmentation depends on local or global comparisons of predictability."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-48",
"text": "Saffran et al. assume implicitly, and Brent (1999a) explicitly, that the proper comparison is local; in Brent's case, it depends solely on the adjacent pairs of segments."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-49",
"text": "However, predictability measures for segmental bigrams (whether TP or MI) may be compared in any number of ways."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-50",
"text": "One straightforward alternative to the local comparison is to compare the predictability measures to some global threshold."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-51",
"text": "Indeed, Christiansen et al. (1998) simply assumed the mean activation level as a global activation threshold within their neural network framework."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-53",
"text": "**GLOBAL AND LOCAL COMPARISONS**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-54",
"text": "The global comparison, taken on its own, seems a rather simplistic and inflexible heuristic: for any pair of phonemes xy, either a word boundary is always hypothesized between x and y, or it never is."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-55",
"text": "Clearly, there are many cases where x and y sometimes straddle a word boundary and sometimes do not."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-56",
"text": "The heuristic also takes no account of lengths of possible words."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-57",
"text": "However, the local comparison may take length into account too much, disallowing words of certain lengths."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-58",
"text": "In order to see that, we must examine Brent's (1999a) suggested implementation of Saffran et al. (1996) more closely."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-59",
"text": "In the local comparison, given some string \u2026wxyz\u2026, in order for a word boundary to be inserted between x and y, the predictability measure for xy must be lower than both that of wx and of yz."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-60",
"text": "It follows that neither wx nor yz can have word boundaries between them, since they cannot simultaneously have a lower predictability measure than xy."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-61",
"text": "This means that, within an utterance, word boundaries must have at least two segments between them, so this heuristic will not correctly segment utterance-internal one-phoneme words."
},
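The local-comparison rule described in the preceding sentences can be sketched as a short function. This is a minimal illustrative reimplementation, not Brent's (1999a) actual code; the score table, the floor score for unseen bigrams, and the treatment of utterance edges as infinitely predictable neighbours are all assumptions:

```python
def segment_local(utt, score, missing=-99.0):
    """Insert a boundary between x and y iff the bigram score of (x, y) is
    strictly lower than both neighbouring bigram scores (wx and yz).
    Utterance edges are treated as infinitely predictable neighbours."""
    s = [score.get(bg, missing) for bg in zip(utt, utt[1:])]
    out = [utt[0]]
    for i, y in enumerate(utt[1:]):
        left = s[i - 1] if i > 0 else float('inf')
        right = s[i + 1] if i + 1 < len(s) else float('inf')
        if s[i] < left and s[i] < right:
            out.append('#')
        out.append(y)
    return ''.join(out)

# Only the local minimum (b,c) receives a boundary; two adjacent bigrams
# can never both be local minima, so utterance-internal boundaries are
# always at least two segments apart.
scores = {('a', 'b'): 2.0, ('b', 'c'): 1.0, ('c', 'd'): 3.0}
print(segment_local("abcd", scores))  # ab#cd
```

Note how this makes the text's point concrete: since a boundary at (x,y) forces both neighbouring bigrams to stay unsegmented, an utterance-internal one-phoneme word can never be produced.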
{
"sent_id": "310272015a781b05c42015c0559b18-C001-62",
"text": "Granted, only a few one-phoneme word types exist in either English or Greek (or other languages)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-63",
"text": "However, these words are often function words and so are less likely to appear at edges of utterances (e.g., ends of utterances for articles and prepositions; beginnings for postposed elements)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-64",
"text": "Neither Brent's (1999a) implementation of Saffran et al.'s (1996) heuristic nor the utterance-boundary heuristic can explain how these might be learned."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-65",
"text": "Brent (1999a) himself points out another length-related limitation, namely the relative difficulty that the 'local comparison' heuristic has in segmenting longer words."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-66",
"text": "The bigram MI frequencies may be most strongly influenced by, and thus as an aggregate largely encode, the most frequent, shorter words."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-67",
"text": "Longer words cannot be memorized in this representation (although common word-parts such as prefixes and suffixes might be)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-68",
"text": "In order to test for this, Brent proposes that precision for word types (which he calls \"lexicon precision\") be measured as well as for word tokens."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-69",
"text": "While the word-token metric emphasizes the correct segmentation of frequent words, the word-type metric does not share this bias."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-70",
"text": "Brent defines this metric as follows: \"After each block [of 500 utterances], each word type that the algorithm produced was labeled a true positive if that word type had occurred anywhere in the portion of the corpus processed so far; otherwise it is labeled a false positive."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-71",
"text": "\" Measured this way, MI yields a word type precision of only about 27%; transitional probability yields a precision of approximately 24% for the first 1000 utterances, compared to 42% for MBDP-1."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-72",
"text": "He does not measure word type recall."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-73",
"text": "This same limitation in finding longer, less frequent types may apply to comparisons against a global threshold as well."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-74",
"text": "This is also in need of testing."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-75",
"text": "It seems that both global and local comparisons, used on their own as sole or decisive heuristics, may have serious limitations."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-76",
"text": "It is not clear a priori which limitation is most serious; hence both comparisons are tested here."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-78",
"text": "**CONSTRUCTING A FINITE-STATE MODEL**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-80",
"text": "**OUTLINE OF CURRENT RESEARCH**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-81",
"text": "While in its general approach the study reported here replicates the mutual-information and transitional-probability models in Brent (1999a) , it differs slightly in the details of their use."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-82",
"text": "First, whereas Brent dynamically updated his measures over a single corpus, and thus blurred the line between training and testing data, our model precompiles statistics for each distinct bigram-type offline, over a separate training corpus."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-83",
"text": "Secondly, we compare the use of a global threshold (described in more detail in Section 2.3, below) to Brent's (1999a) use of the local context (as described in Section 1.3 above)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-84",
"text": "Like Brent (1999a), but unlike Saffran et al. (1996), our model focuses on pairs of segments, not on pairs of syllables."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-85",
"text": "While Modern Greek syllabic structure is not as complicated as English's, it is still more complicated than the CV structure assumed in Saffran et al. (1996) ; hence, access to syllabification cannot be assumed."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-87",
"text": "**CORPUS DATA**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-88",
"text": "In addition to the technical differences discussed above, this replication breaks new ground in terms of the language from which the training and test corpora are drawn."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-89",
"text": "Modern Greek differs from English in having only five vowels, generally simpler syllable structures, and a substantial amount of inflectional morphology, particularly at the ends of words."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-90",
"text": "It also contains not only preposed function words (e.g., determiners) but postposed ones as well, such as the possessive pronoun, which cannot appear utterance-initially."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-91",
"text": "For an in-depth discussion of Modern Greek, see Holton et al. (1997)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-92",
"text": "While it is not anticipated that Modern Greek will be substantially more challenging to segment than English, the choice does serve as an additional check on current assumptions."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-93",
"text": "The Stephany corpus (Stephany, 1995) is a database of conversations between children and caretakers, broadly transcribed, currently with no notations for lexical stress, included as part of the CHILDES database (MacWhinney, 2000) ."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-94",
"text": "In order to preserve adequate unseen data for future simulations and experiments, and also to use data most closely approximating children of a very young age, files from the youngest child only were used in this study."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-95",
"text": "However, since the heuristics and cues used are very simple compared to vocabulary-learning models such as Brent's MBDP-1, it is anticipated that they will require relatively little context, and so the small size of the training and testing corpora will not adversely affect the results to a great degree."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-96",
"text": "As in other studies, only adult input was used for training and testing."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-97",
"text": "In addition, non-segmental information such as punctuation, dysfluencies, and parenthetical references to real-world objects was removed."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-98",
"text": "Spaces were taken to represent word boundaries without comment or correction; however, it is worth noting that the transcribers sometimes departed from standard orthographic practice when transcribing certain types of word-clitic combinations."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-99",
"text": "The text also contains a significant number of unrealized vowels, such as [ap] for /apo/ 'from', or [in] or even [n] for /ine/ 'is'."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-100",
"text": "Such variation was not regularized, but treated as part of the learning task."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-101",
"text": "The training corpus contains 367 utterance tokens with a total of 1066 word tokens (319 types)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-102",
"text": "Whereas the average number of words per utterance (2.9) is almost identical to that in the Korman (1984) corpus used by Christiansen et al. (1998) , utterances and words were slightly longer in terms of phonemes (12.8 and 4.4 phonemes respectively, compared to 9.0 and 3.0 in Korman)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-103",
"text": "The test corpus consists of 373 utterance tokens with a total of 980 words (306 types)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-104",
"text": "All utterances were uttered by adults to the same child as in the training corpus."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-105",
"text": "As with the training corpus, dysfluencies, missing words, or other irregularities were removed; the word boundaries were kept as given by the annotators, even when this disagreed with standard orthographic word breaks."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-107",
"text": "**MODEL DESIGN**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-108",
"text": "Used as a solitary cue (as it is in the tests run here), comparison against a global threshold may be implemented within the same framework as Brent's (1999a) TP and MI heuristics."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-109",
"text": "However, it may be implemented within a finite-state framework as well, with equivalent behavior."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-110",
"text": "This section will describe how the 'global comparison' heuristic is modeled within a finite-state framework."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-111",
"text": "While such an implementation is not technically necessary here, one advantage of the finite-state framework is the compositionality of finite state machines, which allows for later composition of this approach with other heuristics depending on other cues, analogous to Christiansen et al. (1998) ."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-112",
"text": "Since the finite-state framework selects the best path over the whole utterance, it also allows for optimization over a sequence of decisions, rather than optimizing each local decision separately."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-113",
"text": "Unlike Belz (1998), where the actual FSM structure (including classes of phonemes that could be grouped onto one arc) was learned, here the structure of each FSM is determined in advance."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-114",
"text": "Only the weight on each arc is derived from data."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-115",
"text": "No attempt is made to combine phonemes to produce more minimal FSMs; each phoneme (and phoneme-pair) is modeled separately."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-116",
"text": "Like Brent (1999a) and indeed most models in the literature, this model assumes (for the sake of convenience and simplicity) that the child hears each segment produced within an utterance without error."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-117",
"text": "This assumption translates into the finite-state domain as a simple acceptor (or equivalently, an identity transducer) over the segment sequence for a given utterance."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-118",
"text": "Word boundaries are inserted by means of a transducer that computes the cost of word boundary insertion from the predictability scores."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-119",
"text": "In the MI model, the cost of inserting a word boundary is proportional to the mutual information."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-120",
"text": "For ease in modeling, this was represented with a finite state transducer with two paths between every pair of phonemes (x,y), with zero-counts modeled with a maximum weight of 99."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-121",
"text": "The direct path, representing a path with no word boundary inserted, costs \u2212MI(x,y), which is positive for bigrams of low predictability (negative MI), where word boundaries are more likely."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-122",
"text": "The other path, representing a word boundary insertion, carries the cost of the global threshold, in this case arbitrarily set to zero (although it could be optimized with held-out data)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-123",
"text": "A small subset of the resulting FST, representing the connections over the alphabet {a,b}, is illustrated in Figure 1, below. The best (least-cost) path over this subset model inserts boundaries between two adjacent a's and two adjacent b's, but not between ab or ba; thus the (non-Greek) string \u2026ababaabbaaa\u2026 would be segmented \u2026ababa#ab#ba#a#a\u2026 by the FSM."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-124",
"text": "The FSM for transitional probability has the same structure as that of MI, but with different weights on each path."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-125",
"text": "For each pair of phonemes xy, the cost for the direct path from x to y is \u2212log (P(y|x) )."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-126",
"text": "The global threshold cost of inserting a word boundary was set (again, arbitrarily) as the negative log of the mean of all TP values."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-127",
"text": "In the two-phoneme subset (shown in Figure 2 ), the only change is that the direct pathway from a to b is now more expensive than the threshold path, so the best path over the FSM will insert word boundaries between a and b as well."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-128",
"text": "Hence our example string \u2026ababaabbaaa\u2026 would be segmented \u2026a#ba#ba#a#b#ba#a#a\u2026 by the FSM."
},
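The behavior of these global-threshold FSTs can be reproduced without finite-state machinery: under a global threshold each bigram's decision is independent, so the best path simply inserts a boundary wherever the bigram's predictability falls below the threshold. The sketch below is an equivalent formulation under stated assumptions (a toy corpus, base-2 logs for MI, and a large negative floor score standing in for the FST's zero-count weight of 99); comparing raw TP against the mean TP is equivalent to comparing the \u2212log costs described above:

```python
import math
from collections import Counter

def bigram_stats(corpus):
    """Precompile MI and TP tables offline from a training corpus
    (a list of utterances with word boundaries already stripped)."""
    unigrams, bigrams = Counter(), Counter()
    for utt in corpus:
        unigrams.update(utt)
        bigrams.update(zip(utt, utt[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    mi, tp = {}, {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        mi[(x, y)] = math.log2(p_xy / (p_x * p_y))  # pointwise MI
        tp[(x, y)] = c / unigrams[x]                # transitional probability P(y|x)
    return mi, tp

def segment_global(utt, score, threshold, missing=-99.0):
    """Insert '#' wherever predictability drops below the global threshold;
    unseen bigrams get a floor score, mirroring the FST's maximum weight."""
    out = [utt[0]]
    for x, y in zip(utt, utt[1:]):
        if score.get((x, y), missing) < threshold:
            out.append('#')
        out.append(y)
    return ''.join(out)

mi, tp = bigram_stats(["abab", "abab"])
# MI(a,b) = log2((4/6) / (1/2 * 1/2)) ~ 1.415; MI(b,a) ~ 0.415
print(segment_global("abab", mi, 1.0))  # ab#ab
```

With an MI threshold of 1.0, only the low-predictability bigram (b,a) is split, so the frequent "word" ab is recovered intact.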
{
"sent_id": "310272015a781b05c42015c0559b18-C001-129",
"text": "(The stranded 'word' #b# would of course be an error, but this problem does not arise in actual Greek input, since two adjacent b's, like all geminate consonants, are ruled out by Greek phonotactics.) The output projection of the best path from the resulting FST was converted back into text and compared to the text of the original utterance."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-130",
"text": "These compositions, best-path projections, and conversions were performed using the AT&T finite-state toolkit (Mohri et al., 1998). In this example, the correct boundaries fall between the pairs (a,T), (s,n), (a,a), and (e,a)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-131",
"text": "Both the mutual information and the transitional probability for the first three of these pairs are above the global mean, so word boundaries are posited under both global models."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-132",
"text": "(Since each of these is also a local maximum, the local models also posit boundaries between these three pairs."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-133",
"text": ") The pair (e,a) is above threshold for MI but not for TP, so the global TP model fails to posit a boundary here."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-134",
"text": "Finally, the two local models posit a number of spurious boundaries at the other local maxima, shown by the italic numbers in the table."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-135",
"text": "The resulting predictions for each model are:"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-136",
"text": "Global MI: #tora#Telis#na#aniksume#afto#\nGlobal TP: #tora#Telis#na#aniksumeafto#\nLocal MI: #tora#Te#lis#na#an#iks#ume#afto#\nLocal TP: #to#ra#Te#lis#na#ani#ks#ume#afto#"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-138",
"text": "**RESULTS**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-139",
"text": "The four model variants (global MI, global TP, local MI, and local TP) were each evaluated on three metrics: word boundaries, word tokens, and word types."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-140",
"text": "Note that the first metric reported, simple boundary placement, considers only utterance-internal word boundaries, rather than including those word boundaries which are detected 'for free' by virtue of being utterance boundaries also."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-141",
"text": "This boundary measure may be more conservative than that reported by other authors, but is easily convertible into other metrics."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-142",
"text": "The second metric, the percentage of word tokens detected, is the same as in Brent (1999a)."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-143",
"text": "In order for a word to be counted as correctly found, three conditions must be met: (a) the word's beginning (left boundary) is correctly detected, (b) the word's ending (right boundary) is correctly detected, and (c) these two are consecutive (i.e., no false boundaries are posited within the word)."
},
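Conditions (a)-(c) can be made concrete with a small scorer. The span-based matching below is our own illustrative formulation of the word-token metric, not code from Brent (1999a); both inputs are assumed to be '#'-delimited segmentations of the same phoneme string:

```python
def word_token_score(gold, pred):
    """Precision/recall over word tokens: a predicted word counts as
    correct iff its left edge (a), right edge (b), and the absence of
    internal boundaries (c) all match the gold segmentation."""
    def spans(seg):
        out, pos = [], 0
        for w in seg.split('#'):
            if w:  # skip empty fields produced by leading/trailing '#'
                out.append((pos, pos + len(w)))
                pos += len(w)
        return out
    g, p = set(spans(gold)), set(spans(pred))
    hits = len(g & p)  # identical (start, end) spans satisfy (a)-(c)
    return (hits / len(p) if p else 0.0,
            hits / len(g) if g else 0.0)

# 'afto' matches exactly; 'aniksume' was split, so neither piece counts.
print(word_token_score("aniksume#afto", "anik#sume#afto"))
```

Representing each word as a (start, end) span makes the three conditions fall out of a single set intersection: matching edges give (a) and (b), and any false internal boundary changes the span so (c) fails automatically.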
{
"sent_id": "310272015a781b05c42015c0559b18-C001-144",
"text": "The last metric (word type) is slightly more conservative than Brent's (1999a) in that the word type must have been actually spoken in the same utterance (not the same block of 500 utterances) in which it was detected to count as a match."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-145",
"text": "This lessens the possibility that a mismatch that happens to be segmentally identical to an actual word (but whose semantic context may not be conducive to learning its correct meaning) is counted as a match."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-146",
"text": "However, this situation is presumably rather rare."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-147",
"text": "Tables 2 and 3 present the results over the test set for both the global and the local comparisons of the predictability statistics proposed by Saffran et al. (1996) and Brent (1999a) ."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-149",
"text": "**COMPARING THE FOUR VARIANTS**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-150",
"text": "The findings here confirm Brent's (1999a) contention that mutual information is a better measure of predictability than is transitional probability, at least for the task of identifying words, not just boundaries."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-151",
"text": "This is particularly true in the global comparison."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-152",
"text": "Transitional probability finds more word boundaries in the 'local comparison' model, but this does not carry over to the task of pulling out the words themselves, which is arguably the infant's main concern."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-153",
"text": "This result should be kept in mind when interpreting or replicating Saffran et al. (1996) or similar studies."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-154",
"text": "While Brent's 'local comparison' heuristic was unable to pull out one-phoneme-long words, as predicted above, this did not adversely affect it as much as anticipated."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-155",
"text": "On the contrary, both the local and global comparison heuristics tended to postulate too many word boundaries, as Brent had observed."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-156",
"text": "This is not necessarily a bad thing for infants, for several reasons."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-157",
"text": "First, infants may have a preference for finding short words, since these will presumably be easier to remember and learn, particularly if the child's phonetic memory is limited."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-158",
"text": "Second, it is probably easier to reject a hypothesized word (for example, on failing to find a consistent semantic cue for it) than to recover a word that was never correctly segmented; hence false positives are less of a problem than false negatives for the child."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-159",
"text": "Third and most importantly, this cue is not likely to operate on its own, but rather as one among many contributing cues."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-160",
"text": "Other cues may act as filters on the boundaries suggested by this cue."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-161",
"text": "One example of this is the distribution of segments before utterance edges, as used by, e.g., Christiansen et al. (1998). However, as far as these results go, the word type metric shows that the finite-state model using a global threshold suffered slightly less from this problem than the local comparison model."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-162",
"text": "For the MI variants, both recall and precision for word type were about 2% higher on the global threshold variant."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-163",
"text": "For transitional probability, the precision of the local and global models was roughly equal, but recall for the global comparison model was 5.5% higher."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-164",
"text": "Not only were the global models better at pulling out a variety of words, but they also managed to learn longer ones (especially the global TP variant), including a few four-syllable words."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-165",
"text": "The local model learned no four-syllable words, and relatively few three-syllable words."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-166",
"text": "The mixed nature of these results suggests that evaluation depends fairly crucially on what performance metric needs to be optimized."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-167",
"text": "This demands stronger prior hypotheses regarding the process and the input needed by a vocabulary-acquiring child."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-168",
"text": "However, it cannot be blindly assumed that children are selecting low points over as short a window as Brent's (1999a) MI and TP models suggest."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-169",
"text": "Quite possibly the best model would involve either a hybrid of local and global comparisons, or a longer window, or even a 'gradient' window where far neighbors count less than near ones in a computed average."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-170",
"text": "However, further speculation on this point is of less importance than considering how this cue interacts with others known experimentally to be salient to infants."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-171",
"text": "Christiansen et al. (1998) and Johnson and Jusczyk (2001) have already begun simulating and testing these interactions in English."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-172",
"text": "However, more work needs to be done to understand better the nature of these interactions cross-linguistically."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-174",
"text": "**FURTHER RESEARCH**"
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-175",
"text": "As mentioned above, one obvious area for future research is the interaction between predictability cues like MI and utterance-final information; this is one of the cue combinations explored in Christiansen et al. (1998) in English."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-176",
"text": "Previous research (Rytting, 2004) examined the role of utterance-final information in Greek, and found that this cue performs better than chance on its own."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-177",
"text": "However, it seems that utterance-final information would be more useful as a filter on the heuristics explored here to restrain them from oversegmenting the utterance."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-178",
"text": "Since nearly all Greek words end in /a/, /e/, /i/, /o/, /u/, /n/, or /s/, just restricting word boundaries to positions after these seven phonemes boosts boundary precision considerably with little effect on recall."
},
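The unweighted filter suggested here is straightforward to sketch: delete any hypothesized boundary whose preceding segment is not one of the seven licit Greek word-final phonemes. The '#'-delimited segmentation format and the all-or-nothing (unweighted) behavior are assumptions for illustration:

```python
GREEK_FINALS = set("aeiouns")  # the seven phonemes that may end a Greek word

def filter_boundaries(seg):
    """Drop any '#' not preceded by a licit word-final phoneme,
    merging the two fragments back into one candidate word."""
    out = []
    for ch in seg:
        if ch == '#' and out and out[-1] not in GREEK_FINALS:
            continue  # phonotactically illegal boundary: remove it
        out.append(ch)
    return ''.join(out)

# The boundary after 'k' is removed; the one after 'a' survives.
print(filter_boundaries("ak#ta#na"))  # akta#na
```

A weighted version, as the following sentence suggests, would instead scale each boundary's cost by the probability of a word ending in that segment rather than vetoing it outright.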
{
"sent_id": "310272015a781b05c42015c0559b18-C001-179",
"text": "Preliminary testing suggests that this filter boosts both precision and recall at the word level."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-180",
"text": "However, a model that incorporates the likelihoods of word boundaries after each of these final segments, properly weighted, may be even more helpful than this simple, unweighted filter."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-181",
"text": "Another fruitful direction is the exploration of prosodic information such as lexical stress."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-182",
"text": "With the exception of a certain class of clitic groups, Greek words have at most one stress."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-183",
"text": "Hence, at least one word boundary must occur between two stressed vowels."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-184",
"text": "Relations between stress and the beginnings and endings of words, while not predicted to be as robust a cue as in English (see e.g., Cutler, 1996) , should also provide useful information, both alone and in combination with segmental cues."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-185",
"text": "Finally, the relationship between these more 'static' cues and the cues that emerge as vocabulary begins to be acquired (as in Brent's main MBDP-1 model and others discussed above) seems not to have received much attention in the literature."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-186",
"text": "As vocabulary is learned, it can help bootstrap these cues by augmenting heuristic cues with actual probabilities derived from its parses."
},
{
"sent_id": "310272015a781b05c42015c0559b18-C001-187",
"text": "Hence, the combination of, e.g., MBDP-1 and these heuristics may prove more powerful than either approach alone."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-2",
"310272015a781b05c42015c0559b18-C001-5",
"310272015a781b05c42015c0559b18-C001-6"
],
[
"310272015a781b05c42015c0559b18-C001-7",
"310272015a781b05c42015c0559b18-C001-8"
],
[
"310272015a781b05c42015c0559b18-C001-34",
"310272015a781b05c42015c0559b18-C001-35"
],
[
"310272015a781b05c42015c0559b18-C001-40",
"310272015a781b05c42015c0559b18-C001-41",
"310272015a781b05c42015c0559b18-C001-42"
],
[
"310272015a781b05c42015c0559b18-C001-43",
"310272015a781b05c42015c0559b18-C001-44"
],
[
"310272015a781b05c42015c0559b18-C001-47",
"310272015a781b05c42015c0559b18-C001-48",
"310272015a781b05c42015c0559b18-C001-49",
"310272015a781b05c42015c0559b18-C001-50"
],
[
"310272015a781b05c42015c0559b18-C001-54",
"310272015a781b05c42015c0559b18-C001-55",
"310272015a781b05c42015c0559b18-C001-56",
"310272015a781b05c42015c0559b18-C001-57",
"310272015a781b05c42015c0559b18-C001-58"
],
[
"310272015a781b05c42015c0559b18-C001-59",
"310272015a781b05c42015c0559b18-C001-60",
"310272015a781b05c42015c0559b18-C001-61",
"310272015a781b05c42015c0559b18-C001-62",
"310272015a781b05c42015c0559b18-C001-63",
"310272015a781b05c42015c0559b18-C001-64"
],
[
"310272015a781b05c42015c0559b18-C001-65"
],
[
"310272015a781b05c42015c0559b18-C001-67",
"310272015a781b05c42015c0559b18-C001-68",
"310272015a781b05c42015c0559b18-C001-69",
"310272015a781b05c42015c0559b18-C001-70",
"310272015a781b05c42015c0559b18-C001-71",
"310272015a781b05c42015c0559b18-C001-72",
"310272015a781b05c42015c0559b18-C001-73",
"310272015a781b05c42015c0559b18-C001-74",
"310272015a781b05c42015c0559b18-C001-75",
"310272015a781b05c42015c0559b18-C001-76"
],
[
"310272015a781b05c42015c0559b18-C001-81",
"310272015a781b05c42015c0559b18-C001-82"
],
[
"310272015a781b05c42015c0559b18-C001-83"
],
[
"310272015a781b05c42015c0559b18-C001-84",
"310272015a781b05c42015c0559b18-C001-85"
],
[
"310272015a781b05c42015c0559b18-C001-112",
"310272015a781b05c42015c0559b18-C001-113",
"310272015a781b05c42015c0559b18-C001-114",
"310272015a781b05c42015c0559b18-C001-115",
"310272015a781b05c42015c0559b18-C001-116",
"310272015a781b05c42015c0559b18-C001-117"
],
[
"310272015a781b05c42015c0559b18-C001-147"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-5",
"310272015a781b05c42015c0559b18-C001-7",
"310272015a781b05c42015c0559b18-C001-34",
"310272015a781b05c42015c0559b18-C001-48",
"310272015a781b05c42015c0559b18-C001-58",
"310272015a781b05c42015c0559b18-C001-64",
"310272015a781b05c42015c0559b18-C001-81",
"310272015a781b05c42015c0559b18-C001-83",
"310272015a781b05c42015c0559b18-C001-84",
"310272015a781b05c42015c0559b18-C001-116",
"310272015a781b05c42015c0559b18-C001-147"
]
},
"@SIM@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-7",
"310272015a781b05c42015c0559b18-C001-8"
],
[
"310272015a781b05c42015c0559b18-C001-84",
"310272015a781b05c42015c0559b18-C001-85"
],
[
"310272015a781b05c42015c0559b18-C001-142"
],
[
"310272015a781b05c42015c0559b18-C001-150"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-7",
"310272015a781b05c42015c0559b18-C001-84",
"310272015a781b05c42015c0559b18-C001-142",
"310272015a781b05c42015c0559b18-C001-150"
]
},
"@DIF@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-7",
"310272015a781b05c42015c0559b18-C001-8"
],
[
"310272015a781b05c42015c0559b18-C001-81",
"310272015a781b05c42015c0559b18-C001-82"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-7",
"310272015a781b05c42015c0559b18-C001-81"
]
},
"@MOT@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-47",
"310272015a781b05c42015c0559b18-C001-48",
"310272015a781b05c42015c0559b18-C001-49",
"310272015a781b05c42015c0559b18-C001-50"
],
[
"310272015a781b05c42015c0559b18-C001-54",
"310272015a781b05c42015c0559b18-C001-55",
"310272015a781b05c42015c0559b18-C001-56",
"310272015a781b05c42015c0559b18-C001-57",
"310272015a781b05c42015c0559b18-C001-58"
],
[
"310272015a781b05c42015c0559b18-C001-59",
"310272015a781b05c42015c0559b18-C001-60",
"310272015a781b05c42015c0559b18-C001-61",
"310272015a781b05c42015c0559b18-C001-62",
"310272015a781b05c42015c0559b18-C001-63",
"310272015a781b05c42015c0559b18-C001-64"
],
[
"310272015a781b05c42015c0559b18-C001-67",
"310272015a781b05c42015c0559b18-C001-68",
"310272015a781b05c42015c0559b18-C001-69",
"310272015a781b05c42015c0559b18-C001-70",
"310272015a781b05c42015c0559b18-C001-71",
"310272015a781b05c42015c0559b18-C001-72",
"310272015a781b05c42015c0559b18-C001-73",
"310272015a781b05c42015c0559b18-C001-74",
"310272015a781b05c42015c0559b18-C001-75",
"310272015a781b05c42015c0559b18-C001-76"
],
[
"310272015a781b05c42015c0559b18-C001-112",
"310272015a781b05c42015c0559b18-C001-113",
"310272015a781b05c42015c0559b18-C001-114",
"310272015a781b05c42015c0559b18-C001-115",
"310272015a781b05c42015c0559b18-C001-116",
"310272015a781b05c42015c0559b18-C001-117"
],
[
"310272015a781b05c42015c0559b18-C001-144",
"310272015a781b05c42015c0559b18-C001-145",
"310272015a781b05c42015c0559b18-C001-146"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-48",
"310272015a781b05c42015c0559b18-C001-58",
"310272015a781b05c42015c0559b18-C001-64",
"310272015a781b05c42015c0559b18-C001-116",
"310272015a781b05c42015c0559b18-C001-144"
]
},
"@USE@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-112",
"310272015a781b05c42015c0559b18-C001-113",
"310272015a781b05c42015c0559b18-C001-114",
"310272015a781b05c42015c0559b18-C001-115",
"310272015a781b05c42015c0559b18-C001-116",
"310272015a781b05c42015c0559b18-C001-117"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-116"
]
},
"@EXT@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-144",
"310272015a781b05c42015c0559b18-C001-145",
"310272015a781b05c42015c0559b18-C001-146"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-144"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"310272015a781b05c42015c0559b18-C001-167",
"310272015a781b05c42015c0559b18-C001-168",
"310272015a781b05c42015c0559b18-C001-169"
]
],
"cite_sentences": [
"310272015a781b05c42015c0559b18-C001-168"
]
}
}
},
"ABC_3395c9ed8ad9f2d048bf8ebf950d16_6": {
"x": [
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-2",
"text": "Pragmatic reasoning allows humans to go beyond the literal meaning when interpreting language in context."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-3",
"text": "Previous work has shown that such reasoning can improve the performance of already-trained language understanding systems."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-4",
"text": "Here, we explore whether pragmatic reasoning during training can improve the quality of learned meanings."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-5",
"text": "Our experiments on reference game data show that end-to-end pragmatic training produces more accurate utterance interpretation models, especially when data is sparse and language is complex."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-8",
"text": "We often draw pragmatic inferences about a speaker's intentions from what they choose to say, but also from what they choose not to say in context."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-9",
"text": "This pragmatic reasoning arises from listeners' inferences based on speakers' cooperativity (Grice, 1975) , and prior work has observed that such reasoning enables human children to more quickly learn word meanings (Frank and Goodman, 2014) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-10",
"text": "This suggests that pragmatic reasoning might allow modern neural network models to more efficiently learn on grounded language data from cooperative reference games."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-11",
"text": "As a motivating case, consider an instance of the color reference task from Monroe et al. (2017) shown in the first row of Table 1 ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-12",
"text": "In this task, a speaker communicates a target color to a listener in a context containing two distractor colors; the listener picks out the target based on what the speaker says."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-13",
"text": "In the first instance from Table 1 , the speaker utters \"dark blue\" to describe the target."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-14",
"text": "Whereas \"dark\" and \"blue\" also apply to the target, they lose their informativity in the presence of the distractors, and so the speaker pragmatically opts for \"dark blue\"."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-15",
"text": "A listener who is learning the language from such examples might draw several inferences from the speaker's utterance."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-16",
"text": "First, under the assumption that the speaker is informative, a \"literal\" learner might infer that \"dark blue\" applies to the target shade more than the distractors."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-17",
"text": "Second, a \"pragmatic\" learner might consider the cheaper alternatives-\"dark\" and \"blue\"-that have occurred in the presence of the same target in prior contexts, and infer that these alternative utterances must also apply to the distractors given the speaker's failure to use them."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-18",
"text": "The pragmatic learner might thus gain more semantic knowledge from the same training instances than the literal learner: pragmatic reasoning can reduce the data complexity of learning."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-19",
"text": "The pragmatic learning effects just described depend on the existence of low cost alternative utterances that the learner already knows can apply to the target object."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-20",
"text": "The existence of short alternatives will be more likely when the target objects are more complex (as in row 2 of Table 1), because these objects require longer utterances (with therefore more short alternatives) to individuate."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-21",
"text": "Thus, we further hypothesize that pragmatic inference will reduce data complexity especially in contexts that elicit more complex language."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-22",
"text": "In light of these arguments, we leverage the pragmatic inference described here in training neural network models to play reference games."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-23",
"text": "For formal, probabilistic representations of contextual reasoning in our training objectives, we embed neural language models within pragmatic listener and speaker distributions, as specified by the Rational Speech Acts (RSA) framework (Goodman and Frank, 2016; Frank and Goodman, 2012) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-24",
"text": "Pragmatic inference allows our models to learn from indirect pragmatic evidence of the sort described above, yielding better calibrated, context-sensitive models and more efficient use Target Distractors Utterance Cheaper Alternative Utterances 1. x x x \"dark blue\" \"blue\", \"dark\". . ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-26",
"text": "**X X X**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-27",
"text": "x x x x x x \"left dark blue\" \"dark blue\", \"left dark\", \"right black\". . ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-28",
"text": "Table 1 : Speaker utterances describing (1) colors and (2) color grids to differentiate them from distractors."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-29",
"text": "A learner might draw inferences about fine-grained linguistic distinctions by explaining the speaker's failure to use cheaper alternatives in context (e.g. they might infer that \"blue\" and \"dark\" apply to some distractors in 1)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-30",
"text": "These inferences have the potential to increase in number and in strength as dimensionality of the referents and utterance complexity increase (as in 2)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-31",
"text": "of the training data."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-32",
"text": "We compare pragmatic and non-pragmatic models at training and at test, while varying conditions on the training data to test hypotheses regarding the utility of pragmatic inference for learning."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-33",
"text": "In particular, we show that incorporating pragmatic reasoning at training time yields improved, state-of-the-art accuracy for listener models on the color reference task from Monroe et al. (2017) , and the effect demonstrated by this improvement is especially large under small training data sizes."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-34",
"text": "We further introduce a new color-grid reference task and data set consisting of higher dimensional objects and more complex speaker language; we find that the effect of pragmatic listener training is even larger in this setting."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-36",
"text": "**RELATED WORK**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-37",
"text": "Prior work has shown that neural network models trained to capture the meanings of utterances can be improved using pragmatic reasoning at test time via the RSA framework (Andreas and Klein, 2016; Monroe et al., 2017; Goodman and Frank, 2016; Frank and Goodman, 2012) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-38",
"text": "For instance, Monroe et al. (2017) train context-agnostic (i.e. non-pragmatic) neural network models to learn the meanings of color utterances using a corpus of examples of the form shown in the first line of Table 1 ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-39",
"text": "At evaluation, they add an RSA layer on top of the trained model to draw pragmatic, context-sensitive inferences about intended color referents."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-40",
"text": "Other related work explores additional approaches to create context-aware models that generate color descriptions (Meo et al., 2014) , image captions (Vedantam et al., 2017) , spatial references (Golland et al., 2010) , and utterances in simple reference games (Andreas and Klein, 2016) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-41",
"text": "Each of these shows that adding pragmatics at test time improves performance on tasks where context is relevant."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-42",
"text": "Whereas this prior work showed the effectiveness of pragmatic inferences for models trained non-pragmatically, our current work shows that these pragmatic inferences can also inform the training procedure, providing additional gains in performance."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-43",
"text": "More similar to our work, Monroe and Potts (2015) improve model performance by incorporating pragmatic reasoning into the learning procedure for an RSA pragmatic speaker model."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-44",
"text": "However, in contrast to our work, they consider a much simpler corpus, and a simple non-neural semantics."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-45",
"text": "We consider richer corpora with sequential utterances and continuous referent objects that pose several algorithmic challenges which we solve using neural networks and Monte Carlo methods."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-47",
"text": "**APPROACH**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-48",
"text": "We compare neural nets trained pragmatically and non-pragmatically on a new color-grid reference game corpus as well as the color reference corpus from Monroe et al. (2017) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-49",
"text": "In this section, we describe our tasks and models."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-51",
"text": "**REFERENCE GAME LISTENER TASKS**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-52",
"text": "The color reference game from Monroe et al. (2017) consists of rounds played between a speaker and a listener."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-53",
"text": "Each round has a context of two distractors and a target color (Figure 1a )."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-54",
"text": "Only the speaker knows the target, and must communicate it to the listener-who must pick out the target based on the speaker's English utterance."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-55",
"text": "Similarly, each round of our new color-grid reference game contains target and distractor color-grid objects, and the speaker must communicate the target grid to the listener (Figure 1b) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-56",
"text": "We train neural network models to play the listener role in these games."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-58",
"text": "**MODELS**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-59",
"text": "In both reference games, our listener models reason about a round r represented by a single train- ing/testing example of the form (O (r) , U (r) , t (r) ) where O (r) is the set of objects observed in the round (colors or color-grids), U (r) is a sequence of utterances produced by the speaker about the target (represented as a token sequence), and t (r) is the target index in O (r) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-60",
"text": "The models predict the most likely referent O (r) t of an utterance within a context O (r) according to an RSA listener distribution l(t (r) | U (r) , O (r) ) over targets given the utterances and a context."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-61",
"text": "In pragmatic models, a nested structure allows the listener to form its beliefs about the intended referent by reasoning recursively about speaker intentions with respect to a hypothetical \"literal\" (non-pragmatic) listener's interpretations of utterances."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-62",
"text": "This recursive reasoning allows listener models to account for the speaker's context-sensitive, pragmatic adjustments to the semantic content of utterances."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-63",
"text": "Formally, our pragmatic RSA model l 1 , with learnable semantic parameters \u03b8, for target referent t, given an observed context O and speaker utterances U , is computed as:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-64",
"text": "In these equations, the top-level l 1 listener model estimates the target referent by computing a pragmatic speaker s 1 and a target prior p(t)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-65",
"text": "Similarly, the pragmatic speaker s 1 computes an utterance distribution with respect to a literal listener l 0 , an utterance prior p(U | O), and a rationality parameter \u03b1."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-66",
"text": "Finally, the \"literal\" listener computes its expectation about the target referent from the target prior p(t) and the literal meaning, L \u03b8 U,Ot , which captures the extent to which utterance U applies to O t ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-67",
"text": "In both the l 0 and l 1 distributions, we take p(t) to be a uniform distribution over target indices."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-68",
"text": "Literal meanings The literal meanings L \u03b8 U,Ot in l 0 are computed by an LSTM (Hochreiter and Schmidhuber, 1997 ) that takes an input utterance and an object (color or color-grid), and produces output in the interval (0, 1) representing the degree to which the utterance applies to the object (see Figure 2b )."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-69",
"text": "The object is represented as a single continuous vector, and is mapped into the initial hidden state of the LSTM by a dense linear layer in the case of colors, and an averagepooled, convolutional layer in the case of grids (with weights shared across the grid-cell representations described in Section 4.1.2)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-70",
"text": "Given the initialized hidden state, the LSTM runs over embeddings of the tokens of an utterance."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-71",
"text": "The final hidden state is passed through an affine layer, and squished by a sigmoid to produce output in (0, 1)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-72",
"text": "This neural net contains all learnable parameters \u03b8 of our listeners."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-73",
"text": "Utterance prior The utterance prior p(U | O) in s 1 is a non-uniform distribution over sequences of English tokens-represented by a pre-trained LSTM language model conditioned on an input color or grid (see Figure 2a )."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-74",
"text": "Similar to the literal meaning LSTM, we apply a linear transformation to the input object to initialize the LSTM hidden state."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-75",
"text": "Then, each step of the LSTM applies to and outputs successive tokens of an utterance."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-76",
"text": "In addition, when operating over grid inputs, we apply a layer of multiplicative attention given by the \"general\" scoring function in (Luong et al., 2015) between the LSTM output and the convolutional grid output before the final Softmax."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-77",
"text": "This allows the language model to \"attend\" to individual grid cells when producing output tokens, yielding an improvement in utterance prior sample quality."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-78",
"text": "The language model is pre-trained over speaker utterances paired with targets, but the support of the distribution encoded by this LSTM is too large for the s 1 normalization term within the RSA listener to be computed efficiently."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-79",
"text": "Similar to Monroe et al. (2017), we resolve this issue by taking a small set of samples from the pre-trained LSTM applied to each object in a context, to approximate p(U | O), each time l 1 is computed during training and evaluation."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-81",
"text": "**LEARNING**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-82",
"text": "The full l 1 neural RSA architecture for computing pragmatic predictions over batches of input utterances and contexts is given by Algorithm 1. 1 During training, we backpropagate gradients through the full architecture, including the RSA layers, and optimize the pragmatic likelihood max \u03b8 log l 1 (t | U, O; \u03b8)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-83",
"text": "For clarity, we can rewrite this optimization problem for a single (O, U, t) training example in the following simplified form by manipulating the RSA distributional equations from the previous section:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-84",
"text": "Here, Z l 1 , Z s 0 , and Z l 0 are the normalization terms in the denominators of the nested RSA distributions, which we can rewrite using the log-sum-exp function (LSE) as:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-85",
"text": "Given this representation of the optimization problem, we can see its relationship to the intuitive characterization of pragmatic learning that we gave in the introduction."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-86",
"text": "First, the two terms log L \u03b8 U,Ot \u2212 log Z l 0 (U | O; \u03b8) can be seen as finding the optimal non-pragmatic parameters; the first log L \u03b8 U,Ot term upweights the model's estimate of the literal applicability of the observed U to its intended target referent, and the \u2212 log Z l 0 (U | O; \u03b8) term maximizes the margin between this estimate and the applicability of U to the contextual distractors."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-87",
"text": "2 Next, the \u2212 log Z s 1 (t | O; \u03b8) term makes pragmatic adjustments to the parameter estimates by enforcing a margin between the l 0 predictions given by low cost alternatives U and the observed utterance U on a referent object t. The enforcement of this margin pushes L \u03b8 U ,O t upward for distractors t , simulating the pragmatic reasoning described in the introduction, and drawing additional information about the low cost alternative utterances from their omission in context."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-88",
"text": "Finally, the \u2212 log Z l 1 (U | O; \u03b8) term enforces a margin between the speaker prediction s 1 (U | t, O; \u03b8) and predictions on the true utterance U given distractors O t ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-89",
"text": "This ensures that the true utterance is down-weighted on distractor objects following the speaker's pragmatic adjustments, such that our l 1 listener predictions are well-calibrated with respect to the s 1 distribution's cost-sensitive adjustments learned through \u2212 log Z s 1 (t | O; \u03b8)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-91",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-92",
"text": "We investigate the value of pragmatic training by estimating the parameters \u03b8 in the RSA \"literal meaning\" function L \u03b8 for l 1 (pragmatic) and l 0 (non-pragmatic) distributions according to the maximal likelihood of the training data for the color and grid reference tasks."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-93",
"text": "We then evaluate meanings L \u03b8 from each training procedure using pragmatic l 1 inference (and non-pragmatic l 0 inference, for completeness)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-94",
"text": "We perform this comparison repeatedly to evaluate the value of pragmatics at training and test under various data conditions."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-95",
"text": "In particular, we evaluate the hypotheses that (1) the pragmatic inferences enabled by the l 1 training will reduce sample complexity, leading to more accurate meaning functions especially under small data sizes, and (2) the effectiveness Algorithm 1 RSA pragmatic listener (l 1 ) neural network forward computation."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-96",
"text": "The l 1 function is applied to batches of input utterances and observed contexts, and produces batches of distributions over objects in the contexts, representing the listener's beliefs about intended utterance referents."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-97",
"text": "1: b \u2190 data batch size 2: l \u2190 maximum utterance length 3: k \u2190 number of objects per context (i.e. colors or color-grids) 4: d \u2190 dimension of each object 5: u \u2190 number utterances to sample per object in context to make speaker distribution supports 6: z \u2190 ku + 1 number of utterances in each support including input utterance 7: s0 \u2190 pre-trained LSTM language model (Figure 2a ) 8: L \u2190 LSTM meaning function architecture (Figure 2b ) 9: function l1(utterances U \u2208 R b\u00d7l , observations O \u2208 R b\u00d7k\u00d7d ) 10:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-98",
"text": "Pt \u2190 (S = (0, . . . , k \u2212 1)"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-99",
"text": "target distributions conditioned on utterances in U 13: 14: function s1(possible targets T \u2208 R b\u00d7k , observations O \u2208 R b\u00d7k\u00d7d , fixed input utterances U \u2208 R b\u00d7l ) 15:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-100",
"text": "Putt \u2190 SAMPLE-UTTERANCE-PRIORS(U, O) sample batch of utterance priors of size b \u00d7 z 16:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-101",
"text": "initialize supports and probabilities in utterance prior tensor 26:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-102",
"text": "for i = 1 to b do for each round in batch 27:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-103",
"text": "for j = 1 to k do for each object in a round 28:"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-104",
"text": "Sample Monroe et al. (2017) )."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-105",
"text": "Both architectures apply a tanh layer to an input object o (a grid or color), and use the result as the initial hidden state of an LSTM layer."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-106",
"text": "In each case, the LSTM operates over embeddings of tokens u 1 , u 2 , . . . from utterance U ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-107",
"text": "of the l 1 training over l 0 training will increase on a more difficult reference game task containing higher-dimensional objects and utterancesi.e."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-108",
"text": "pragmatic training will help more in the grids task than in the colors task."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-110",
"text": "**DATA**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-112",
"text": "**COLOR REFERENCE**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-113",
"text": "For the color reference task, we use the data collected by Monroe et al. (2017) from human play on the color reference task through Amazon Mechanical Turk using the framework of Hawkins (2015)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-114",
"text": "Each game consists of 50 rounds played by a human speaker and listener."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-115",
"text": "In each round, the speaker describes a target color surrounded by a context of two other distractor colors, and a listener clicks on the targets based on the speaker's description (see Figure 1a) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-116",
"text": "The resulting data consists of 46, 994 rounds across 948 games, where the colors of some rounds are sampled to be more likely to require pragmatic reasoning than others."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-117",
"text": "In particular, 15, 516 trials are close with both distractors within a small distance to the target color in RGB space, 15, 782 are far with both distractors far from the target, and 15, 693 are split with one distractor near the target and one far from the target."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-118",
"text": "For model development, we use the train/dev/test split from Monroe et al. (2017) with 15, 665 training, 15, 670 dev, and 15, 659 test rounds."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-119",
"text": "Within our models, we represent color objects using a 3-dimensional CIELAB color spacenormalized so that the values of each dimension are in [\u22121, 1]."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-120",
"text": "Our use of the CIELAB color space departs from prior work on the color data which used a 54-dimensional Fourier space (Monroe et al., 2017 (Monroe et al., , 2016 Zhang and Lu, 2002) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-121",
"text": "We found that both the CIELAB and Fourier spaces gave similar model performance, so we chose the CIELAB space due to its smaller dimensionality."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-122",
"text": "Our speaker utterances are represented as sequences of cleaned English token strings."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-123",
"text": "Following Monroe et al. (2017) , we preprocess the tokens by lowercasing, splitting off punctuation, and replacing tokens that appear only once with [unk] ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-124",
"text": "In the color data, we also follow the prior work and split off the -er, -est, and -ish suffixes."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-125",
"text": "Whereas the prior work concatenated speaker messages into a single utterance without limit, we limit the full sequence length to 40 tokens for efficiency."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-127",
"text": "**GRID REFERENCE**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-128",
"text": "Because initial simulations suggested that pragmatic training would be more valuable in more complex domains (where data sparsity is a greater issue), we collected a new data set from human play on the color-grid reference task described in Section 3.1."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-129",
"text": "Data was collected on Amazon Mechanical Turk using an open source framework for collaborative games (Hawkins, 2015) ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-130",
"text": "Each game consists of 60 rounds played between a human speaker and listener, where the speaker describes a target grid in the presence of two distractor grids (see Figure 1b) , resulting in a data set of 10,666 rounds spread across 197 games."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-131",
"text": "Each round consists of three 3 \u00d7 3 grid objects, with the grid colors at each cell location sampled according to the same close, split, and far conditions as in the color reference data, yielding 3,575 close, 3,549 far, and 3,542 split rounds."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-132",
"text": "We also varied the number of cells that differ between objects in a round from 1 to 9."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-133",
"text": "As shown in Figure 3 , these grid trials result in more complex speaker utterances than the color data."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-134",
"text": "We partitioned this data into 158 train, 21 dev, 18 test games containing 8,453 training, 1,236 dev, and 977 test rounds."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-135",
"text": "In our models, we represent a single color-grid object from the data as a concatenation of 9 vectors representing the 9 grid cells."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-136",
"text": "Each of the 9 cell vectors consists of the normalized CIELAB representation used in the color data appended to a one-hot vector representing the position of the cell within the grid."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-137",
"text": "For speaker utterances, we use the same representation as in the color data, except that we do not split off the -er, -est, and -ish endings."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-139",
"text": "**MODEL TRAINING DETAILS**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-140",
"text": "We implement our models in PyTorch (Paszke et al., 2017), and train them using the Adam variant of stochastic gradient descent (Kingma and Ba, 2015) with default parameters (\u03b21, \u03b22) = (0.9, 0.999) and \u03b5 = 10^-8."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-141",
"text": "We train with early stopping based on dev-set evaluations of log-likelihood (for speakers) or accuracy (for listeners)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-142",
"text": "Before training our listeners, we pre-train an LSTM language model to provide samples for the utterance priors; it is trained on target colors paired with speaker utterances of length at most 12, restricted to examples where human listeners picked the correct color."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-143",
"text": "We follow Monroe et al. (2017) for language model hyper-parameters, with embedding and LSTM layers of size 100."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-144",
"text": "Also following this prior work, we use a learning rate of 0.004, batch size 128, and apply 0.5 dropout to each layer."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-145",
"text": "We train for 7,000 iterations, evaluating the model's accuracy on the dev set every 100 iterations."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-146",
"text": "We pick the model with the best dev set log-likelihood from evaluations at 100 iteration intervals."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-147",
"text": "To train and compare various listeners, we optimize likelihoods under non-pragmatic l 0 and pragmatic l 1 with a literal meaning function computed by the LSTM architecture described in Section 3.2, sampling new utterance priors for each mini-batch from our pre-trained language model applied to the round's three colors for use within the s 1 module of RSA (see Algorithm 1)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-148",
"text": "We draw 30 samples per round (10 per color or grid) at a maximum length of 12."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-149",
"text": "We generally use speaker rationality \u03b1 = 8.0 based on dev set tuning, and we follow Monroe et al. (2017) for other hyper-parameters, with an embedding size of 100 and an LSTM size of 100 in our meaning functions."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-150",
"text": "Also following this prior work, we allow the LSTM to be bidirectional with learning rate of 0.005, batch size 128, and gradient clipping at 5.0."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-151",
"text": "We train listeners for 10,000 iterations on the color data and 15,000 iterations on the grid data, evaluating dev set accuracy every 500 iterations."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-152",
"text": "We pick the model with the best accuracy from those evaluated at 500 iteration intervals."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-154",
"text": "**RESULTS**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-156",
"text": "**COLOR REFERENCE**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-157",
"text": "The accuracies of color target predictions by l 0 and l 1 models under both l 0 and l 1 training are shown in the left columns of Table 2 ."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-158",
"text": "For robustness, average accuracies and standard errors were computed by repeatedly retraining and evaluating with different weight initializations and training data orderings using 4 different random seeds."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-159",
"text": "The results in the top left panel of Table 2 show that l 1 pragmatic training coupled with l 1 pragmatic evaluation gives the best average accuracies."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-160",
"text": "The previously studied l 1 pragmatic usage with l 0 non-pragmatic training is next best."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-161",
"text": "These results provide evidence that literal meanings estimated through pragmatic training are better calibrated for pragmatic usage than meanings estimated through non-pragmatic l 0 training."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-162",
"text": "Furthermore, relative to the state of the art in Monroe et al. (2017), Table 2 shows that our pragmatically trained model yields improved accuracy over their best \"blended\" pragmatic L_e model, which computed predictions from the product of two separate non-pragmatically trained models."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-163",
"text": "The effect sizes are small for the pragmatic-to-non-pragmatic comparisons when training on the full color data (though accuracies approach the human ceiling of 0.9108), but we hypothesized that the effect of pragmatic training would increase when training with smaller data sizes (as motivated by arguments in the introduction)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-164",
"text": "To test this, we trained the listener models on smaller subsets of the training data, and evaluated accuracy."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-165",
"text": "As shown by the top left plot of Figure 4 , pragmatic training results in a larger gain in accuracy when less data is available for training."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-166",
"text": "Lastly, we also considered the effect of pragmatic training under the varying close, split, and far data conditions."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-167",
"text": "As shown in the three plots at the right of the top row of Figure 4 , the effect of l 1 training over l 0 is especially pronounced for inferences on close and split data conditions where the target is more similar to the distractors, and the language is more context-dependent."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-168",
"text": "This makes sense, as these conditions contain examples where the pragmatic, cost-sensitive adjustments to the learned meanings would be the most necessary."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-170",
"text": "**GRID REFERENCE**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-171",
"text": "For the more complex grid reference task, the listener accuracies in the right columns of Table 2 show an even larger gain from pragmatic l 1 training, and no gain is seen for pragmatic evaluation with non-pragmatic training."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-172",
"text": "This result is consistent with the hypothesis, motivated by arguments in the introduction, that pragmatic training should be more effective in contexts containing targets and distractors for which many low-cost alternative utterances are applicable."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-173",
"text": "Furthermore, the grid-reference data-complexity exploration in the bottom row of Figure 4 shows that this improvement given by pragmatic training remains large across data sizes; the exception is the smallest amount of training data under the most difficult close condition, where the language is so sparse that meanings may be difficult to estimate, even with pragmatic adjustments."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-174",
"text": "Altogether, these results suggest that pragmatic training helps with an intermediate amount of data relative to the domain complexity: with too little data, pragmatics has no signal to work with, but with too much data, the indirect evidence provided by pragmatics is less necessary."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-175",
"text": "Since real-world linguistic contexts are more complex than either of our experimental domains, we hypothesize that they often fit into this intermediate data regime."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-177",
"text": "**LITERAL MEANING COMPARISONS**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-178",
"text": "To improve our understanding of the quantitative results, we also investigate qualitative differences between meaning functions L \u03b8 estimated under"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-179",
"text": "l 0 and l 1 on the color reference task."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-180",
"text": "Table 3 shows representations of these meaning functions for several utterances."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-181",
"text": "For each utterance U, we plot the extension L \u03b8 U estimated under l 0 and l 1, with the darkness of a pixel at c representing L \u03b8 U,c, the degree to which the utterance U applies to a color c within a Hue \u00d7 Saturation color space."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-182",
"text": "In these plots, the larger areas of medium gray shades for l 1 extensions suggest that the pragmatic training yields more permissive interpretations for a given utterance."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-183",
"text": "This makes sense, as pragmatics allows looser meanings to be effectively tightened at interpretation time."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-184",
"text": "Furthermore, the meanings learned by the l 1 also have lower curvature across the color space, consistent with a view of pragmatics as providing a regularizer (Section 3.3) that prevents overfitting."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-185",
"text": "This view is further supported by the plots on the right-hand side of Table 3, which show that the meanings learned by l 0 from smaller amounts of training data tend to overfit to idiosyncratic regions of the color space, whereas the pragmatic l 1 training tends to smooth out these irregularities."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-186",
"text": "These qualitative observations are also consistent with the data complexity results shown in Figure 4 , where the l 1 training gives an especially large improvement over l 0 for small data sizes."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-187",
"text": "----------------------------------"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-188",
"text": "**CONCLUSION**"
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-189",
"text": "Our experiments provide evidence that using pragmatic reasoning during training can yield improved neural semantics models."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-190",
"text": "This was true in the existing color reference corpus, where we achieved state-of-the-art results, and even more so in the new color-grid corpus."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-191",
"text": "We thus found that pragmatic training is more effective when data is relatively sparse and the domain yields complex, high-cost utterances and low-cost omissions over which pragmatic inferences might proceed."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-192",
"text": "Future work should further explore the data regime in which pragmatic learning is most beneficial and its correspondence to real-world language use."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-193",
"text": "This might include scaling with linguistic complexity and properties of referents."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-194",
"text": "In particular, the argument in our introduction suggests that especially frequent objects and low-cost utterances are the seed from which pragmatic inference can proceed over more complex language and infrequent objects."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-195",
"text": "This asymmetry in object reference rates is expected in long-tail, real-world regimes consistent with Zipf's law (Zipf, 1949)."
},
{
"sent_id": "3395c9ed8ad9f2d048bf8ebf950d16-C001-196",
"text": "Overall, we have shown that pragmatic reasoning regarding alternative utterances provides a useful inductive bias for learning in grounded language understanding systems-leveraging inferences over what speakers choose not to say to reduce the data complexity of learning."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-10",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-11",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-12",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-13",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-14",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-15",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-16",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-17",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-18",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-8"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-32",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-33",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-34"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-37"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-38",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-39"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-48"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-52",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-53",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-54",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-55",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-56"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-78",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-79"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-113",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-114",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-115",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-116",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-117"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-118",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-119",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-120"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-161",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-162"
]
],
"cite_sentences": [
"3395c9ed8ad9f2d048bf8ebf950d16-C001-11",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-33",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-37",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-38",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-48",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-52",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-79",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-113",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-118",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-120",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-162"
]
},
"@DIF@": {
"gold_contexts": [
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-32",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-33",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-34"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-161",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-162"
]
],
"cite_sentences": [
"3395c9ed8ad9f2d048bf8ebf950d16-C001-33",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-162"
]
},
"@SIM@": {
"gold_contexts": [
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-78",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-79"
]
],
"cite_sentences": [
"3395c9ed8ad9f2d048bf8ebf950d16-C001-79"
]
},
"@USE@": {
"gold_contexts": [
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-104"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-113",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-114",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-115",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-116",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-117"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-143",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-144",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-145"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-149",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-150",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-151",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-152"
]
],
"cite_sentences": [
"3395c9ed8ad9f2d048bf8ebf950d16-C001-104",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-113",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-143",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-149"
]
},
"@EXT@": {
"gold_contexts": [
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-118",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-119",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-120"
],
[
"3395c9ed8ad9f2d048bf8ebf950d16-C001-123",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-124"
]
],
"cite_sentences": [
"3395c9ed8ad9f2d048bf8ebf950d16-C001-118",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-120",
"3395c9ed8ad9f2d048bf8ebf950d16-C001-123"
]
}
}
},
"ABC_a3dbc3362016cdcfc0c4da429b98cc_6": {
"x": [
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-2",
"text": "Although Seq2Seq models for table-to-text generation have achieved remarkable progress, modeling table representation in one dimension is inadequate."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-3",
"text": "This is because (1) the table consists of multiple rows and columns, which means that encoding a table should not depend only on one dimensional sequence or set of records and (2) most of the tables are time series data (e.g. NBA game data, stock market data), which means that the description of the current table may be affected by its historical data."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-4",
"text": "To address aforementioned problems, not only do we model each table cell considering other records in the same row, we also enrich table's representation by modeling each table cell in context of other cells in the same column or with historical (time dimension) data respectively."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-5",
"text": "In addition, we develop a table cell fusion gate to combine representations from row, column and time dimension into one dense vector according to the saliency of each dimension's representation."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-6",
"text": "We evaluated our methods on ROTOWIRE, a benchmark dataset of NBA basketball games."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-7",
"text": "Both automatic and human evaluation results demonstrate the effectiveness of our model, with an improvement of 2.66 BLEU over the strong baseline and outperformance of the state-of-the-art model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-10",
"text": "Table-to-text generation is an important and challenging task in natural language processing, which aims to produce a summary of a numerical table (Reiter and Dale, 2000; Gkatzia, 2016)."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-11",
"text": "The related methods can be empirically divided into two categories: pipeline models and end-to-end models."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-12",
"text": "The former consists of content selection, document planning and realisation, mainly for early industrial applications such as weather"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-13",
"text": "The Charlotte Hornets (21-27) defeated the Washington Wizards (31-18) 92-88 on Monday \u2026 The Hornets were led by Al Jefferson in this game, who went 9-for-19 from the floor to score 18 points ... It was the second time in the last three games he's posted a double-double, while the two steals matched a season-high for the center \u2026 Beal has turned it on over his last two games, combining for 44 points and 14 rebounds ... This double-double marked the second in a row for Wall, who's combined for 44 points and 22 assists over his last two games \u2026"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-14",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-15",
"text": "**GOLD**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-16",
"text": "Figure 1: Generated example on ROTOWIRE by using Conditional Copy (CC) as baseline (Wiseman et al., 2017) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-17",
"text": "Text that accurately reflects records in the table is in red, and text that contradicts the records is in blue."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-18",
"text": "forecasting and medical monitoring, etc."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-19",
"text": "The latter generates text directly from the table through a standard neural encoder-decoder framework to avoid error propagation and has achieved remarkable progress."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-20",
"text": "In this paper, we particularly focus on exploring how to improve the performance of neural methods on table-to-text generation."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-21",
"text": "Recently, ROTOWIRE, which provides tables of NBA players' and teams' statistics with a descriptive summary, has drawn increasing attention from the academic community."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-22",
"text": "Figure 1 shows an example of part of a game's statistics and its corresponding computer-generated summary."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-23",
"text": "We can see that the table has a formal structure including a table row header, a table column header and table cells."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-24",
"text": "\"Al Jefferson\" is a table row header that represents a player, \"PTS\" is a table column header indicating the column contains player's score and \"18\" is the value of the table cell, that is, Al Jefferson scored 18 points."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-25",
"text": "Several related models have been proposed."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-26",
"text": "They typically encode the table's records separately or as a long sequence and generate a long descriptive summary by a standard Seq2Seq decoder with some modifications."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-27",
"text": "Wiseman et al. (2017) explored two types of copy mechanism and found that the conditional copy model (Gulcehre et al., 2016) performed better."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-28",
"text": "Puduppully et al. (2019) enhanced content selection ability by explicitly selecting and planning relevant records."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-29",
"text": "Li and Wan (2018) improved the precision of describing data records in the generated texts by first generating a template and then filling in slots via a copy mechanism."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-30",
"text": "Nie et al. (2018) utilized results from pre-executed operations to improve the fidelity of generated texts."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-31",
"text": "However, we claim that their encoding of tables as sets of records or a long sequence is not suitable."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-32",
"text": "Because (1) the table consists of multiple players and different types of information as shown in Figure 1 ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-33",
"text": "The earlier encoding approaches only considered the table as a set of records or a one-dimensional sequence, which loses the information of the other (column) dimension."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-34",
"text": "(2) the table cell consists of time-series data which change over time."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-35",
"text": "That is to say, sometimes historical data can help the model select content."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-36",
"text": "Moreover, when a human writes a basketball report, he will not only focus on the players' outstanding performance in the current match, but also summarize players' performance in recent matches."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-37",
"text": "Let's take Figure 1 again."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-38",
"text": "Not only do the gold texts mention Al Jefferson's great performance in this match, it also states that \"It was the second time in the last three games he's posted a double-double\"."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-39",
"text": "The gold texts also summarize John Wall's double-double performance in a similar way."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-40",
"text": "Summarizing a player's performance in recent matches requires the modeling of table cell with respect to its historical data (time dimension) which is absent in baseline model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-41",
"text": "Although the baseline model Conditional Copy (CC) tries to summarize it for Gerald Henderson, it clearly produces a wrong statement, since he didn't get a double-double in this match."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-42",
"text": "To address the aforementioned problems, we present a hierarchical encoder to simultaneously model row, column and time dimension information."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-43",
"text": "In detail, our model is divided into three layers."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-44",
"text": "The first layer is used to learn the representation of the table cell."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-45",
"text": "Specifically, we employ three self-attention models to obtain three representations of the table cell in its row, column and time dimension."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-46",
"text": "Then, in the second layer, we design a record fusion gate to identify the more important representation from those three dimension and combine them into a dense vector."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-47",
"text": "In the third layer, we use mean pooling method to merge the previously obtained table cell representations in the same row into the representation of the table's row."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-48",
"text": "Then, we use self-attention with content selection gate (Puduppully et al., 2019) to filter unimportant rows' information."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-49",
"text": "To the best of our knowledge, this is the first work on neural"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-51",
"text": "**PRELIMINARIES**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-53",
"text": "**NOTATIONS**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-54",
"text": "The inputs to the model are tables S = {s_1, s_2, s_3}."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-55",
"text": "s_1, s_2, and s_3 contain records of home-team players' performance, visiting-team players' performance, and teams' overall performance, respectively."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-56",
"text": "We regard each cell in the table as a record."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-57",
"text": "Each record r consists of four types of information: a value r.v (e.g. 18), an entity r.e (e.g. Al Jefferson), a type r.c (e.g. POINTS) and a feature r.f (e.g. visiting) which indicates whether a player or team competes on the home court or not."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-58",
"text": "Each player or team takes one row in the table and each column contains a type of record such as points, assists, etc."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-59",
"text": "Also, tables contain the date when the match happened and we let k denote the date of the record."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-60",
"text": "We also create timelines for records."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-61",
"text": "The details of timeline construction are described in Section 2.2."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-62",
"text": "For simplicity, we omit the table id l and record date k in the following sections, and let r_{i,j} denote the record in the i-th row and j-th column of the table."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-63",
"text": "We assume the records come from the same table and k is the date of the mentioned record."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-64",
"text": "Given those information, the model is expected to generate text y = (y 1 , ..., y t , ..., y T ) describing these tables."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-65",
"text": "T denotes the length of the text."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-67",
"text": "**RECORD TIMELINE CONSTRCUTION**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-68",
"text": "In this paper, we construct timelines tl = {tl e,c } E,C e=1,c=1 for records."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-69",
"text": "E denotes the number of distinct record entities and C denotes the number of record types."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-70",
"text": "For each timeline tl e,c , we first extract records with the same entity e and type c from dataset."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-71",
"text": "Then we sort them into a sequence according to the record's date from old to new."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-72",
"text": "This sequence is considered as timeline tl e,c ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-73",
"text": "For example, in Figure 2 , the \"Timeline\" part in the lower-left corner represents a timeline for entity Al Jefferson and type PTS (points)."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-75",
"text": "**BASELINE MODEL**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-76",
"text": "We use Seq2Seq model with attention (Luong et al., 2015) and conditional copy (Gulcehre et al., 2016) as the base model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-77",
"text": "During training, given tables S and their corresponding reference texts y, the model maximized the conditional probability P (y|S) = T t=1 P (y t |y Wiseman et al. (2017) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-82",
"text": "[; ] denotes the vector concatenation."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-83",
"text": "Then, we use a LSTM decoder with attention and conditional copy to model the conditional probability P (y t |y (Wiseman et al., 2017) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-176",
"text": "For each example, it provides three tables as described in Section 2.1 which consists of 628 records in total with a long game summary."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-177",
"text": "The average length of game summary is 337.1."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-178",
"text": "In this paper, we followed the data split introduced in Wiseman et al. (2017)"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-179",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-180",
"text": "**IMPLEMENTATION DETAILS**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-181",
"text": "Following configurations in Puduppully et al. (2019) , we set word embedding and LSTM decoder hidden size as 600."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-182",
"text": "The decoder's layer was set to be 2."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-183",
"text": "Input feeding (Luong et al., 2015) was also used for decoder."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-184",
"text": "We applied dropout at a rate 0.3."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-185",
"text": "For training, we used Adagrad (Duchi et al., 2010) optimizer with learning rate of 0.15, truncated BPTT (block length 100), batch size of 5 and learning rate decay of 0.97."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-186",
"text": "For inferring, we set beam size as 5."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-187",
"text": "We also set the history windows size as 3 from {3,5,7} based on the results."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-188",
"text": "Code of our model can be found at https://github.com/ernestgong/data2text-three-dimensions/. Table 1 displays the automatic evaluation results on both development and test set."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-189",
"text": "We chose Conditional Copy (CC) model as our baseline, which is the best model in Wiseman et al. (2017) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-190",
"text": "We included reported scores with updated IE model by Puduppully et al. (2019) and our implementation's result on CC in this paper."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-191",
"text": "Also, we compared our models with other existing works on this dataset including OpATT (Nie et al., 2018) and Neural Content Planning with conditional copy (NCP+CC) (Puduppully et al., 2019) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-192",
"text": "In addition, we implemented three other hierarchical encoders that encoded tables' row dimension information in both record-level and row-level to compare with the hierarchical structure of encoder in our model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-193",
"text": "The decoder was equipped with dual attention (Cohan et al., 2018) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-194",
"text": "The one with LSTM cell is similar to the one in Cohan et al. (2018) with 1 layer from {1,2,3}. The one with CNN cell (Gehring et al., 2017) has kernel width 3 from {3, 5} and 10 layer from {5,10,15,20}. The one with transformer-style encoder (MHSA) (Vaswani et al., 2017) has 8 head from {8, 10} and 5 layer from {2,3,4,5,6}. The heads and layers mentioned above were for both record-level encoder and rowlevel encoder respectively."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-195",
"text": "The self-attention (SA) cell we used, as described in Section 3, achieved better overall performance in terms of F1% of CS, CO and BLEU among the hierarchical encoders."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-196",
"text": "Also we implemented a template system same as the one used in Wiseman et al. (2017) which outputted eight sentences: an introductory sentence (two teams' points and who win), six top players' statistics (ranked by their points) and a conclusion sentence."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-197",
"text": "We refer the readers to Wiseman et al. (2017) 's paper for more detailed information on templates."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-198",
"text": "The gold reference's result is also included in Table 1 ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-199",
"text": "Overall, our model performs better than other neural models on both development and test set in terms of RG's P%, F1% score of CS, CO and BLEU, indicating our model's clear improvement on generating high-fidelity, informative and fluent texts."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-200",
"text": "Also, our model with three dimension representations outperforms hierarchical encoders with only row dimension representation on development set."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-201",
"text": "This indicates that cell and time dimension representation are important in representing the tables."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-202",
"text": "Compared to reported baseline result in Wiseman et al. (2017) , we achieved improvement of 22.27% in terms of RG, 26.84% in terms of CS F1%, 35.28% in terms of CO and 18.75% in terms of BLEU on test set."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-203",
"text": "Unsurprisingly, template system achieves best on RG P% and CS R% due to the included domain knowledge."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-204",
"text": "Also, the high RG # and low CS P% indicates that template will include vast information while many of them are deemed redundant."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-205",
"text": "In addition, the low CO and low BLEU indicates that the rigid structure of the template will produce texts that aren't as adaptive to the given tables and natural as those produced by neural models."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-206",
"text": "Also, we conducted ablation study on our model to evaluate each component's contribution on development set."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-207",
"text": "Based on the results, the absence of row-level encoder hurts our model's performance across all metrics especially the content selection ability."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-208",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-209",
"text": "**RESULTS**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-210",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-211",
"text": "**AUTOMATIC EVALUATION**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-212",
"text": "Row, column and time dimension information are important to the modeling of tables because subtracting any of them will result in performance Table 2 : Automatic evaluation results on test set."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-213",
"text": "Results were obtained using Wiseman et al. (2017) 's trained extractive evaluation models with relexicalization (Li and Wan, 2018) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-214",
"text": "* We include delayed copy (DEL)'s result in the paper (Li and Wan, 2018) for comparison."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-215",
"text": "drop."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-216",
"text": "Also, position embedding is critical when modeling time dimension information according to the results."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-217",
"text": "In addition, record fusion gate plays an important role because BLEU, CO, RG P% and CS P% drop significantly after subtracting it from full model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-218",
"text": "Results show that each component in the model contributes to the overall performance."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-219",
"text": "In addition, we compare our model with delayed copy model (DEL) (Li and Wan, 2018) along with gold text, template system (TEM), conditional copy (CC) (Wiseman et al., 2017) and NCP+CC (NCP) (Puduppully et al., 2019) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-220",
"text": "Li and Wan (2018) 's model generate a template at first and then fill in the slots with delayed copy mechanism."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-221",
"text": "Since its result in Li and Wan (2018) 's paper was evaluated by IE model trained by Wiseman et al. (2017) and \"relexicalization\" by Li and Wan (2018) , we adopted the corresponding IE model and re-implement \"relexicalization\" as suggested by Li and Wan (2018) for fair comparison."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-222",
"text": "Please note that CC's evaluation results via our reimplemented \"relexicalization\" is comparable to the reported result in Li and Wan (2018) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-223",
"text": "We applied them on models other than DEL as shown in Table 2 and report DEL's result from (Li and Wan, 2018) 's paper."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-224",
"text": "It shows that our model outperform Li and Wan (2018) 's model significantly across all automatic evaluation metrics in Table 2 ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-225",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-226",
"text": "**HUMAN EVALUATION**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-227",
"text": "In this section, we hired three graduates who passed intermediate English test (College English Test Band 6) and were familiar with NBA games to perform human evaluation."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-228",
"text": "First, in order to check if history information is important, we sampled 100 summaries from train- ing set and asked raters to manually check whether the summary contained expressions that need to be inferred from history information."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-229",
"text": "It turns out that 56.7% summaries of the sampled summaries need history information."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-230",
"text": "Following human evaluation settings in Puduppully et al. (2019), we conducted the following human evaluation experiments at the same scale."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-231",
"text": "The second experiment is to assess whether the improvement on relation generation metric reported in automatic evaluation is supported by human evaluation."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-232",
"text": "We compared our full model with gold texts, template-based system, CC (Wiseman et al., 2017) and NCP+CC (NCP) (Puduppully et al., 2019) ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-233",
"text": "We randomly sampled 30 examples from test set."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-234",
"text": "Then, we randomly sampled 4 sentences from each model's output for each example."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-235",
"text": "We provided the raters of those sampled sentences with the corresponding NBA game statistics."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-236",
"text": "They were asked to count the number of supporting and contradicting facts in each sentence."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-237",
"text": "Each sentence is rated independently."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-238",
"text": "We report the average number of supporting facts (#Sup) and contradicting facts (#Cont) in Table 3 ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-239",
"text": "Unsurprisingly, template-based system includes most supporting facts and least contradicting facts in its texts because the template consists of a large number of facts and all of those facts are extracted from the table."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-240",
"text": "Also, our model produces less contradicting facts than other two neural models."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-241",
"text": "Although our model produces less supporting facts than NCP and CC, it still includes enough supporting facts (slightly more than gold texts)."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-242",
"text": "Also, comparing to NCP+CC (NCP)s tendency to include vast information that contain redundant information, our models ability to select and accurately convey information is better."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-243",
"text": "All other results (Gold, CC, NCP and ours) are significantly different from template-based system's results in terms of number of supporting facts according to one-way ANOVA with posthoc Tukey HSD tests."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-244",
"text": "All significance difference reported in this paper are less than 0.05."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-245",
"text": "Our model is also significantly different from the NCP model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-246",
"text": "As for average number of contradicting facts, our model is significantly different from other two neural models."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-247",
"text": "Surprisingly, gold texts were found containing contradicting facts."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-248",
"text": "We checked the raters's result and found that gold texts occasionally include wrong field-goal or three-point percent or wrong points difference between the winner and the defeated team."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-249",
"text": "We can treat the average contradicting facts number of gold texts as a lower bound."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-250",
"text": "In the third experiment, following Puduppully et al. (2019) , we asked raters to evaluate those models in terms of grammaticality (is it more fluent and grammatical?), coherence (is it easier to read or follows more natural ordering of facts? ) and conciseness (does it avoid redundant information and repetitions?)."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-251",
"text": "We adopted the same 30 examples from above and arranged every 5-tuple of summaries into 10 pairs."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-252",
"text": "Then, we asked the raters to choose which system performs the best given each pair."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-253",
"text": "Scores are computed as the difference between percentage of times when the model is chosen as the best and percentage of times when the model is chosen as the worst."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-254",
"text": "Gold texts is significantly more grammatical than others across all three metrics."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-255",
"text": "Also, our model performs significantly better than other two neural models (CC, NCP) in all three metrics."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-256",
"text": "Template-based system generates significantly more grammatical and concise but significantly less coherent results, compared to all three neural models."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-257",
"text": "Because the rigid structure of texts ensures the correct grammaticality and no repetition in template-based system's output."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-258",
"text": "However, since the templates are stilted and lack variability compared to others, it was deemed less coherent than the others by the raters."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-259",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-260",
"text": "**QUALITATIVE EXAMPLE**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-261",
"text": "Our model: The Charlotte Hornets ( 21 -27 ) defeated the Washington Wizards ( 31 -18 ) 92 -88 on Monday \u2026 The Hornets were led by Al Jefferson , who recorded a double -double of his own with 18 points ( 9 -19 FG , 0 -2 FT ) and 12 rebounds ."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-262",
"text": "It was his second doubledouble over his last three games \u2026 The only other Wizard to reach double -digit points was Kris Humphries , who came off the bench for 13 points ( 4 -8 FG , 5 -6 FT ) and five rebounds in 26 minutes \u2026 Figure 3 shows an example generated by our model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-263",
"text": "It evidently has several nice properties: it can accurately select important player \"Al Jefferson\" from the tables who is neglected by baseline model, which need the model to understand performance difference of a type of data (column) between each rows (players)."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-264",
"text": "Also it correctly summarize performance of \"Al Jefferson\" in this match as \"double-double\" which requires ability to capture dependency from different columns (different type of record) in the same row (player)."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-265",
"text": "In addition, it models \"Al Jefferson\" history performance and correctly states that \"It was his second double-double over his last three games\", which is also mentioned in gold texts included in Figure 1 in a similar way."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-266",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-267",
"text": "**RELATED WORK**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-268",
"text": "In recent years, neural data-to-text systems make remarkable progress on generating texts directly from data."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-269",
"text": "Mei et al. (2016) proposes an encoderaligner-decoder model to generate weather forecast, while Jain et al. (2018) propose a mixed hierarchical attention."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-270",
"text": "proposes a hybrid content-and linkage-based attention mechanism to model the order of content."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-271",
"text": "propose to integrate field information into table representation and enhance decoder with dual attention."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-272",
"text": "Bao et al. (2018) develops a table-aware encoder-decoder model."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-273",
"text": "Wiseman et al. (2017) introduced a document-scale data-totext dataset, consisting of long text with more redundant records, which requires the model to select important information to generate."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-274",
"text": "We describe recent works in Section 1."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-275",
"text": "Also, some studies in abstractive text summarization encode long texts in a hierarchical manner."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-276",
"text": "Cohan et al. (2018) uses a hierarchical encoder to encode input, paired with a discourse-aware decoder."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-277",
"text": "Ling and Rush (2017) encode document hierarchically and propose coarse-to-fine attention for decoder."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-278",
"text": "Recently, Liu et al. (2019) propose a hierarchical encoder for data-to-text generation which uses LSTM as its cell."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-279",
"text": "Murakami et al. (2017) propose to model stock market time-series data and generate comments."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-280",
"text": "As for incorporating historical background in generation, Robin (1994) proposed to build a draft with essential new facts at first, then incorporate background facts when revising the draft based on functional unification grammars."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-281",
"text": "Different from that, we encode the historical (time dimension) information in the neural datato-text model in an end-to-end fashion."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-282",
"text": "Existing works on data-to-text generation neglect the joint representation of tables' row, column and time dimension information."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-283",
"text": "In this paper, we propose an effective hierarchical encoder which models information from row, column and time dimension simultaneously."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-284",
"text": "----------------------------------"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-285",
"text": "**CONCLUSION**"
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-286",
"text": "In this work, we present an effective hierarchical encoder for table-to-text generation that learns table representations from row, column and time dimension."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-287",
"text": "In detail, our model consists of three layers, which learn records' representation in three dimension, combine those representations via their sailency and obtain row-level representation based on records' representation."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-288",
"text": "Then, during decoding, it will select important table row before attending to records."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-289",
"text": "Experiments are conducted on ROTOWIRE, a benchmark dataset of NBA games."
},
{
"sent_id": "a3dbc3362016cdcfc0c4da429b98cc-C001-290",
"text": "Both automatic and human evaluation results show that our model achieves the new state-of-the-art performance."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-16"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-27"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-81"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-189"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-197"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-212",
"a3dbc3362016cdcfc0c4da429b98cc-C001-213"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-219"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-221"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-232",
"a3dbc3362016cdcfc0c4da429b98cc-C001-233",
"a3dbc3362016cdcfc0c4da429b98cc-C001-234",
"a3dbc3362016cdcfc0c4da429b98cc-C001-235",
"a3dbc3362016cdcfc0c4da429b98cc-C001-236",
"a3dbc3362016cdcfc0c4da429b98cc-C001-237",
"a3dbc3362016cdcfc0c4da429b98cc-C001-238",
"a3dbc3362016cdcfc0c4da429b98cc-C001-239",
"a3dbc3362016cdcfc0c4da429b98cc-C001-240"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-273"
]
],
"cite_sentences": [
"a3dbc3362016cdcfc0c4da429b98cc-C001-16",
"a3dbc3362016cdcfc0c4da429b98cc-C001-81",
"a3dbc3362016cdcfc0c4da429b98cc-C001-189",
"a3dbc3362016cdcfc0c4da429b98cc-C001-197",
"a3dbc3362016cdcfc0c4da429b98cc-C001-213",
"a3dbc3362016cdcfc0c4da429b98cc-C001-219",
"a3dbc3362016cdcfc0c4da429b98cc-C001-221",
"a3dbc3362016cdcfc0c4da429b98cc-C001-232"
]
},
"@SIM@": {
"gold_contexts": [
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-81"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-196"
]
],
"cite_sentences": [
"a3dbc3362016cdcfc0c4da429b98cc-C001-81",
"a3dbc3362016cdcfc0c4da429b98cc-C001-196"
]
},
"@USE@": {
"gold_contexts": [
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-175",
"a3dbc3362016cdcfc0c4da429b98cc-C001-176"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-178"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-189"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-196"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-212",
"a3dbc3362016cdcfc0c4da429b98cc-C001-213"
],
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-221"
]
],
"cite_sentences": [
"a3dbc3362016cdcfc0c4da429b98cc-C001-175",
"a3dbc3362016cdcfc0c4da429b98cc-C001-178",
"a3dbc3362016cdcfc0c4da429b98cc-C001-189",
"a3dbc3362016cdcfc0c4da429b98cc-C001-196",
"a3dbc3362016cdcfc0c4da429b98cc-C001-213",
"a3dbc3362016cdcfc0c4da429b98cc-C001-221"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-202",
"a3dbc3362016cdcfc0c4da429b98cc-C001-203"
]
],
"cite_sentences": [
"a3dbc3362016cdcfc0c4da429b98cc-C001-202"
]
},
"@DIF@": {
"gold_contexts": [
[
"a3dbc3362016cdcfc0c4da429b98cc-C001-232",
"a3dbc3362016cdcfc0c4da429b98cc-C001-233",
"a3dbc3362016cdcfc0c4da429b98cc-C001-234",
"a3dbc3362016cdcfc0c4da429b98cc-C001-235",
"a3dbc3362016cdcfc0c4da429b98cc-C001-236",
"a3dbc3362016cdcfc0c4da429b98cc-C001-237",
"a3dbc3362016cdcfc0c4da429b98cc-C001-238",
"a3dbc3362016cdcfc0c4da429b98cc-C001-239",
"a3dbc3362016cdcfc0c4da429b98cc-C001-240"
]
],
"cite_sentences": [
"a3dbc3362016cdcfc0c4da429b98cc-C001-232"
]
}
}
},
"ABC_8c26fb4c81c121103c1d5851edb41e_6": {
"x": [
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-2",
"text": "Sex trafficking is a global epidemic."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-3",
"text": "Escort websites are a primary vehicle for selling the services of such trafficking victims and thus a major driver of trafficker revenue."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-4",
"text": "Many law enforcement agencies do not have the resources to manually identify leads from the millions of escort ads posted across dozens of public websites."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-5",
"text": "We propose an ordinal regression neural network to identify escort ads that are likely linked to sex trafficking."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-6",
"text": "Our model uses a modified cost function to mitigate inconsistencies in predictions often associated with nonparametric ordinal regression and leverages recent advancements in deep learning to improve prediction accuracy."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-7",
"text": "The proposed method significantly improves on the previous state-of-the-art on Trafficking-10K, an expert-annotated dataset of escort ads."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-8",
"text": "Additionally, because traffickers use acronyms, deliberate typographical errors, and emojis to replace explicit keywords, we demonstrate how to expand the lexicon of trafficking flags through word embeddings and t-SNE."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-11",
"text": "Globally, human trafficking is one of the fastest growing crimes and, with annual profits estimated to be in excess of 150 billion USD, it is also among the most lucrative (Amin, 2010) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-12",
"text": "Sex trafficking is a form of human trafficking which involves sexual exploitation through coercion."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-13",
"text": "Recent estimates suggest that nearly 4 million adults and 1 million children are being victimized globally on any given day; furthermore, it is estimated that 99 percent of victims are female (International Labour Organization et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-14",
"text": "Escort websites are an increasingly popular vehicle for selling the services of trafficking victims."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-15",
"text": "According to a recent survivor survey (THORN and Bouch\u00e9, 2018) , 38% of underage trafficking victims who were enslaved prior to 2004 were advertised online, and that number rose to 75% for those enslaved after 2004."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-16",
"text": "Prior to its shutdown in April 2018, the website Backpage was the most frequently used online advertising platform; other popular escort websites include Craigslist, Redbook, SugarDaddy, and Facebook (THORN and Bouch\u00e9, 2018) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-17",
"text": "Despite the seizure of Backpage, there were nearly 150,000 new online sex advertisements posted per day in the U.S. alone in late 2018 (Tarinelli, 2018) ; even with many of these new ads being re-posts of existing ads and traffickers often posting multiple ads for the same victims (THORN and Bouch\u00e9, 2018) , this volume is staggering."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-18",
"text": "Because of their ubiquity and public access, escort websites are a rich resource for antitrafficking operations."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-19",
"text": "However, many law enforcement agencies do not have the resources to sift through the volume of escort ads to identify those coming from potential traffickers."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-20",
"text": "One scalable and efficient solution is to build a statistical model to predict the likelihood of an ad coming from a trafficker using a dataset annotated by anti-trafficking experts."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-21",
"text": "We propose an ordinal regression neural network tailored for text input."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-22",
"text": "This model comprises three components: (i) a Word2Vec model (Mikolov et al., 2013b ) that maps each word from the text input to a numeric vector, (ii) a gated-feedback recurrent neural network (Chung et al., 2015) that sequentially processes the word vectors, and (iii) an ordinal regression layer (Cheng et al., 2008) that produces a predicted ordinal label."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-23",
"text": "We use a modified cost function to mitigate inconsistencies in predictions associated with nonparametric ordinal regression."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-24",
"text": "We also leverage several regularization techniques for deep neural networks to further improve model performance, such as residual con-nection (He et al., 2016) and batch normalization (Ioffe and Szegedy, 2015) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-25",
"text": "We conduct our experiments on Trafficking-10k (Tong et al., 2017) , a dataset of escort ads for which anti-trafficking experts assigned each sample one of seven ordered labels ranging from \"1: Very Unlikely (to come from traffickers)\" to \"7: Very Likely\"."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-26",
"text": "Our proposed model significantly outperforms previously published models (Tong et al., 2017) on Trafficking-10k as well as a variety of baseline ordinal regression models."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-27",
"text": "In addition, we analyze the emojis used in escort ads with Word2Vec and t-SNE (van der Maaten and Hinton, 2008) , and we show that the lexicon of trafficking-related emojis can be subsequently expanded."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-28",
"text": "In Section 2, we discuss related work on human trafficking detection and ordinal regression."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-29",
"text": "In Section 3, we present our proposed model and detail its components."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-30",
"text": "In Section 4, we present the experimental results, including the Trafficking-10K benchmark, a qualitative analysis of the predictions on raw data, and the emoji analysis."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-31",
"text": "In Section 5, we summarize our findings and discuss future work."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-33",
"text": "**RELATED WORK**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-34",
"text": "Trafficking detection: There have been several software products designed to aid anti-trafficking efforts."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-35",
"text": "Examples include Memex 1 which focuses on search functionalities in the dark web; Spotlight 2 which flags suspicious ads and links images appearing in multiple ads; Traffic Jam 3 which seeks to identify patterns that connect multiple ads to the same trafficking organization; and TraffickCam 4 which aims to construct a crowd-sourced database of hotel room images to geo-locate victims."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-36",
"text": "These research efforts have largely been isolated, and few research articles on machine learning for trafficking detection have been published."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-37",
"text": "Closest to our work is the Human Trafficking Deep Network (HTDN) (Tong et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-38",
"text": "HTDN has three main components: a language network that uses pretrained word embeddings and a long shortterm memory network (LSTM) to process text input; a vision network that uses a convolutional network to process image input; and another convolutional network to combine the output of the previous two networks and produce a binary classification."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-39",
"text": "Compared to the language network in HTDN, our model replaces LSTM with a gatedfeedback recurrent neural network, adopts certain regularizations, and uses an ordinal regression layer on top."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-40",
"text": "It significantly improves HTDN's benchmark despite only using text input."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-41",
"text": "As in the work of E. Tong et al. (2017) , we pre-train word embeddings using a skip-gram model (Mikolov et al., 2013b) applied to unlabeled data from escort ads, however, we go further by analyzing the emojis' embeddings and thereby expand the trafficking lexicon."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-42",
"text": "Ordinal regression: We briefly review ordinal regression before introducing the proposed methodology."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-43",
"text": "We assume that the training data are"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-44",
"text": ", where X i \u2208 X are the features and Y i \u2208 Y is the response; Y is the set of k ordered labels {1, 2, . . . , k} with 1 \u227a 2 . . ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-45",
"text": "\u227a k. Many ordinal regression methods learn a composite map \u03b7 = h \u2022 g, where g : X \u2192 R and h : R \u2192 {1, 2, . . . , k} have the interpretation that g(X) is a latent \"score\" which is subsequently discretized into a category by h. \u03b7 is often estimated by empirical risk minimization, i.e., by minimizing a loss function C{\u03b7(X), Y } averaged over the training data."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-46",
"text": "Standard choices of \u03b7 and C are reviewed by J. Rennie & N. Srebro (2005) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-47",
"text": "Another common approach to ordinal regression, which we adopt in our proposed method, is to transform the label prediction into a series of k \u2212 1 binary classification sub-problems, wherein the ith sub-problem is to predict whether or not the true label exceeds i (Frank and Hall, 2001; Li and Lin, 2006) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-48",
"text": "For example, one might use a series of logistic regression models to estimate the conditional probabilities f i (X) = P (Y > i X) for each i = 1, . . . , k \u2212 1."
},
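This reduction to k \u2212 1 binary sub-problems, and the corresponding rank decoding, can be sketched as follows (a minimal NumPy illustration, not the authors' code; the 0.5 decision threshold is the conventional choice):

```python
import numpy as np

def ordinal_to_binary(y, k):
    """Encode ranks y in {1..k} as k-1 binary targets:
    the i-th target indicates whether y > i."""
    return (y[:, None] > np.arange(1, k)[None, :]).astype(float)

def decode_rank(probs):
    """Decode a predicted rank from the k-1 estimated
    probabilities f_i(X) = P(Y > i | X)."""
    return 1 + (probs > 0.5).sum(axis=1)

y = np.array([1, 3, 7])
targets = ordinal_to_binary(y, k=7)
# rank 3 -> [1, 1, 0, 0, 0, 0]: it exceeds ranks 1 and 2 only
```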
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-49",
"text": "J. Cheng et al. (2008) estimated these probabilities jointly using a neural network; this was later extended to image data (Niu et al., 2016) as well as text data (Irsoy and Cardie, 2015; Ruder et al., 2016) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-50",
"text": "However, as acknowledged by J. Cheng et al. (2008) , the estimated probabilities need not respect the ordering f i (X) \u2265 f i+1 (X) for all i and X. We force our estimator to respect this ordering through a penalty on its violation."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-52",
"text": "**METHOD**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-53",
"text": "Our proposed ordinal regression model consists of the following three components: Word embeddings pre-trained by a Skip-gram model, a gatedfeedback recurrent neural network that constructs summary features from sentences, and a multilabeled logistic regression layer tailored for ordinal regression."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-54",
"text": "See Figure 1 for a schematic."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-55",
"text": "The details of its components and their respective alternatives are discussed below."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-57",
"text": "**WORD EMBEDDINGS**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-58",
"text": "Vector representations of words, also known as word embeddings, can be obtained through unsupervised learning on a large text corpus so that certain linguistic regularities and patterns are encoded."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-59",
"text": "Compared to Latent Semantic Analysis (Dumais, 2004) , embedding algorithms using neural networks are particularly good at preserving linear regularities among words in addition to grouping similar words together (Mikolov et al., 2013a) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-60",
"text": "Such embeddings can in turn help other algorithms achieve better performances in various natural language processing tasks (Mikolov et al., 2013b) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-61",
"text": "Unfortunately, the escort ads contain a plethora of emojis, acronyms, and (sometimes deliberate) typographical errors that are not encountered in more standard text data, which suggests that it is likely better to learn word embeddings from scratch on a large collection of escort ads instead of using previously published embeddings (Tong et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-62",
"text": "We use 168,337 ads scraped from Backpage as our training corpus and the Skipgram model with Negative sampling (Mikolov et al., 2013b) as our model."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-64",
"text": "**GATED-FEEDBACK RECURRENT NEURAL NETWORK**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-65",
"text": "To process entire sentences and paragraphs after mapping the words to embeddings, we need a model to handle sequential data."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-66",
"text": "Recurrent neural networks (RNNs) have recently seen great success at modeling sequential data, especially in natural language processing tasks (LeCun et al., 2015) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-67",
"text": "On a high level, an RNN is a neural network that processes a sequence of inputs one at a time, taking the summary of the sequence seen so far from the previous time point as an additional input and producing a summary for the next time point."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-68",
"text": "One of the most widely used variations of RNNs, a Long short-term memory network (LSTM), uses various gates to control the information flow and is able to better preserve long-term dependencies in the running summary compared to a basic RNN (see Goodfellow et al., 2016 , and references therein)."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-69",
"text": "In our implementation, we use a further refinement of multi-layed LSTMs, Gatedfeedback recurrent neural networks (GF-RNNs), which tend to capture dependencies across different timescales more easily (Chung et al., 2015) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-70",
"text": "Regularization techniques for neural networks including Dropout (Srivastava et al., 2014) , Residual connection (He et al., 2016) , and Batch normalization (Ioffe and Szegedy, 2015) are added to GF-RNN for further improvements."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-71",
"text": "After GF-RNN processes an entire escort ad, the average of the hidden states of the last layer becomes the input for the multi-labeled logistic regression layer which we discuss next."
},
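As a rough sketch of this summary step, with a stacked LSTM standing in for the GF-RNN (gated-feedback connections are not available in `torch.nn`) and hypothetical dimensions:

```python
import torch
import torch.nn as nn

# A 2-layer LSTM stands in for the GF-RNN; 32-d word embeddings
# and a 64-d hidden state are assumed for illustration.
rnn = nn.LSTM(input_size=32, hidden_size=64,
              num_layers=2, batch_first=True)

emb = torch.randn(4, 20, 32)   # batch of 4 ads, 20 tokens each
out, _ = rnn(emb)              # (4, 20, 64): last layer's hidden states
summary = out.mean(dim=1)      # average over time -> (4, 64) features
```

The `summary` vector is what would feed the multi-labeled logistic regression layer described next.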
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-73",
"text": "**MULTI-LABELED LOGISTIC REGRESSION LAYER**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-74",
"text": "As noted previously, the ordinal regression problem can be cast into a series of binary classification problems and thereby utilize the large repository of available classification algorithms (Frank and Hall, 2001; Li and Lin, 2006; Niu et al., 2016) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-75",
"text": "One formulation is as follows."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-76",
"text": "Given k total ranks, the i-th binary classifier is trained to predict the probability that a sample X has rank larger than i : f i (X) = P(Y > i|X)."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-77",
"text": "Then the predicted rank is"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-78",
"text": "In a classification task, the final layer of a deep neural network is typically a softmax layer with dimension equal to the number of classes (Goodfellow et al., 2016) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-79",
"text": "Using the ordinalregression-to-binary-classifications formulation described above, J. Cheng et al. (2008) replaced the softmax layer in their neural network with a (k \u2212 1)-dimensional sigmoid layer, where each neuron serves as a binary classifier (see Figure 2 but without the order penalty to be discussed later)."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-80",
"text": "With the sigmoid activation function, the output of the ith neuron can be viewed as the predicted probability that the sample has rank greater 5 than i. Alternatively, the entire sigmoid layer can be viewed as performing multi-labeled logistic regression, where the ith label is the indicator of the sample's rank being greater than i. The training data are thus re-formatted accordingly so that response variable for a sample with rank i becomes (1 i\u22121 , 0 k\u2212i ) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-81",
"text": "The k \u2212 1 binary classifiers share the features constructed by the earlier layers of the neural network and can be trained jointly with mean squared error loss."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-82",
"text": "A key difference between the multi-labeled logistic regression and the naive classification (ignoring the order and treating all ranks as separate classes) is that the loss for Y = Y is constant in the naive classification but proportional to | Y \u2212 Y | in the multi-labeled logistic regression."
},
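A minimal PyTorch sketch of such a (k \u2212 1)-neuron sigmoid head trained with mean squared error on the re-formatted targets (dimensions and names are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

k = 7                          # number of ordered labels
head = nn.Linear(64, k - 1)    # attached to 64-d summary features (assumed)

def ordinal_loss(features, ranks):
    """MSE between sigmoid outputs and extended binary targets,
    where the i-th target is 1{rank > i}."""
    probs = torch.sigmoid(head(features))         # (B, k-1)
    idx = torch.arange(1, k).unsqueeze(0)         # (1, k-1)
    targets = (ranks.unsqueeze(1) > idx).float()  # (B, k-1)
    return ((probs - targets) ** 2).mean()

feats = torch.randn(4, 64)
ranks = torch.tensor([1, 4, 6, 7])
loss = ordinal_loss(feats, ranks)  # scalar, differentiable
```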
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-83",
"text": "J. Cheng et al.'s (2008) final layer was preceded by a simple feed-forward network."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-84",
"text": "In our case, word embeddings and GF-RNN allow us to construct a feature vector of fixed length from text input, so we can simply attach the multi-labeled logistic regression layer to the output of GF-RNN to complete an ordinal regression neural network for text input."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-85",
"text": "The violation of the monotonicity in the estimated probabilities (e.g., f i (X) < f i+1 (X) for some X and i) has remained an open issue since the original ordinal regression neural network proposal of J. Cheng et al (2008) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-86",
"text": "This is perhaps owed in part to the belief that correcting this issue would significantly increase training complexity (Niu et al., 2016) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-87",
"text": "We propose an effective and computationally efficient solution to avoid the conflicting predictions as follows: penalize such 5 Actually, in J. Cheng et al.'s original formulation, the final layer is k-dimensional with the i-th neuron predicting the probability that the sample has rank greater than or equal to i. This is redundant because the first neuron should always be equal to 1."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-88",
"text": "Hence we make the slight adjustment of using only k \u2212 1 neurons."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-89",
"text": "conflicts in the training phase by adding"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-90",
"text": "to the loss function for a sample X, where \u03bb is a penalty parameter (Figure 2 )."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-91",
"text": "For sufficiently large \u03bb the estimated probabilities will respect the monotonicity condition; respecting this condition improves the interpretability of the predictions, which is vital in applications like the one we consider here as stakeholders are given the estimated probabilities."
},
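One way to implement such a penalty on adjacent-pair violations of f_i(X) \u2265 f_{i+1}(X) is a hinge term (a hedged sketch; the exact form and \u03bb used in the paper may differ):

```python
import torch

def order_penalty(probs, lam=1.0):
    """Penalize violations of f_i(X) >= f_{i+1}(X): a hinge on each
    adjacent pair of the (B, k-1) predicted probabilities."""
    viol = torch.relu(probs[:, 1:] - probs[:, :-1])  # positive only on violations
    return lam * viol.sum(dim=1).mean()

p_ok = torch.tensor([[0.9, 0.7, 0.3]])   # monotone: zero penalty
p_bad = torch.tensor([[0.3, 0.7, 0.9]])  # two violations: positive penalty
```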
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-92",
"text": "We also hypothesize that the order penalty may serve as a regularizer to improve each binary classifier (see the ablation test in Section 4.3)."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-93",
"text": "All three components of our model (word embeddings, GF-RNN, and multi-labeled logistic regression layer) can be trained jointly, with word embeddings optionally held fixed or given a smaller learning rate for fine-tuning."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-94",
"text": "The hyperparameters for all components are given in the Appendix. They are selected according to either literature or grid-search."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-96",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-97",
"text": "We first describe the datasets we use to train and evaluate our models."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-98",
"text": "Then we present a detailed comparison of our proposed model with commonly used ordinal regression models as well as the previous state-of-the-art classification model by E. Tong et al. (2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-99",
"text": "To assess the effect of each component in our model, we perform an ablation test where the components are swapped by their more standard alternatives one at a time."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-100",
"text": "Next, we perform a qualitative analysis on the model predictions on the raw data, which are scraped from a different escort website than the one that provides the labeled training data."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-101",
"text": "Finally, we conduct an emoji analysis using the word embeddings trained on raw escort ads."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-103",
"text": "**DATASETS**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-104",
"text": "We use raw texts scraped from Backpage and TNABoard to pre-train the word embeddings, and use the same labeled texts E. Tong et al. (2017) used to conduct model comparisons."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-105",
"text": "The raw text dataset consists of 44,105 ads from TNABoard and 124,220 ads from Backpage."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-106",
"text": "Data cleaning/preprocessing includes joining the title and the body of an ad; adding white spaces around every emoji so that it can be tokenized properly; stripping tabs, line breaks, punctuations, and extra white spaces; removing phone numbers; and converting all letters to lower case."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-107",
"text": "We have ensured that the raw dataset has no overlap with the labeled dataset to avoid bias in test accuracy."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-108",
"text": "While it is possible to scrape more raw data, we did not observe significant improvements in model performances when the size of raw data increased from \u223c70,000 to \u223c170,000, hence we assume that the current raw dataset is sufficiently large."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-109",
"text": "The labeled dataset is called Trafficking-10k."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-110",
"text": "It consists of 12,350 ads from Backpage labeled by experts in human trafficking detection 6 (Tong et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-111",
"text": "Each label is one of seven ordered levels of likelihood that the corresponding ad comes from a human trafficker."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-112",
"text": "Descriptions and sample proportions of the labels are in Table 1 ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-113",
"text": "The original Trafficking-10K includes both texts and images, but as mentioned in Section 1, only the texts are used in our case."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-114",
"text": "We apply the same preprocessing to Trafficking-10k as we do to raw data."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-116",
"text": "**COMPARISON WITH BASELINES**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-117",
"text": "We compare our proposed ordinal regression neural network (ORNN) to Immediate-Threshold ordinal logistic regression (IT) (Rennie and Srebro, 2005) , All-Threshold ordinal logistic regression (AT) (Rennie and Srebro, 2005) , Least Absolute Deviation (LAD) (Bloomfield and Steiger, 1980; Narula and Wellington, 1982) , and multi-class logistic regression (MC) which ignores the ordering."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-118",
"text": "The primary evaluation metrics are Mean Absolute Error (MAE) and macro-averaged Mean Absolute Error (MAE M ) (Baccianella et al., 2009) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-119",
"text": "6 Backpage was seized by FBI in April 2018, but we have observed that escort ads across different websites are often similar, and a survivor survey shows that traffickers post their ads on multiple websites (THORN and Bouch\u00e9, 2018) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-120",
"text": "Thus, we argue that the training data from Backpage are still useful, which is empirically supported by our qualitative analysis in Section 4.4."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-121",
"text": "To compare our model with the previous stateof-the-art classification model for escort ads, the Human Trafficking Deep Network (HTDN) (Tong et al., 2017) , we also polarize the true and predicted labels into two classes, \"1-4: Unlikely\" and \"5-7: Likely\"; then we compute the binary classification accuracy (Acc.) as well as the weighted binary classification accuracy (Wt. Acc.) The text data need to be vectorized before they can be fed into the baseline models (whereas vectorization is built into ORNN)."
},
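The polarization of the seven ordinal labels into the two classes can be expressed directly:

```python
import numpy as np

def polarize(labels):
    """Map ordinal labels 1-7 to binary classes:
    1-4 -> 0 ("Unlikely"), 5-7 -> 1 ("Likely")."""
    return (np.asarray(labels) >= 5).astype(int)

binary = polarize([1, 4, 5, 7])  # -> array([0, 0, 1, 1])
```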
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-122",
"text": "The standard practice is to tokenize the texts using n-grams and then create weighted term frequency vectors using the term frequency (TF)-inverse document frequency (IDF) scheme (Beel et al., 2016; Manning et al., 2009) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-123",
"text": "The specific variation we use is the recommended unigram + sublinear TF + smooth IDF (Manning et al., 2009; Pedregosa et al., 2011) ."
},
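With scikit-learn, this vectorization scheme for the baselines corresponds to the following (toy documents shown in place of real ad texts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigram tokens + sublinear TF (1 + log tf) + smooth IDF,
# matching the recommended variation described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                             sublinear_tf=True,
                             smooth_idf=True)

docs = ["new girl in town call now",
        "sweet young girl available now"]  # toy stand-ins for ad texts
X = vectorizer.fit_transform(docs)         # sparse (n_docs, vocab) matrix
```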
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-124",
"text": "Dimension reduction techniques such as Latent Semantic Analysis (Dumais, 2004) can be optionally applied to the frequency vectors, but B. Schuller et al. (2015) concluded from their experiments that dimension reduction on frequency vectors actually hurts model performance, which our preliminary experiments agree with."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-125",
"text": "All models are trained and evaluated using the same (w.r.t."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-126",
"text": "data shuffle and split) 10-fold crossvalidation (CV) on Trafficking-10k, except for HTDN, whose result is read from the original paper (Tong et al., 2017) 7 ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-127",
"text": "During each train-test split, 2/9 of the training set is further reserved as the validation set for tuning hyperparameters such as L2-penalty in IT, AT and LAD, and learning rate in ORNN."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-128",
"text": "So the overall train-validation-test ratio is 70%-20%-10%."
},
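As a quick arithmetic check of the stated ratio:

```python
# 10-fold CV: 1/10 of the data is held out for test, and 2/9 of the
# remaining 9/10 is reserved for validation.
test = 1 / 10
val = (9 / 10) * (2 / 9)   # = 0.2
train = 1 - test - val     # = 0.7
```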
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-129",
"text": "We report the mean metrics from the CV in Table 2 ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-130",
"text": "As previous research has pointed out that there is no unbiased estimator of the variance of CV (Bengio and 2004), we report the naive standard error treating metrics across CV as independent."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-131",
"text": "We can see that ORNN has the best MAE, MAE M and Acc. as well as a close 2nd best Wt."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-132",
"text": "Acc. among all models."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-133",
"text": "Its Wt."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-134",
"text": "Acc. is a substantial improvement over HTDN despite the fact that the latter use both text and image data."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-135",
"text": "It is important to note that HTDN is trained using binary labels, whereas the other models are trained using ordinal labels and then have their ordinal predictions converted to binary predictions."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-136",
"text": "This is most likely the reason that even the baseline models except for LAD can yield better Wt."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-137",
"text": "Acc."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-138",
"text": "than HTDN, confirming our earlier claim that polarizing the ordinal labels during training may lead to information loss."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-140",
"text": "**ABLATION TEST**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-141",
"text": "To ensure that we do not unnecessarily complicate our ORNN model, and to assess the impact of each component on the final model performance, we perform an ablation test."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-142",
"text": "Using the same CV and evaluation metrics, we make the following replacements separately and re-evaluate the model: 1. Replace word embeddings pre-trained from skip-gram model with randomly initialized word embeddings; 2. replace gated-feedback recurrent neural network with long short-term memory network (LSTM); 3. disable batch normalization; 4. disable residual connection; 5. replace the multilabeled logistic regression layer with a softmax layer (i.e., let the model perform classification, treating the ordinal response variable as a categorical variable with k classes); 6. replace the multi-labeled logistic regression layer with a 1-dimensional linear layer (i.e., let the model perform regression, treating the ordinal response variable as a continuous variable) and round the prediction to the nearest integer during testing; 7. set the order penalty to 0."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-143",
"text": "The results are shown in Table 3 ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-144",
"text": "The proposed ORNN once again has all the best metrics except for Wt."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-145",
"text": "Acc. which is the 2nd best."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-146",
"text": "This suggests that each component indeed makes a contribution."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-147",
"text": "Note that if we disregard the ordinal labels and perform classification or regression, MAE falls off by a large margin."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-148",
"text": "Setting order penalty to 0 does not deteriorate the performance by much, however, the percent of conflicting binary predictions (see Section 3.3) rises from 1.4% to 5.2%."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-149",
"text": "So adding an order penalty helps produce more interpretable results 8 ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-151",
"text": "**QUALITATIVE ANALYSIS OF PREDICTIONS**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-152",
"text": "To qualitatively evaluate how well our model predicts on raw data and observe potential patterns in the flagged samples, we obtain predictions on the 44,105 unlabelled ads from TNABoard with the ORNN model trained on Trafficking-10k, then we examine the samples with high predicted likelihood to come from traffickers."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-153",
"text": "Below are the top three samples that the model considers likely:"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-154",
"text": "\u2022 \"amazing reviewed crystal only here till fri book now please check our site for the services the girls provide all updates specials photos rates reviews njfantasygirls . . . look who s back amazing reviewed model samantha. . ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-155",
"text": "brand new spinner jessica special rate today 250 hr 21 5 4 120 34b total gfe total anything goes no limits. . . \""
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-156",
"text": "\u2022 \"2 hot toght 18y o spinners 4 amazing providers today specials. . . \""
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-157",
"text": "\u2022 \"asian college girl is visiting bellevue service type escort hair color brown eyes brown age 23 height 5 4 body type slim cup size c cup ethnicity asian service type escort i am here for you settle men i am a tiny asian girl who is waiting for a gentlemen. . . \""
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-158",
"text": "Some interesting patterns in the samples with high predicted likelihood (here we only showed three) include: mentioning of multiple names or > 1 providers in a single ad; possibly intentional typos and abbreviations for the sensitive words such as \"tight\" \u2192 \"toght\" and \"18 year old\" \u2192 \"18y o\"; keywords that indicate traveling of the providers such as \"till fri\", \"look who s back\", and \"visiting\"; keywords that hint on the providers potentially being underage such as \"18y o\", \"college girl\", and \"tiny\"; and switching between third person and first person narratives."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-160",
"text": "**EMOJI ANALYSIS**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-161",
"text": "The fight against human traffickers is adversarial and dynamic."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-162",
"text": "Traffickers often avoid using explicit keywords when advertising victims, but instead use acronyms, intentional typos, and emojis (Tong et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-163",
"text": "Law enforcement maintains a lexicon of trafficking flags mapping certain emojis to their potential true meanings (e.g., the cherry emoji can indicate an underaged victim), but compiling such a lexicon manually is expensive, requires frequent updating, and relies on domain expertise that is hard to obtain (e.g., insider information from traffickers or their victims)."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-164",
"text": "To make matters worse, traffickers change their dictionaries over time and regularly switch to new emojis to replace certain keywords (Tong et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-165",
"text": "In such a dynamic and adversarial environment, the need for a data-driven approach in updating the existing lexicon is evident."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-166",
"text": "As mentioned in Section 3.1, training a skipgram model on a text corpus can map words (including emojis) used in similar contexts to similar numeric vectors."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-167",
"text": "Besides using the vectors learned from the raw escort ads to train ORNN, we can directly visualize the vectors for the emojis to help identify their relationships, by mapping the vectors to a 2-dimensional space using t-SNE 9 (van der Maaten and Hinton, 2008) (Figure 3) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-168",
"text": "We can first empirically assess the quality of the emoji map by noting that similar emojis do seem clustered together: the smileys near the coordinate (2, 3), the flowers near (-6, -1), the heart shapes near (-8, 1), the phones near (-2, 4) and so on."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-169",
"text": "It is worth emphasizing that the skip-gram model learns the vectors of these emojis based on their contexts in escort ads and not their visual representations, so the fact that the visually similar emojis are close to one another in the map suggests that Figure 3 : Emoji map produced by applying t-SNE to the emojis' vectors learned from escort ads using skip-gram model."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-170",
"text": "For visual clarity, only the emojis that appeared most frequently in the escort ads we scraped are shown out of the total 968 emojis that appeared."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-171",
"text": "the vectors have been learned as desired."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-172",
"text": "The emoji map can assist anti-trafficking experts in expanding the existing lexicon of trafficking flags."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-173",
"text": "For example, according to the lexicon we obtained from Global Emancipation Network 10 , the cherry emoji and the lollipop emoji are both flags for underaged victims."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-174",
"text": "Near (-3, -4) in the map, right next to these two emojis are the porcelain dolls emoji, the grapes emoji, the strawberry emoji, the candy emoji, the ice cream emojis, and maybe the 18-slash emoji, indicating that they are all used in similar contexts and perhaps should all be flags for underaged victims in the updated lexicon."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-175",
"text": "If we re-train the skip-gram model and update the emoji map periodically on new escort ads, when traffickers switch to new emojis, the map can link the new emojis to the old ones, assisting anti-trafficking experts in expanding the lexicon of trafficking flags."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-176",
"text": "This approach also works for acronyms and deliberate typos."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-178",
"text": "**DISCUSSION**"
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-179",
"text": "Human trafficking is a form of modern day slavery that victimizes millions of people."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-180",
"text": "It has become 10 Global Emancipation Network is a non-profit organization dedicated to combating human trafficking."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-181",
"text": "For more information see https://www.globalemancipation.ngo."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-182",
"text": "the norm for sex traffickers to use escort websites to openly advertise their victims."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-183",
"text": "We designed an ordinal regression neural network (ORNN) to predict the likelihood that an escort ad comes from a trafficker, which can drastically narrow down the set of possible leads for law enforcement."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-184",
"text": "Our ORNN achieved the state-of-the-art performance on Trafficking-10K (Tong et al., 2017) , outperforming all baseline ordinal regression models as well as improving the classification accuracy over the Human Trafficking Deep Network (Tong et al., 2017) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-185",
"text": "We also conducted an emoji analysis and showed how to use word embeddings learned from raw text data to help expand the lexicon of trafficking flags."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-186",
"text": "Since our experiments, there have been considerable advancements in language representation models, such as BERT (Devlin et al., 2018) ."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-187",
"text": "The new language representation models can be combined with our ordinal regression layer, replacing the skip-gram model and GF-RNN, to potentially further improve our results."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-188",
"text": "However, our contributions of improving the cost function for ordinal regression neural networks, qualitatively analyzing patterns in the predicted samples, and expanding the trafficking lexicon through a data-driven approach are not dependent on a particular choice of language representation model."
},
{
"sent_id": "8c26fb4c81c121103c1d5851edb41e-C001-189",
"text": "As for future work in trafficking detection, we can design multi-modal ordinal regression networks that utilize both image and text data. But given the time and resources required to label escort ads, we may explore more unsupervised learning or transfer learning algorithms, such as using object detection (Ren et al., 2015) and matching algorithms to match hotel rooms in the images."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"8c26fb4c81c121103c1d5851edb41e-C001-25"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-26"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-37",
"8c26fb4c81c121103c1d5851edb41e-C001-38",
"8c26fb4c81c121103c1d5851edb41e-C001-39",
"8c26fb4c81c121103c1d5851edb41e-C001-40"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-41"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-58",
"8c26fb4c81c121103c1d5851edb41e-C001-61"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-104",
"8c26fb4c81c121103c1d5851edb41e-C001-105"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-109",
"8c26fb4c81c121103c1d5851edb41e-C001-110",
"8c26fb4c81c121103c1d5851edb41e-C001-111",
"8c26fb4c81c121103c1d5851edb41e-C001-112",
"8c26fb4c81c121103c1d5851edb41e-C001-113",
"8c26fb4c81c121103c1d5851edb41e-C001-114"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-125",
"8c26fb4c81c121103c1d5851edb41e-C001-126"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-162"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-164",
"8c26fb4c81c121103c1d5851edb41e-C001-165"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-184"
]
],
"cite_sentences": [
"8c26fb4c81c121103c1d5851edb41e-C001-25",
"8c26fb4c81c121103c1d5851edb41e-C001-26",
"8c26fb4c81c121103c1d5851edb41e-C001-37",
"8c26fb4c81c121103c1d5851edb41e-C001-41",
"8c26fb4c81c121103c1d5851edb41e-C001-61",
"8c26fb4c81c121103c1d5851edb41e-C001-104",
"8c26fb4c81c121103c1d5851edb41e-C001-110",
"8c26fb4c81c121103c1d5851edb41e-C001-126",
"8c26fb4c81c121103c1d5851edb41e-C001-162",
"8c26fb4c81c121103c1d5851edb41e-C001-164",
"8c26fb4c81c121103c1d5851edb41e-C001-184"
]
},
"@USE@": {
"gold_contexts": [
[
"8c26fb4c81c121103c1d5851edb41e-C001-25"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-98"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-104",
"8c26fb4c81c121103c1d5851edb41e-C001-105"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-109",
"8c26fb4c81c121103c1d5851edb41e-C001-110",
"8c26fb4c81c121103c1d5851edb41e-C001-111",
"8c26fb4c81c121103c1d5851edb41e-C001-112",
"8c26fb4c81c121103c1d5851edb41e-C001-113",
"8c26fb4c81c121103c1d5851edb41e-C001-114"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-121"
]
],
"cite_sentences": [
"8c26fb4c81c121103c1d5851edb41e-C001-25",
"8c26fb4c81c121103c1d5851edb41e-C001-98",
"8c26fb4c81c121103c1d5851edb41e-C001-104",
"8c26fb4c81c121103c1d5851edb41e-C001-110",
"8c26fb4c81c121103c1d5851edb41e-C001-121"
]
},
"@DIF@": {
"gold_contexts": [
[
"8c26fb4c81c121103c1d5851edb41e-C001-26"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-37",
"8c26fb4c81c121103c1d5851edb41e-C001-38",
"8c26fb4c81c121103c1d5851edb41e-C001-39",
"8c26fb4c81c121103c1d5851edb41e-C001-40"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-41"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-184"
]
],
"cite_sentences": [
"8c26fb4c81c121103c1d5851edb41e-C001-26",
"8c26fb4c81c121103c1d5851edb41e-C001-37",
"8c26fb4c81c121103c1d5851edb41e-C001-41",
"8c26fb4c81c121103c1d5851edb41e-C001-184"
]
},
"@EXT@": {
"gold_contexts": [
[
"8c26fb4c81c121103c1d5851edb41e-C001-41"
]
],
"cite_sentences": [
"8c26fb4c81c121103c1d5851edb41e-C001-41"
]
},
"@MOT@": {
"gold_contexts": [
[
"8c26fb4c81c121103c1d5851edb41e-C001-58",
"8c26fb4c81c121103c1d5851edb41e-C001-61"
],
[
"8c26fb4c81c121103c1d5851edb41e-C001-164",
"8c26fb4c81c121103c1d5851edb41e-C001-165"
]
],
"cite_sentences": [
"8c26fb4c81c121103c1d5851edb41e-C001-61",
"8c26fb4c81c121103c1d5851edb41e-C001-164"
]
},
"@SIM@": {
"gold_contexts": [
[
"8c26fb4c81c121103c1d5851edb41e-C001-121"
]
],
"cite_sentences": [
"8c26fb4c81c121103c1d5851edb41e-C001-121"
]
}
}
},
"ABC_febb64368c09d03932742fc557f3d3_6": {
"x": [
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-2",
"text": "In this paper we report our experiments in creating a parallel corpus using German/Simple German documents from the web."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-3",
"text": "We require parallel data to build a statistical machine translation (SMT) system that translates from German into Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-4",
"text": "Parallel data for SMT systems needs to be aligned at the sentence level."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-5",
"text": "We applied an existing monolingual sentence alignment algorithm."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-6",
"text": "We show the limits of the algorithm with respect to the language and domain of our data and suggest ways of circumventing them."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-9",
"text": "Simple language (or, \"plain language\", \"easy-toread language\") is language with low lexical and syntactic complexity."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-10",
"text": "It provides access to information to people with cognitive disabilities (e.g., aphasia, dyslexia), foreign language learners, Deaf people, 1 and children."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-11",
"text": "Text in simple language is obtained through simplification."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-12",
"text": "Simplification is a text-to-text generation task involving multiple operations, such as deletion, rephrasing, reordering, sentence splitting, and even insertion (Coster and Kauchak, 2011a) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-13",
"text": "By contrast, paraphrasing and compression, two other text-to-text generation tasks, involve merely rephrasing and reordering (paraphrasing) and deletion (compression)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-14",
"text": "Text simplification also shares common ground with grammar and style checking as well as with controlled natural language generation."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-15",
"text": "Text simplification approaches exist for various languages, including English, French, Spanish, and Swedish."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-16",
"text": "As Matausch and Nietzio (2012) write, \"plain language is still underrepresented in the German speaking area and needs further development\"."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-17",
"text": "Our goal is to build a statistical machine translation (SMT) system that translates from German into Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-18",
"text": "SMT systems require two corpora aligned at the sentence level as their training, development, and test data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-19",
"text": "The two corpora together can form a bilingual or a monolingual corpus."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-20",
"text": "A bilingual corpus involves two different languages, while a monolingual corpus consists of data in a single language."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-21",
"text": "Since text simplification is a text-totext generation task operating within the same language, it produces monolingual corpora."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-22",
"text": "Monolingual corpora, like bilingual corpora, can be either parallel or comparable."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-23",
"text": "A parallel corpus is a set of two corpora in which \"a noticeable number of sentences can be recognized as mutual translations\" (Tom\u00e1s et al., 2008) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-24",
"text": "Parallel corpora are often compiled from the publications of multinational institutions, such as the UN or the EU, or of governments of multilingual countries, such as Canada (Koehn, 2005) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-25",
"text": "In contrast, a comparable corpus consists of two corpora created independently of each other from distinct sources."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-26",
"text": "Examples of comparable documents are news articles written on the same topic by different news agencies."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-27",
"text": "In this paper we report our experiments in creating a monolingual parallel corpus using German/Simple German documents from the web."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-28",
"text": "We require parallel data to build an SMT system that translates from German into Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-29",
"text": "Parallel data for SMT systems needs to be aligned at the sentence level."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-30",
"text": "We applied an existing monolingual sentence alignment algorithm."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-31",
"text": "We show the limits of the algorithm with respect to the language and domain of our data and suggest ways of circumventing them."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-32",
"text": "The remainder of this paper is organized as follows: In Section 2 we discuss the methodologies pursued and the data used in previous work deal-ing with automatic text simplification."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-33",
"text": "In Section 3 we describe our own approach to building a German/Simple German parallel corpus."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-34",
"text": "In particular, we introduce the data obtained from the web (Section 3.1), describe the sentence alignment algorithm we used (Section 3.2), present the results of the sentence alignment task (Section 3.3), and discuss them (Section 3.4)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-35",
"text": "In Section 4 we give an overview of the issues we tackled and offer an outlook on future work."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-37",
"text": "**APPROACHES TO TEXT SIMPLIFICATION**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-38",
"text": "The task of simplifying text automatically can be performed by means of rule-based, corpus-based, or hybrid approaches."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-39",
"text": "In a rule-based approach, the operations carried out typically include replacing words by simpler synonyms or rephrasing relative clauses, embedded sentences, passive constructions, etc."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-40",
"text": "Moreover, definitions of difficult terms or concepts are often added, e.g., the term web crawler is defined as \"a computer program that searches the Web automatically\"."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-41",
"text": "Gasperin et al. (2010) pursued a rule-based approach to text simplification for Brazilian Portuguese within the PorSimples project, 2 as did Brouwers et al. (2012) for French."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-42",
"text": "As part of the corpus-based approach, machine translation (MT) has been employed."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-43",
"text": "Yatskar et al. (2010) pointed out that simplification is \"a form of MT in which the two 'languages' in question are highly related\"."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-44",
"text": "As far as we can see, Zhu et al. (2010) were the first to use English/Simple English Wikipedia data for automatic simplification via machine translation."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-45",
"text": "3 They assembled a monolingual comparable corpus 4 of 108,016 sentence pairs based on the interlanguage links in Wikipedia and the sentence alignment algorithm of Nelken and Shieber (2006) the comparable Wikipedia data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-46",
"text": "At runtime, an input sentence is parsed and zero or more simplification operations are carried out based on the model probabilities."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-47",
"text": "Specia (2010) used the SMT system Moses (Koehn et al., 2007) to translate from Brazilian Portuguese into a simpler version of this language."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-48",
"text": "Her work is part of the PorSimples project mentioned above."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-49",
"text": "As training data she used 4483 sentences extracted from news texts that had been manually translated into Simple Brazilian Portuguese."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-50",
"text": "5 The results, evaluated automatically with BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) as well as manually, show that the system performed lexical simplification and sentence splitting well, while it exhibited problems in reordering phrases and producing subjectverb-object (SVO) order."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-51",
"text": "To further improve her system Specia (2010) suggested including syntactic information through hierarchical SMT (Chiang, 2005 ) and part-of-speech tags through factored SMT (Hoang, 2007) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-52",
"text": "Coster and Kauchak (2011a; 2011b) translated from English into Simple English using English/Simple English Wikipedia data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-53",
"text": "Like Specia (2010), they applied Moses as their MT system but in addition to the default configuration allowed for phrases to be empty."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-54",
"text": "This was motivated by their observation that 47% of all Simple English Wikipedia sentences were missing at least one phrase compared to their English Wikipedia counterparts."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-55",
"text": "Coster and Kauchak (2011a; 2011b) used four baselines to evaluate their system: input=output, 6 two text compression systems, and vanilla Moses."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-56",
"text": "Their system, Moses-Del, achieved higher automatic MT evaluation scores (BLEU) than all of the baselines."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-57",
"text": "In particular, it outperformed vanilla Moses (lacking the phrase deletion option)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-58",
"text": "Wubben et al. (2012) also worked with English/Simple English Wikipedia data and Moses."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-59",
"text": "They added a post-hoc reranking step: Following their conviction that the output of a simplification system has to be a modified version of the input, 7 they rearranged the 10-best sentences output by Moses such that those differing from the input sentences were given preference over those that were identical."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-60",
"text": "Difference was calculated on the basis of the Levenshtein score (edit distance)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-61",
"text": "Wubben et al. (2012) found their system to work better than that of Zhu et al. (2010) when evaluated with BLEU, but not when evaluated with the Flesch-Kincaid grade level, a common readability metric."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-62",
"text": "Bott and Saggion (2011) presented a monolingual sentence alignment algorithm, which uses a Hidden Markov Model for alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-63",
"text": "In contrast to other monolingual alignment algorithms, Bott and Saggion (2011) introduced a monotonicity restriction, i.e., they assumed the order of sentences to be the same for the original and simplified texts."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-64",
"text": "Apart from purely rule-based and purely corpus-based approaches to text simplification, hybrid approaches exist."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-89",
"text": "Wir freuen uns \u00fcber Ihr Interesse an unserer Arbeit mit und f\u00fcr Menschen mit Behinderung."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-65",
"text": "For example, Bott et al. (2012) in their Simplext project for Spanish 8 let a statistical classifier decide for each sentence of a text whether it should be simplified (corpus-based approach)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-66",
"text": "The actual simplification was then performed by means of a rule-based approach."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-67",
"text": "As has been shown, many MT approaches to text simplification have used English/Simple English Wikipedia as their data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-68",
"text": "The only exception we know of is Specia (2010) , who together with her colleagues in the PorSimples project built her own parallel corpus."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-69",
"text": "This is presumably because there exists no Simple Brazilian Portuguese Wikipedia."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-70",
"text": "The same is true for German: To date, no Simple German Wikipedia has been created."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-71",
"text": "Therefore, we looked for data available elsewhere for our machine translation system designated to translate from German to Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-72",
"text": "We discovered that German/Simple German parallel data is slowly becoming available on the web."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-73",
"text": "In what follows, we describe the data we harvested and report our experience in creating a monolingual parallel corpus from this data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-74",
"text": "3 Building a German/Simple German Parallel Corpus from the Web"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-76",
"text": "**DATA**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-77",
"text": "As mentioned in Section 1, statistical machine translation (SMT) systems require parallel data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-78",
"text": "A common approach to obtain such material is to look for it on the web."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-79",
"text": "9 The use of already available data offers cost and time advantages."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-80",
"text": "Many websites, including that of the German government, 10 contain documents in Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-81",
"text": "However, these documents are often not linked to a single corresponding German document; instead, they are high-level summaries of multiple German documents."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-82",
"text": "A handful of websites exist that offer articles in two versions: a German version, often called Alltagssprache (AS, \"everyday language\"), and a Simple German version, referred to as Leichte Sprache (LS, \"simple language\")."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-83",
"text": "Table 1 lists the websites we used to compile our corpus."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-84",
"text": "The numbers indicate how many parallel articles were extracted."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-85",
"text": "The websites are mainly of organizations that support people with disabilities."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-86",
"text": "We crawled the articles with customized Python scripts that located AS articles and followed the links to their LS correspondents."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-87",
"text": "A sample sentence pair from our data is shown in Example 1."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-88",
"text": "(1) German:"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-90",
"text": "(\"We appreciate your interest in our work with and for people with disabilities.\")"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-91",
"text": "Simple German: Sch\u00f6n, dass Sie sich f\u00fcr unsere Arbeit interessieren."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-92",
"text": "Wir arbeiten mit und f\u00fcr Menschen mit Behinderung."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-93",
"text": "(\"Great that you are interested in our work."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-94",
"text": "We work with and for people with disabilities.\")"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-95",
"text": "The extracted data needed to be cleaned from HTML tags."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-96",
"text": "For our purpose, we considered text and paragraph structure markers as important information; therefore, we retained them."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-97",
"text": "We subsequently tokenized the articles."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-98",
"text": "The resulting corpus consisted of 7755 sentences, which amounted to 82,842 tokens."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-99",
"text": "However, caution is advised when looking at these numbers: Firstly, the tokenization module overgenerated tokens."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-100",
"text": "Secondly, some of the LS articles were identical, either because they summarized multiple AS articles or because they were generic placeholders."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-101",
"text": "Hence, the SMT systems rely on data aligned at the sentence level."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-102",
"text": "Since the data we extracted from the web was aligned at the article level only, we had to perform sentence alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-103",
"text": "For this we split our corpus into a training set (70% of the texts), development set (10%), and test set (20%)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-104",
"text": "We manually annotated sentence alignments for all of the data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-105",
"text": "Example 2 shows an aligned AS/LS sentence pair."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-106",
"text": "(2) German:"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-107",
"text": "In To measure the amount of parallel sentences in our data, we calculated the alignment diversity measure (ADM) of Nelken and Shieber (2006) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-108",
"text": "ADM measures how many sentences are aligned."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-109",
"text": "It is calculated as , where matches is the number of alignments between the two texts T 1 and T 2."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-110",
"text": "ADM is 1.0 in a perfectly parallel corpus, where every sentence from one text is aligned to exactly one sentence in another text."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-111",
"text": "ADM for our corpus was 0.786, which means that approximately 78% of the sentences were aligned."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-112",
"text": "This is a rather high number compared to the values reported by Nelken and Shieber (2006) : Their texts (consisting of encyclopedia articles and gospels) resulted in an ADM of around 0.3."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-113",
"text": "A possible explanation for the large difference in ADM is the fact that most simplified texts in our corpus are solely based on the original texts, whereas the simple versions of the encyclopedia articles might have been created by drawing on external information in addition."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-115",
"text": "**SENTENCE ALIGNMENT ALGORITHM**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-116",
"text": "Sentence alignment algorithms differ according to whether they have been developed for bilingual or monolingual corpora."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-117",
"text": "For bilingual parallel corpora many-typically length-based-algorithms exist."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-118",
"text": "However, our data was monolingual."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-119",
"text": "While the length of a regular/simple language sentence pair might be different, an overlap in vocabulary can be expected."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-120",
"text": "Hence, monolingual sentence alignment algorithms typically exploit lexical similarity."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-121",
"text": "We applied the monolingual sentence alignment algorithm of Barzilay and Elhadad (2003) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-122",
"text": "The algorithm has two main features: Firstly, it uses a hierarchical approach by assigning paragraphs to clusters and learning mapping rules."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-123",
"text": "Secondly, it aligns sentences despite low lexical similarity if the context suggests an alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-124",
"text": "This is achieved through local sequence alignment, a dynamic programming algorithm."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-125",
"text": "The overall algorithm has two phases, a training and a testing phase."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-126",
"text": "The training phase in turn consists of two steps: Firstly, all paragraphs of the texts of one side of the parallel corpus (henceforth referred to as \"AS texts\") are clustered independently of all paragraphs of the texts of the other side of the parallel corpus (henceforth termed \"LS texts\"), and vice versa."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-127",
"text": "Secondly, mappings between the two sets of clusters are calculated, given the reference alignments."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-128",
"text": "As a preprocessing step to the clustering process, we removed stopwords, lowercased all words, and replaced dates, numbers, and names by generic tags."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-129",
"text": "Barzilay and Elhadad (2003) additionally considered every word starting with a capital letter inside a sentence to be a proper name."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-130",
"text": "In German, all nouns (i.e., regular nouns as well as proper names) are capitalized; thus, this approach does not work."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-131",
"text": "We used a list of 61,228 first names to remove at least part of the proper names."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-132",
"text": "We performed clustering with scipy (Jones et al., 2001) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-133",
"text": "We adapted the hierarchical completelink clustering method of Barzilay and Elhadad (2003) : While the authors claimed to have set a specific number of clusters, we believe this is not generally possible in hierarchical agglomerative clustering."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-134",
"text": "Therefore, we used the largest number of clusters in which all paragraph pairs had a cosine similarity strictly greater than zero."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-135",
"text": "Following the formation of the clusters, lexical similarity between all paragraphs of corresponding AS and LS texts was computed to establish probable mappings between the two sets of clusters."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-136",
"text": "Barzilay and Elhadad (2003) used the boosting tool Boostexter (Schapire and Singer, 2000) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-137",
"text": "All possible cross-combinations of paragraphs from the parallel training data served as training instances."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-138",
"text": "An instance consisted of the cosine similarity of the two paragraphs and a string combining the two cluster IDs."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-139",
"text": "The classification result was extracted from the manual alignments."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-140",
"text": "In order for an AS and an LS paragraph to be aligned, at least one sentence from the LS paragraph had to be aligned to one sentence in the AS paragraph."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-141",
"text": "Like Barzilay and Elhadad (2003) , we performed 200 iterations in Boostexter."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-142",
"text": "After learning the mapping rules, the training phase was complete."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-143",
"text": "The testing phase consisted of two additional steps."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-144",
"text": "Firstly, each paragraph of each text in the test set was assigned to the cluster it was closest to."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-145",
"text": "This was done by calculating the cosine similarity of the word frequencies in the clusters."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-146",
"text": "Then, every AS paragraph was combined with all LS paragraphs of the parallel text, and Boostexter was used in classification mode to predict whether the two paragraphs were to be mapped."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-147",
"text": "Secondly, within each pair of paragraphs mapped by Boostexter, sentences with very high lexical similarity were aligned."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-148",
"text": "In our case, the threshold for an alignment was a similarity of 0.5."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-149",
"text": "For the remaining sentences, proximity to other aligned or similar sentences was used as an indicator."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-150",
"text": "This was implemented by local sequence alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-151",
"text": "We set the mismatch penalty to 0.02, as a higher mismatch penalty would have reduced recall."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-152",
"text": "We set the skip penalty to 0.001 conforming to the value of Barzilay and Elhadad (2003) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-153",
"text": "The resulting alignments were written to files."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-154",
"text": "Example 3 shows a successful sentence alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-155",
"text": "(3) German:"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-156",
"text": "Die GWW ist in den Landkreisen B\u00f6blingen und Calw aktiv und bietet an den folgenden Standorten Wohnm\u00f6glichkeiten f\u00fcr Menschen mit Behinderung an -ganz in Ihrer N\u00e4he! (\"The GWW is active in the counties of B\u00f6blingen and Calw and offers housing options for people with disabilities at the following locations -very close to you!\")"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-157",
"text": "Simple German: Die GWW gibt es in den Landkreisen Calw und B\u00f6blingen."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-158",
"text": "Wir haben an den folgenden Orten Wohn-M\u00f6glichkeiten f\u00fcr Sie."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-159",
"text": "(\"The GWW exists in the counties of Calw and B\u00f6blingen."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-160",
"text": "We have housing options for you in the following locations.\")"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-161",
"text": "The algorithm described has been modified in various ways."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-162",
"text": "Nelken and Shieber (2006) used TF/IDF instead of raw term frequency, logistic regression on the cosine similarity instead of clustering, and an extended version of the local alignment recurrence."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-163",
"text": "Both Nelken and Shieber (2006) and Quirk et al. (2004) found that the first sentence of each document is likely to be aligned."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-164",
"text": "We observed the same for our corpus."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-165",
"text": "Therefore, in our algorithm we adopted the strategy of unconditionally aligning the first sentence of each document."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-166",
"text": "Table 2 shows the results of evaluating the algorithm described in the previous section with respect to precision, recall, and F1 measure."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-167",
"text": "We introduced two baselines:"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-169",
"text": "**RESULTS**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-170",
"text": "Adapted algorithm of Barzilay and Elhadad (2003) 27.7% 5.0% 8.5% Baseline I: First sentence 88.1% 4.8% 9.3% Baseline II: Word in common 2.2% 8.2% 3.5% Table 2 : Alignment results on test set 1. Aligning only the first sentence of each text (\"First sentence\") 2."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-171",
"text": "Aligning every sentence with a cosine similarity greater than zero (\"Word in common\")"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-172",
"text": "As can be seen from Table 2 , by applying the sentence alignment algorithm of Barzilay and Elhadad (2003) we were able to extract only 5% of all reference alignments, while precision was below 30%."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-173",
"text": "The rule of aligning the first sentences performed well with a precision of 88%."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-174",
"text": "Aligning all sentences with a word in common clearly showed the worst performance; this is because many sentences have a word in common."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-175",
"text": "Nonetheless, recall was only slightly higher than with the other methods."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-176",
"text": "In conclusion, none of the three approaches (adapted algorithm of Barzilay and Elhadad (2003) , two baselines \"First sentence\" and \"Word in common\") performed well on our test set."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-177",
"text": "We analyzed the characteristics of our data that hampered high-quality automatic alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-179",
"text": "**DISCUSSION**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-180",
"text": "Compared with the results of Barzilay and Elhadad (2003) , who achieved 77% precision at 55.8% recall for their data, our alignment scores were considerably lower (27.7% precision, 5% recall)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-181",
"text": "We found two reasons for this: language challenges and domain challenges."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-182",
"text": "In what follows, we discuss each reason in more detail."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-183",
"text": "While Barzilay and Elhadad (2003) aligned English/Simple English texts, we dealt with German/Simple German data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-184",
"text": "As mentioned in Section 3.2, in German nouns (regular nouns as well as proper names) are capitalized."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-185",
"text": "This makes named entity recognition, a preprocessing step to clustering, more difficult."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-186",
"text": "Moreover, German is an example of a morphologically rich language: Its noun phrases are marked with case, leading to different inflectional forms for articles, pronouns, adjectives, and nouns."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-187",
"text": "English morphology is poorer; hence, there is a greater likelihood of lexical overlap."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-188",
"text": "Similarly, compounds are productive in German; an example from our corpus is Seniorenwohnanlagen (\"housing complexes for the elderly\")."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-189",
"text": "In contrast, English compounds are multiword units, where each word can be accessed separately by a clustering algorithm."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-190",
"text": "Therefore, cosine similarity is more effective for English than it is for German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-191",
"text": "One way to alleviate this problem would be to use extensive morphological decomposition and lemmatization."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-192",
"text": "In terms of domain, Barzilay and Elhadad (2003) used city descriptions from an encyclopedia for their experiments."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-193",
"text": "For these descriptions clustering worked well because all articles had the same structure (paragraphs about culture, sports, etc.)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-194",
"text": "The domain of our corpus was broader: It included information about housing, work, and events for people with disabilities as well as information about the organizations behind the respective websites."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-195",
"text": "Apart from language and domain challenges we observed heavy transformations from AS to LS in our data (Figure 1 shows a sample article in AS and LS)."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-196",
"text": "As a result, LS paragraphs were typically very short and the clustering process returned many singleton clusters."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-197",
"text": "Example 4 shows an AS/LS sentence pair that could not be aligned because of this."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-198",
"text": "(\"He provides them with advice and information.\") Figure 2 shows the dendrogram of the clustering of the AS texts."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-199",
"text": "A dendrogram shows the results of a hierarchical agglomerative clustering."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-200",
"text": "At the bottom of the dendrogram every paragraph is marked by an individual line."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-201",
"text": "At the points where two vertical paths join, the corresponding clusters are merged to a new larger cluster."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-202",
"text": "The Y-axis is the dissimilarity value of the two clusters."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-203",
"text": "In our experiment the resulting clusters are the clusters at dissimilarity 1 \u2212 1 \u221210 ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-204",
"text": "Geometrically this is a horizontal cut just below dissimilarity 1.0."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-205",
"text": "As can be seen from Figure 2 , many of the paragraphs in the left half of the picture are never merged to a slightly larger cluster but are directly connected to the universal cluster that merges everything."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-206",
"text": "This is because they contain only stopwords or only words that do not appear in all paragraphs of another cluster."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-207",
"text": "Such an unbalanced clustering, where many paragraphs are clustered to one cluster and many other paragraphs remain singleton clusters, reduces the precision of the hierarchical approach."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-208",
"text": "----------------------------------"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-209",
"text": "**CONCLUSION AND OUTLOOK**"
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-210",
"text": "In this paper we have reported our experiments in creating a monolingual parallel corpus using German/Simple German documents from the web."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-211",
"text": "We have shown that little work has been done on automatic simplification of German so far."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-212",
"text": "We have described our plan to build a statistical machine translation (SMT) system that translates form German into Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-213",
"text": "SMT systems require parallel corpora."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-214",
"text": "The process of creating a parallel corpus for use in machine translation involves sentence alignment."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-215",
"text": "Sentence alignment algorithms for bilingual corpora differ from those for monolingual corpora."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-216",
"text": "Since all of our data was from the same language, we applied the monolingual sentence alignment approach of Barzilay and Elhadad (2003) ."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-217",
"text": "We have shown the limits of the algorithm with respect to the language and domain of our data."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-218",
"text": "For example, named entity recognition, a preprocessing step to clustering, is harder for German than for English, the language Barzilay and Elhadad (2003) worked with."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-219",
"text": "Moreover, German features richer morphology than English, which leads to less lexical overlap when working on the word form level."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-220",
"text": "The domain of our corpus was also broader than that of Barzilay and Elhadad (2003) , who used city descriptions from an encyclopedia for their experiments."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-221",
"text": "This made it harder to identify common article structures that could be exploited in clustering."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-222",
"text": "As a next step, we will experiment with other monolingual sentence alignment algorithms."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-223",
"text": "In addition, we will build a second parallel corpus for German/Simple German: A person familiar with the task of text simplification will produce simple versions of German texts."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-224",
"text": "We will use the resulting parallel corpus as data for our experiments in automatically translating from German to Simple German."
},
{
"sent_id": "febb64368c09d03932742fc557f3d3-C001-225",
"text": "The parallel corpus we compiled as part of the work described in this paper can be made available to interested parties upon request."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"febb64368c09d03932742fc557f3d3-C001-121"
],
[
"febb64368c09d03932742fc557f3d3-C001-141"
],
[
"febb64368c09d03932742fc557f3d3-C001-172"
]
],
"cite_sentences": [
"febb64368c09d03932742fc557f3d3-C001-121",
"febb64368c09d03932742fc557f3d3-C001-141",
"febb64368c09d03932742fc557f3d3-C001-172"
]
},
"@BACK@": {
"gold_contexts": [
[
"febb64368c09d03932742fc557f3d3-C001-129",
"febb64368c09d03932742fc557f3d3-C001-130",
"febb64368c09d03932742fc557f3d3-C001-131"
],
[
"febb64368c09d03932742fc557f3d3-C001-133",
"febb64368c09d03932742fc557f3d3-C001-134",
"febb64368c09d03932742fc557f3d3-C001-135"
],
[
"febb64368c09d03932742fc557f3d3-C001-136",
"febb64368c09d03932742fc557f3d3-C001-137",
"febb64368c09d03932742fc557f3d3-C001-138",
"febb64368c09d03932742fc557f3d3-C001-139",
"febb64368c09d03932742fc557f3d3-C001-140"
],
[
"febb64368c09d03932742fc557f3d3-C001-176"
],
[
"febb64368c09d03932742fc557f3d3-C001-192",
"febb64368c09d03932742fc557f3d3-C001-193",
"febb64368c09d03932742fc557f3d3-C001-194",
"febb64368c09d03932742fc557f3d3-C001-195",
"febb64368c09d03932742fc557f3d3-C001-196"
],
[
"febb64368c09d03932742fc557f3d3-C001-218",
"febb64368c09d03932742fc557f3d3-C001-219"
],
[
"febb64368c09d03932742fc557f3d3-C001-220",
"febb64368c09d03932742fc557f3d3-C001-221"
]
],
"cite_sentences": [
"febb64368c09d03932742fc557f3d3-C001-129",
"febb64368c09d03932742fc557f3d3-C001-133",
"febb64368c09d03932742fc557f3d3-C001-136",
"febb64368c09d03932742fc557f3d3-C001-176",
"febb64368c09d03932742fc557f3d3-C001-192",
"febb64368c09d03932742fc557f3d3-C001-218",
"febb64368c09d03932742fc557f3d3-C001-220"
]
},
"@EXT@": {
"gold_contexts": [
[
"febb64368c09d03932742fc557f3d3-C001-129",
"febb64368c09d03932742fc557f3d3-C001-130",
"febb64368c09d03932742fc557f3d3-C001-131"
],
[
"febb64368c09d03932742fc557f3d3-C001-133",
"febb64368c09d03932742fc557f3d3-C001-134",
"febb64368c09d03932742fc557f3d3-C001-135"
],
[
"febb64368c09d03932742fc557f3d3-C001-170"
]
],
"cite_sentences": [
"febb64368c09d03932742fc557f3d3-C001-129",
"febb64368c09d03932742fc557f3d3-C001-133",
"febb64368c09d03932742fc557f3d3-C001-170"
]
},
"@SIM@": {
"gold_contexts": [
[
"febb64368c09d03932742fc557f3d3-C001-141"
],
[
"febb64368c09d03932742fc557f3d3-C001-152"
],
[
"febb64368c09d03932742fc557f3d3-C001-216"
]
],
"cite_sentences": [
"febb64368c09d03932742fc557f3d3-C001-141",
"febb64368c09d03932742fc557f3d3-C001-152",
"febb64368c09d03932742fc557f3d3-C001-216"
]
},
"@DIF@": {
"gold_contexts": [
[
"febb64368c09d03932742fc557f3d3-C001-176"
],
[
"febb64368c09d03932742fc557f3d3-C001-180"
],
[
"febb64368c09d03932742fc557f3d3-C001-183"
],
[
"febb64368c09d03932742fc557f3d3-C001-192",
"febb64368c09d03932742fc557f3d3-C001-193",
"febb64368c09d03932742fc557f3d3-C001-194",
"febb64368c09d03932742fc557f3d3-C001-195",
"febb64368c09d03932742fc557f3d3-C001-196"
],
[
"febb64368c09d03932742fc557f3d3-C001-220",
"febb64368c09d03932742fc557f3d3-C001-221"
]
],
"cite_sentences": [
"febb64368c09d03932742fc557f3d3-C001-176",
"febb64368c09d03932742fc557f3d3-C001-180",
"febb64368c09d03932742fc557f3d3-C001-183",
"febb64368c09d03932742fc557f3d3-C001-192",
"febb64368c09d03932742fc557f3d3-C001-220"
]
}
}
},
"ABC_5a6684d978c0dbcfaabb4bc2314aeb_6": {
"x": [
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-2",
"text": "Word alignment is a fundamental step in machine translation."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-3",
"text": "Current statistical machine translation systems suffer from a major drawback: they only extract rules from 1-best alignments, which adversely affects the rule sets quality due to alignment mistakes."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-4",
"text": "To alleviate this problem, we extract hierarchical rules from weighted alignment matrix (Liu et al., 2009) ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-5",
"text": "Since the sub-phrase pairs would change the inside and outside areas in the weighted alignment matrix of the hierarchical rules, we propose a new algorithm to calculate the relative frequencies and lexical weights of hierarchical rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-6",
"text": "To achieve a balance between rule table size and performance, we construct a scoring measure that incorporates both frequency and lexical weight to select the best target phrase for each source phrase."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-7",
"text": "Experiments show that our approach improves BLEU score by ranging from 1.4 to 1.9 points over baseline for hierarchical phrase-based, and 1.4 to 1.5 points for tree-to-string model."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-10",
"text": "Word alignment plays an important role in statistical machine translation (SMT)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-11",
"text": "Most SMT systems, not only phrase-based models (Och and Ney, 2004; Koehn et al., 2003; Xiong et al., 2006) , but also syntax-based models (Chiang, 2005; Galley et al., 2006; Huang et al., 2006; Shen et al., 2008) , usually extract rules from word aligned corpora."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-12",
"text": "However, these systems suffer from a major drawback: they only extract rules from 1-best alignments, which adversely affects the rule sets quality due to alignment mistakes."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-13",
"text": "Typically, syntax-based models are more sensitive to word alignments because they care about inside (i.e., subtracted phrases)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-14",
"text": "Figure 1 (a) shows an alignment of a sentence pair."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-38",
"text": "Here N is an n-best list, p(a) is the probability of an alignment a in the n-best list."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-15",
"text": "Since there is a wrong link (de, of), we could not extract many useful hierarchical rules such as (zhongguo X 1 jingji, China X 1 economy).To alleviate this problem, a natural solution is to extract rules from nbest alignments (Venugopal et al., 2008) ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-16",
"text": "However, using n-best alignments still face two major challenges."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-17",
"text": "First, n-best alignments have to be processed individually although they share many links, see (zhongguo, China) and (jingji, economy) in Figure 1 ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-18",
"text": "Second, regardless of probabilities of links in each alignment, numerous wrong rule would be extracted from n-best alignments."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-19",
"text": "For example, a wrong rule (X 1 de jingji, of X 1 's economy) would be extracted from the alignment in Figure 1 (a)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-20",
"text": "Since Liu et al. (2009) show that weighted alignment matrix provides an elegant solution to these two drawbacks, we apply it to the hierarchical phrase-based model (Chiang, 2005) and the tree-to-string model Huang et al., 2006) ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-21",
"text": "While such an idea seems intuitive, it is non-trivial to extract hierarchical rules from weighted alignment matrices."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-22",
"text": "Our work faces two major challenges."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-23",
"text": "The first is how to calculate the relative frequencies and lex- ical weights of the rules with non-terminals (NTs)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-24",
"text": "The sub-phrase pairs that are replaced with NTs in a rule, would change the inside and outside areas in the weighted alignment matrix of the rule."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-25",
"text": "In addition, the sub-phrase pairs have their own probabilities and we should incorporate them to better estimate the probabilities of the hierarchical rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-26",
"text": "Therefore, the calculations of relative frequencies and lexical weights for hierarchical rules are more complicated."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-27",
"text": "Another challenge is how to achieve a balance between performance and rule table size."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-28",
"text": "Note that given a source phrase, there would be plenty of \"potential\" candidate target phrases in weighted matrices (Liu et al., 2009 )."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-29",
"text": "If we retain all of them, these phrase pairs would produce even more hierarchical rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-30",
"text": "For computational tractability, we need to design a measure to score the phrase pairs and wipe out the low-quality ones."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-31",
"text": "We propose a new algorithm to calculate the relative frequencies of rules, and construct a measure that incorporates both frequency and lexical weight to score target phrases."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-32",
"text": "Experiments (Section 4) show that our approach improves BLEU score by ranging from 1.4 to 1.9 points over baseline for hierarchical phrase-based, and 1.4 to 1.8 points for tree-to-string model."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-34",
"text": "**WEIGHTED ALIGNMENT MATRIX**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-35",
"text": "A weighted alignment matrix (Liu et al., 2009) m is a J \u00d7 I matrix to encode the probabilities of n-best alignments of the same sentence pair."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-36",
"text": "Each element in the matrix stores a link probability p m (j, i), which is estimated from an n-best list by calculating relative frequencies:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-37",
"text": "where"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-39",
"text": "The numbers in the cells in Figure 2 (c) are the corresponding p m ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-40",
"text": "Since p m (j, i) is the probability that f j and e i are aligned, the probability that the two words are not aligned is Figure 2 shows an example."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-41",
"text": "The probability for the two words zhongguo and China being aligned is 1.0 and the probability that they are not aligned is 0.0."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-42",
"text": "In another way, the two words are definitely aligned."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-43",
"text": "Given a phrase pair (f"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-44",
"text": "The key point to calculate the relative frequency of the phrase pair is to obtain its fractional count."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-45",
"text": "Liu et al. (2009) use the product of inside and outside probabilities as the fractional count of a phrase pair."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-46",
"text": "Liu et al. (2009) define that inside probability indicates the probability that at least one word in source phrase is aligned to a word in target phrase, and outside probability indicates the chance that no words in one phrase are aligned to a word outside the other phrase."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-47",
"text": "The fractional count is calculated:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-48",
"text": "where \u03b1(\u00b7) and \u03b2(\u00b7) denote the inside and outside probabilities respectively, which can be calculated as"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-49",
"text": "Here in(\u00b7) denotes the inside area, which includes elements that fall inside the phrase pair, while out(\u00b7) denotes the outside area including elements that fall outside the phrase pair while fall in the same row or the same column."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-50",
"text": "Figure 3 shows an example."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-51",
"text": "The light shading area is the outside area of phrase pair and the area inside the pane with bold lines is the inside area."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-52",
"text": "To calculate the lexical weights, Liu et al. (2009) adapt p m (j, i) as the fractional count count(f j , e i )."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-53",
"text": "The fractional counts of NULL words can be calculated as: Then the lexical weight can be calculated as:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-54",
"text": "where"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-55",
"text": "We apply weighted alignment matrix to the hierarchical phrase-based model (Chiang, 2007) and the tree-to-string model Huang et al., 2006) ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-57",
"text": "**RULE EXTRACTION**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-58",
"text": "In hierarchical rules, both source and target sides are strings with NTs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-59",
"text": "In tree-to-string rules, the source side is a tree with NTs, while the target side is a string with NTs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-60",
"text": "Since the tree structure of source side has no effect on the calculations of relative frequencies and lexical weights, we can represent both tree-to-string and hierarchical rules as below:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-61",
"text": "where X is a nonterminal, \u03b3 and \u03b1 are source and target strings (consist of terminals and NTs), and \u223c represents word alignments between NTs in \u03b3 and \u03b1."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-62",
"text": "The bulk of syntax grammars consists of two parts: phrase pairs and variable rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-63",
"text": "The difference between them is whether or not they contain NTs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-64",
"text": "Since we can calculate relative frequencies and lexical weights of phrase pairs as in Liu et al. (2009), we focus only on the calculation for variable rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-66",
"text": "**EXTRACTION ALGORITHM**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-67",
"text": "Following Chiang (2007), our extraction algorithm involves two steps."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-68",
"text": "First, we extract phrase pairs from weighted alignment matrices."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-69",
"text": "Then, we obtain variable rules by replacing sub-phrase pairs with NTs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-70",
"text": "Figure 4 shows the algorithm of extracting phrase pairs from a weighted matrix for the hierarchical phrase-based model."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-71",
"text": "The input of the algorithm is a sentence pair (f_1^J, e_1^I), where both are"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-78",
"text": "strings, a weighted alignment matrix m, and a phrase length limit l. Note that we retain only the highest-scoring target phrase for each source phrase (lines 13-16)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-79",
"text": "We describe these in Section 3.2."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-80",
"text": "After we extract phrase pairs, we can obtain variable rules by replacing sub-phrase pairs with NTs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-81",
"text": "We can also extend this algorithm to the tree-to-string model."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-82",
"text": "The difference is that the source side should be a tree instead of a string, and additional syntactic constraints apply."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-83",
"text": "Liu et al. (2009) show that, given a source phrase, there can be multiple \"potential\" candidate target phrases in weighted matrices."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-84",
"text": "Table 1 lists some candidate target phrases of the source phrase zhongguo de jingji in Figure 3 . If we retained all of them, it would lead to an exponentially growing rule table."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-85",
"text": "To balance rule table size against performance, we select only the best candidate target phrase."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-87",
"text": "**SELECTION CRITERIA**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-88",
"text": "An interesting finding is that a target phrase with the largest fractional count is not always the best one."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-89",
"text": "For example, in Table 1 the target phrase \"of China 's economy\" has a larger fractional count than \"China 's economy\"."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-90",
"text": "However, we can see that (zhongguo de jingji, China 's economy) is better."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-91",
"text": "To alleviate this problem, we incorporate lexical weight to distinguish good target phrases from bad ones."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-92",
"text": "While frequency indicates how often the source phrase and target phrase occur together, lexical weight models the correspondence between them."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-93",
"text": "Therefore, we can construct a scoring measure that incorporates both frequency and lexical weight."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-94",
"text": "The scoring equation below models this effect:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-95",
"text": "where \u03c9 is the interpolation weight, count(f ,\u1ebd) is calculated by Equation 5, and p w (\u1ebd|f , m) by Equation 8."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-96",
"text": "In practice, we set \u03c9 = 0.5."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-97",
"text": "Suppose p_w(China 's economy | zhongguo de jingji) is 0.7 and p_w(of China 's economy | zhongguo de jingji) is 0.4; then we should choose the target phrase China 's economy even though of China 's economy has a larger fractional count."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-98",
"text": "Note that we select the best target phrase for each source phrase within a single sentence only."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-99",
"text": "This means there can still be many target phrases for each source phrase during decoding."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-100",
"text": "Figure 5 shows an example of the matrix of a hierarchical rule, which is generated from the phrase pair in Figure 3 ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-101",
"text": "Due to the existence of sub-phrase pairs, the inside and outside areas change (see the difference between Figure 3 and Figure 5 )."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-102",
"text": "Therefore, we cannot simply calculate the outside probability of a hierarchical rule as the product of the outside probabilities of the phrase pair and its sub-phrase pairs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-104",
"text": "**CALCULATING RELATIVE FREQUENCIES**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-105",
"text": "We follow Liu et al. (2009) to calculate relative frequencies using the product of inside and outside probabilities."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-106",
"text": "We now extend the definitions of inside and outside probabilities to hierarchical rules that contain NTs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-107",
"text": "Table 2 : Some hierarchical rules generated from the phrase pair (zhongguo de jingji, China's economy) in Figure 3 (suppose the structure of zhongguo de jingji is a complete sub-tree)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-108",
"text": "Here \u03b1 is inside probability, \u03b2 is outside probability, and count is fractional count."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-109",
"text": "Given a variable rule (f\u2032, e\u2032), which is generated from the phrase pair (f_{j1}^{j2}, e_{i1}^{i2}"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-110",
"text": ") by replacing sub-phrase pairs with X. We denote by R the variable rule, by P the phrase pair (f_{j1}^{j2}, e_{i1}^{i2}), and by X_k the k-th sub-phrase pair that is replaced with X. The inside probability of a variable rule is then calculated as:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-111",
"text": "We tried to follow the constraints of Chiang (2007): (1) unaligned words are not allowed at the edges of phrases; (2) a rule must have at least one pair of aligned words."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-112",
"text": "This would take into account the terminals in the variable rule, but make the calculation more complicated (especially constraint (1))."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-113",
"text": "However, it didn't work well."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-114",
"text": "Therefore, we only require that the rule respect the word alignment, which means a terminal inside a phrase cannot align to a word outside the phrase (using the outside probability)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-141",
"text": "Finally, we construct weighted alignment matrices from these n-best alignments."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-142",
"text": "We first report results trained on a small-scale corpus, and then scale to a larger one."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-143",
"text": "When extracting tree-to-string rules, we limit the maximal height of rules to 3."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-144",
"text": "We use the pruning threshold: t = 0.5."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-146",
"text": "**RESULTS ON SMALL DATA**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-147",
"text": "To test the effect of our approach, we first carried out experiments on the FBIS corpus, which contains 230K sentence pairs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-148",
"text": "Table 3 shows the rule table size and translation quality."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-149",
"text": "Using n-best alignments slightly improved the BLEU score, but at the cost of much slower extraction, since each of the top-n alignments has to be processed individually even though they share many alignment links."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-150",
"text": "Matrix-based extraction, by contrast, is much faster due to packing and produces consistently better BLEU scores."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-151",
"text": "The absolute improvements over 1-best alignments, ranging from +1.6 to +1.8 BLEU points for the hierarchical phrase-based model and +1.4 to +1.8 BLEU points for the tree-to-string model, are statistically significant at p < 0.01 using the sign-test (Collins et al., 2005) ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-152",
"text": "Basically, in the matrix case of the hierarchical phrase-based model, we can use about twice as many rules as in the 1-best case, or 1.3 times as many as with 10-best extraction."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-153",
"text": "However, in the tree-to-string scenario, matrix-based extraction produces fewer rules than k-best extraction."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-154",
"text": "We attribute this to the additional complete sub-tree constraint."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-156",
"text": "**RESULTS ON LARGE DATA**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-157",
"text": "We also conducted experiments on larger training data, containing 1.5M sentence pairs from the LDC dataset."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-158",
"text": "The rule-table size and BLEU scores are shown in Table 4 ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-159",
"text": "An interesting finding is that BLEU scores decline when using k-best extraction in some cases."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-160",
"text": "We conjecture that some low-quality rules that harm decoder performance are extracted from k-best alignments."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-161",
"text": "Using weighted matrices on the larger corpus also achieved significant and consistent improvements over using 1-best and n-best lists."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-162",
"text": "These results confirm that our approach is a promising direction for syntax-based machine translation."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-164",
"text": "**COMPARISON OF PARAMETER ESTIMATION**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-165",
"text": "In this section we investigate how many rules are shared by n-best and matrix-based extraction on the small data (FBIS corpus)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-166",
"text": "Our motivation is that weighted alignment matrices have been reported to be beneficial for better estimation of rule translation probabilities and lexical weights. Table 3 : Results with different rule extraction methods on small data."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-167",
"text": "Here 1-best, 10-best and m(10) denote 1-best alignments, 10-best lists and weighted matrices estimated from 10-best lists respectively."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-168",
"text": "The rules are filtered on the corresponding test set."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-169",
"text": "\"Extraction\" denotes extraction time in milliseconds per sentence pair."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-170",
"text": "We evaluate translation quality using the case-insensitive 4-gram BLEU metric."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-171",
"text": "Table 4 : Results with different rule extraction methods on large data."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-172",
"text": "We use m(10) for the weighted matrices estimated from 10-best lists."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-173",
"text": "(Liu et al., 2009 )."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-174",
"text": "The experiments are tested on NIST 2005 dataset."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-175",
"text": "Table 5 gives some statistics."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-176",
"text": "We use m(10) for the weighted matrices estimated from 10-best lists."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-177",
"text": "\"All\" denotes the full rule table, \"Shared\" denotes the intersection of two tables, and \"Nonshared\" denotes the complement."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-178",
"text": "In the hierarchical phrase-based case, 18.8% of the rules learned from weighted matrices were included in both tables, compared with 36.5% for tree-to-string rules, indicating that the complete sub-tree constraint played an important role in matrix-based tree-to-string rule extraction."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-179",
"text": "Note that the probabilities of \"Shared\" rules are different for the two approaches."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-180",
"text": "Liu et al. (2009) show that using matrices outperformed using n-best lists even with the same rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-181",
"text": "Our experiments confirmed these findings."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-183",
"text": "**BEST RULE OR MORE RULES**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-184",
"text": "One might argue that using more rules could improve performance, especially for the tree-to-string model."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-185",
"text": "Therefore, we carried out experiments on the small data for the tree-to-string model to investigate which is better."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-186",
"text": "Note that even though we retain the best target side for each source side per sentence, there can still be many target sides for each source side during decoding."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-187",
"text": "Table 6 shows the results of the different criteria."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-188",
"text": "The first column \"Criteria\" indicates how many target phrases are preserved: the best one, or all phrases that reach the pruning threshold."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-189",
"text": "We can see that \"More Rules\" does not outperform \"Best Rule\" even when using almost 2.5 times as many rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-190",
"text": "One possible reason is that it may introduce some low-quality target phrases such as \"of China 's economy\" in Table 1 , which generate more substandard variable rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-191",
"text": "Table 6 : Comparison of rule tables learned from weighted matrices using different criteria."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-192",
"text": "\"Best Rule\" denotes the rule table using the criterion described in Section 3.2; \"More Rules\" denotes the rule table using the criterion that retains all candidate target phrases that reach the pruning threshold."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-193",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-194",
"text": "**RELATED WORK**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-195",
"text": "Huang (2008) and Tu et al. (2010) use forests instead of 1-best trees; Venugopal et al. (2003) and Deng et al. (2008) soften the alignment consistency constraint to extract more rules; Dyer et al. (2008) use word lattices instead of 1-best segmentations to generate more alignments for a sentence pair; and Venugopal et al. (2008) use n-best alignments directly for rule extraction."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-196",
"text": "To generate larger rule sets, de Gispert et al. (2010) extract hierarchical rules from alignment posterior probabilities."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-197",
"text": "They focus on how to extract larger rule sets using a simple yet powerful hierarchical grammar, while we focus on whether the weighted alignment matrix can overcome alignment errors for different translation models (e.g. phrase-based, hierarchical phrase-based and tree-based models)."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-198",
"text": "They use phrase posteriors as the fractional count, while we use the product of inside and outside probabilities."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-199",
"text": "Besides, they filter rules after extracting all rules from the corpus, while we prune rules during extraction."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-200",
"text": "Liu et al. (2009) proposed a new structure, the weighted alignment matrix, that makes better use of noisy alignments."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-201",
"text": "Since weighted matrices have proven effective for the phrase-based model, we apply them to syntax-based models, which are more sensitive to word alignments."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-202",
"text": "Due to the structural difference between phrases and hierarchical rules, we develop new algorithms to calculate relative frequencies and lexical weights of hierarchical rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-203",
"text": "To achieve a balance between rule table size and performance, we develop a scoring measure that incorporates both frequency and lexical weight to select the best target phrase for each source phrase."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-204",
"text": "Our experiments show that our approach improves the BLEU score significantly, with reasonable extraction speed, indicating that the weighted alignment matrix also works for syntax-based models."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-205",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-206",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-207",
"text": "Besides the hierarchical phrase-based model and the tree-to-string model, our method is also applicable to other paradigms such as string-to-tree models (Galley et al., 2006) and string-to-dependency models (Shen et al., 2008) ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-208",
"text": "Another interesting direction is to use a simpler alignment model that can compute alignment point posteriors directly, such as the word-based ITG model (Zhang and Gildea, 2005; Haghighi et al., 2009 )."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-115",
"text": "Accordingly, the outside probability is calculated as:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-116",
"text": "where"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-117",
"text": "For example, the inside probability of (X_1 de jingji, X_1 's economy) in Figure 5 is 1.0, and its outside probability is 0.4."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-118",
"text": "We also use Equation 5 to calculate the fractional counts of hierarchical rules."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-119",
"text": "We follow Liu et al. (2009) in pruning the rule table using a frequency threshold."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-120",
"text": "Table 2 lists some hierarchical rules generated from the phrase pair (zhongguo de jingji, China's economy) in Figure 3 ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-121",
"text": "If the threshold is 0.2, we retain all the rules in Table 2 ."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-123",
"text": "**CALCULATING LEXICAL WEIGHTS**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-124",
"text": "We denote by S_R the words on the source side of the inside area of variable rule R, and by T_R the words on the target side."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-125",
"text": "For the rule (X_1 de jingji, X_1 's economy) in Figure 5 , S_R is {de, jingji} and T_R is {'s, economy}. Then we can calculate the lexical weight as:"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-126",
"text": "Note that we only consider word pairs (f_j, e_i) in the inside area of the variable rule."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-127",
"text": "For example, the lexical weight of (X_1 de jingji,"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-128",
"text": "Here the probability that economy translates a source NULL token is 0.0."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-130",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-132",
"text": "**DATA PREPARATION**"
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-133",
"text": "Our experiments are on Chinese-English translation, based on replications of the hierarchical phrase-based system (Chiang, 2007) and the tree-to-string system."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-134",
"text": "We train a 4-gram language model on the Xinhua portion of the GIGAWORD corpus using the SRI Language Modeling Toolkit (Stolcke, 2002) with modified Kneser-Ney smoothing (Kneser and Ney, 1995). To obtain weighted alignment matrices, we follow Venugopal et al. (2008) in producing n-best lists via GIZA++."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-135",
"text": "We produce 20-best lists in the two translation directions, then apply \"grow-diag-final-and\" (Koehn et al., 2003) to all 20 \u00d7 20 bidirectional alignment pairs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-136",
"text": "We follow Liu et al. (2009) in using p_s2t \u00d7 p_t2s as the probability of an alignment pair."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-137",
"text": "Analogously, we discard duplicate alignments produced from different alignment pairs."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-138",
"text": "After these steps, there are 110 candidate alignments on average for each sentence pair."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-139",
"text": "We obtain n-best lists by selecting the top n alignments from the 110-best lists."
},
{
"sent_id": "5a6684d978c0dbcfaabb4bc2314aeb-C001-140",
"text": "We re-estimate the probability of each alignment in the n-best list using re-normalization (Venugopal et al., 2008) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-3",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-4"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-27",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-28",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-29",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-30",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-31"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-35",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-36",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-37",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-38"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-45",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-46",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-47",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-48",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-49"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-52"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-64"
]
],
"cite_sentences": [
"5a6684d978c0dbcfaabb4bc2314aeb-C001-4",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-28",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-35",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-52",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-64"
]
},
"@MOT@": {
"gold_contexts": [
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-3",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-4"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-27",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-28",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-29",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-30",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-31"
]
],
"cite_sentences": [
"5a6684d978c0dbcfaabb4bc2314aeb-C001-4",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-28"
]
},
"@USE@": {
"gold_contexts": [
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-20",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-21"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-64"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-105"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-119"
],
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-136",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-137"
]
],
"cite_sentences": [
"5a6684d978c0dbcfaabb4bc2314aeb-C001-20",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-64",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-105",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-119",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-136"
]
},
"@DIF@": {
"gold_contexts": [
[
"5a6684d978c0dbcfaabb4bc2314aeb-C001-27",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-28",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-29",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-30",
"5a6684d978c0dbcfaabb4bc2314aeb-C001-31"
]
],
"cite_sentences": [
"5a6684d978c0dbcfaabb4bc2314aeb-C001-28"
]
}
}
},
"ABC_d70e69bb3eaa6b46ee3b7110126129_6": {
"x": [
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-85",
"text": "**DETECTING THE GENDER SPACE**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-2",
"text": "Gender bias heavily impacts natural language processing applications."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-3",
"text": "Word embeddings have clearly been shown to both retain and amplify gender biases present in current data sources."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-4",
"text": "Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations dependent on the sentence they appear in."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-5",
"text": "In this paper, we study the impact of this conceptual change in word embedding computation in relation to gender bias."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-6",
"text": "Our analysis includes different measures previously applied in the literature to standard word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-7",
"text": "Our findings suggest that contextualized word embeddings are less biased than standard ones even when the latter are debiased."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-10",
"text": "Social biases in machine learning, in general and in natural language processing (NLP) applications in particular, are raising the alarm of the scientific community."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-11",
"text": "Examples of these biases include evidence that face recognition and speech recognition systems work better for white men than for ethnic minorities (Buolamwini and Gebru, 2018) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-12",
"text": "Examples in the area of NLP are machine translation, where systems tend to ignore coreference information in favor of a stereotype (Font and Costa-juss\u00e0, 2019) , and sentiment analysis, where sentiment intensity predictions are biased toward a particular gender (Kiritchenko and Mohammad, 2018) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-13",
"text": "In this work we focus on the particular NLP area of word embeddings (Mikolov et al., 2010) , which represent words in a numerical vector space."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-14",
"text": "Word embeddings representation spaces are known to present geometrical phenomena mimicking relations and analogies between words (e.g. man is to woman as king is to queen)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-15",
"text": "Following this property of finding relations or analogies, one popular example of gender bias is the word association between man to computer programmer as woman to homemaker (Bolukbasi et al., 2016) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-16",
"text": "Pre-trained word embeddings are used in many NLP downstream tasks, such as natural language inference (NLI), machine translation (MT) or question answering (QA)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-17",
"text": "Recent progress in word embedding techniques has been achieved with contextualized word embeddings (Peters et al., 2018) which provide different vector representations for the same word in different contexts."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-18",
"text": "While gender bias has been studied, detected and partially addressed for standard word embeddings techniques (Bolukbasi et al., 2016; Zhao et al., 2018a; Gonen and Goldberg, 2019) , it is not the case for the latest techniques of contextualized word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-19",
"text": "Only recently, Zhao et al. (2019) presented a first analysis of the topic based on the methods proposed in Bolukbasi et al. (2016) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-20",
"text": "In this paper, we further analyse the presence of gender biases in contextualized word embeddings by means of the methods proposed in Gonen and Goldberg (2019) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-21",
"text": "For this, in section 2 we provide an overview of the relevant work on which we build our analysis; in section 3 we state the specific research questions addressed in this work; in section 4 we describe the experimental framework proposed to address them; in section 5 we present and discuss the obtained results; finally, in section 6 we draw the conclusions of our work and propose further research."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-22",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-23",
"text": "**BACKGROUND**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-24",
"text": "In this section we describe the relevant NLP techniques used along the paper, including word embeddings, their debiased version and contextualized word representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-26",
"text": "**WORDS EMBEDDINGS**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-27",
"text": "Word embeddings are distributed representations in a vector space."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-28",
"text": "These vectors are normally learned from large corpora and are then used in downstream tasks like NLI, MT, etc."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-29",
"text": "Several approaches have been proposed to compute those vector representations, with word2vec (Mikolov et al., 2013) being one of the dominant options."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-30",
"text": "Word2vec proposes two variants: continuous bag of words (CBoW) and skipgram, both consisting of a single hidden layer neural network train on predicting a target word from its context words for CBoW, and the opposite for the skipgram variant."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-31",
"text": "The outcome of word2vec is an embedding table, where a numeric vector is associated to each of the words included in the vocabulary."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-32",
"text": "These vector representations, which in the end are computed on co-occurrence statistics, exhibit geometric properties resembling the semantics of the relations between words."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-33",
"text": "This way, subtracting the vector representations of two related words and adding the result to a third word, results in a representation that is close to the application of the semantic relationship between the two first words to the third one."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-34",
"text": "This application of analogical relationships have been used to showcase the bias present in word embeddings, with the prototypical example that when subtracting the vector representation of man from that of computer and adding it to woman, we obtain homemaker."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-36",
"text": "**DEBIASED WORD EMBEDDINGS**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-37",
"text": "Human-generated corpora suffer from social biases."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-38",
"text": "Those biases are reflected in the cooccurrence statistics, and therefore learned into word embeddings trained in those corpora, amplifying them (Bolukbasi et al., 2016; Caliskan et al., 2017) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-39",
"text": "Bolukbasi et al. (2016) studied from a geometrical point of view the presence of gender bias in word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-40",
"text": "For this, they compute the subspace where the gender information concentrates by computing the principal components of the difference of vector representations of male and female gender-defining word pairs."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-41",
"text": "With the gender subspace, the authors identify direct and indirect biases in profession words."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-42",
"text": "Finally, they mitigate the bias by nullifying the information in the gender subspace for words that should not be associated to gender, and also equalize their distance to both elements of gender-defining word pairs."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-43",
"text": "Zhao et al. (2018b) proposed an extension to GloVe embeddings (Pennington et al., 2014) where the loss function used to train the embeddings is enriched with terms that confine the gender information to a specific portion of the embedded vector."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-44",
"text": "The authors refer to these pieces of information as protected attributes."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-45",
"text": "Once the embeddings are trained, the gender protected attribute can be simply removed from the vector representation, therefore eliminating any gender bias present in it."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-46",
"text": "The transformations proposed by both Bolukbasi et al. (2016) and Zhao et al. (2018b) are downstream task-agnostic."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-47",
"text": "This fact is used in the work of Gonen and Goldberg (2019) to showcase that, while apparently the embedding information is removed, there is still gender information remaining in the vector representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-49",
"text": "**CONTEXTUALIZED WORD EMBEDDINGS**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-50",
"text": "Pretrained Language Models (LM) like ULMfit (Howard and Ruder, 2018) , ELMo (Peters et al., 2018) , OpenAI GPT (Radford, 2018; Radford et al., 2019) and BERT (Devlin et al., 2018) , proposed different neural language model architectures and made their pre-trained weights available to ease the application of transfer learning to downstream tasks, where they have pushed the state-of-the-art for several benchmarks including question answering on SQuAD, NLI, cross-lingual NLI and named identity recognition (NER)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-51",
"text": "While some of these pre-trained LMs, like BERT, use subword level tokens, ELMo provides word-level representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-52",
"text": "and Liu et al. (2019) confirmed the viability of using ELMo representations directly as features for downstream tasks without re-training the full model on the target task."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-53",
"text": "Unlike word2vec vector representations, which are constant regardless of their context, ELMo representations depend on the sentence where the word appears, and therefore the full model has to be fed with each whole sentence to get the word representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-54",
"text": "The neural architecture proposed in ELMo (Peters et al., 2018) consists of a character-level convolutional layer processing the characters of each word and creating a word representation that is then fed to a 2-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997), trained on language modeling task on a large corpus."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-55",
"text": "Given the high impact of contextualized word embeddings in the area of NLP and the social consequences of having biases in such embeddings, in this work we analyse the presence of bias in these contextualized word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-56",
"text": "In particular, we focus on gender biases, and specifically on the following questions:"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-57",
"text": "\u2022 Do contextualized word embeddings exhibit gender bias and how does this bias compare to standard and debiased word embeddings?"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-58",
"text": "\u2022 Do different evaluation techniques identify bias similarly and what would be the best measure to use for gender bias detection in contextualized embeddings?"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-59",
"text": "To address these questions, we adapt and contrast with the evaluation measures proposed by Bolukbasi et al. (2016) and Gonen and Goldberg (2019) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-61",
"text": "**EXPERIMENTAL FRAMEWORK**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-62",
"text": "As follows, we define the data and resources that we are using for performing our experiments."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-63",
"text": "We also motivate the approach that we are using for contextualized word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-64",
"text": "We worked with the English-German news corpus from the WMT18 1 ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-65",
"text": "We used the English side with 464,947 lines and 1,004,6125 tokens."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-66",
"text": "To perform our analysis we used a set of lists from previous work (Bolukbasi et al., 2016; Gonen and Goldberg, 2019) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-67",
"text": "We refer to the list of definitional pairs 2 as 'Definitonal List' (e.g. shehe, girl-boy)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-68",
"text": "We refer to the list of female and male professions 3 as 'Professional List' (e.g. accountant, surgeon)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-69",
"text": "The 'Biased List' is the list used in the clustering experiment and it consists of biased male and female words (500 female biased tokens and 500 male biased token)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-70",
"text": "This list is generated by taking the most biased words, where the bias of a word is computed by taking its projection on the gender direction ( \u2212 \u2192 he-\u2212\u2192 she) (e.g. breastfeeding, bridal and diet for female and hero, cigar and teammates for male)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-71",
"text": "The 'Extended Biased List' is the list used in classification experiment , which contains 5000 male and female biased tokens, 2500 for each gender, generated in the same way of the Biased List 4 ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-72",
"text": "A note to be considered, is that the lists we used in our experiments (and obtained from Bolukbasi et al. (2016) and Gonen and Goldberg (2019) ) may contain words that are missing in our corpus and so we can not obtain contextualized embeddings for them."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-73",
"text": "Among different approaches to contextualized word embeddings (mentioned in section 2), we choose ELMo (Peters et al., 2018) as contextualized word embedding approach."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-74",
"text": "The motivation for using ELMo instead of other approaches like BERT (Devlin et al., 2018) is that ELMo provides word-level representations, as opposed to BERT's subwords."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-75",
"text": "This makes it possible to study the word-level semantic traits directly, without resorting to extra steps to compose word-level information from the subwords that could interfere with our analyses."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-77",
"text": "**EVALUATION MEASURES AND RESULTS**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-78",
"text": "There is no standard measure for gender bias, and even less for such the recently proposed contextualized word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-79",
"text": "In this section, we adapt gender bias measures for word embedding methods from previous work (Bolukbasi et al., 2016) and (Gonen and Goldberg, 2019) to be applicable to contextualized word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-80",
"text": "This way, we first compute the gender subspace from the ELMo vector representations of genderdefining words, then identify the presence of direct bias in the contextualized representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-81",
"text": "We then proceed to identify gender information by means of clustering and classifications techniques."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-82",
"text": "We compare our results to previous results from debiased and non-debiased word embeddings (Bolukbasi et al., 2016) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-83",
"text": "Bolukbasi et al. (2016) propose to identify gender bias in word representations by computing the direction between representations of male and female word pairs from the Definitional List ( \u2212 \u2192 he-\u2212\u2192 she, \u2212\u2212\u2192 man-\u2212 \u2212\u2212\u2212\u2212 \u2192 woman) and computing their principal components."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-86",
"text": "In the case of contextualized embeddings, there is not just a single representation for each word, but its representation depends on the sentence it appears in."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-87",
"text": "This way, in order to compute the gender subspace we take the representation of words by randomly sampling sentences that contain words from the Definitional List and, for each of them, we swap the definitional word with its pair-wise equivalent from the opposite gender."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-88",
"text": "We then obtain the ELMo representation of the definintional word in each sentence pair, computing their difference."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-89",
"text": "On the set of difference vectors, we compute their principal components to verify the presence of bias."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-90",
"text": "In order to have a reference, we computed the principal components of representation of random words."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-91",
"text": "Similarly to Bolukbasi et al. (2016) , figure 1 shows that the first eigenvalue is significantly larger than the rest and that there is also a single direction describing the majority of variance in these vectors, still the difference between the percentage of variances is less in case of contextualized embeddings, which may refer that there is less bias in such embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-92",
"text": "We can easily note the difference in the case of random, where there is a smooth and gradual decrease in eigenvalues, and hence the variance percentage."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-93",
"text": "A similar conclusion was stated in the recent work (Zhao et al., 2019) where the authors applied the same approach, but for gender swapped variants of sentences with professions."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-94",
"text": "They computed the difference between the vectors of occupation words in corresponding sentences and got a skewed graph where the first component represent the gender information while the second component groups the male and female related words."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-95",
"text": "Direct Bias Direct Bias is a measure of how close a certain set of words are to the gender vector."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-96",
"text": "To compute it, we extracted from the training data the sentences that contain words in the Professional List."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-97",
"text": "We excluded the sentences that have both a professional token and definitional gender word to avoid the influence of the latter over the presence of bias in the former."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-98",
"text": "We applied the definition of direct bias from Bolukbasi et al. (2016) on the ELMo representations of the professional words in these sentences."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-99",
"text": "where N is the amount of gender neutral words, g the gender direction, and w the word vector of each profession."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-100",
"text": "We got direct bias of 0.03, compared to 0.08 from standard word2vec embeddings described in Bolukbasi et al. (2016) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-101",
"text": "This reduction on the direct bias confirms that the substantial component along the gender direction that is present in standard word embeddings is less for the contextualized word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-102",
"text": "Probably, this reduction comes from the fact that we are using different word embeddings for the same profession depending on the sentence which is a direct consequence and advantage of using contextualized embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-103",
"text": "Male and female-biased words clustering."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-104",
"text": "In order to study if biased male and female words cluster together when applying contextualized embeddings, we used k-means to generate 2 clusters of the embeddings of tokens from the Biased list."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-105",
"text": "Note that we can not use several representations for each word, since it would not make any sense to cluster one word as male and female at the same time."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-106",
"text": "Therefore, in order to make use of the advantages of the contextualized embeddings, we repeated 10 independent experiments, each with a different random sentence of each word from the list of biased male and female words."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-107",
"text": "Among these 10 experiments, we got a minimum accuracy of 69.1% and a maximum of 71.3%, with average accuracy of 70.1%, much lower than in the case of biased and debiased word embeddings which were 99.9 and 92.5, respectively, as stated in Gonen and Goldberg (2019) ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-108",
"text": "Based on this criterion, even if there is still bias information to be removed from contextualized embeddings, it is much less than in case of standard word embeddings, even if debiased."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-109",
"text": "The clusters (for one particular experiment out of the 10 of them) are shown in Figure 2 after applying UMAP to the contextualized embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-110",
"text": "Classification Approach In order to study if contextualized embeddings learn to generalize bias, we trained a Radial Basis Function-kernel Support Vector Machine classifier on the embeddings of random 1000 biased words from the Extended Biased List."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-111",
"text": "After that, we evaluated the generalization on the other random 4000 biased tokens."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-112",
"text": "Again, we performed 10 independent experiments, to guarantee randomization of word representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-113",
"text": "Among these 10 experiments, we got a minimum accuracy of 83.33% and a maximum of 88.43%, with average accuracy of 85.56%."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-114",
"text": "This number shows that the bias is learned in these embeddings with high rate."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-115",
"text": "However, it learns in less rate than the normal embeddings, whose classification reached 88.88% and 98.25% for biased and debiased versions, respectively."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-117",
"text": "**K-NEAREST NEIGHBOR APPROACH**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-118",
"text": "To understand more about the bias in contextualized embeddings, it is important to analyze the bias in the professions."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-119",
"text": "The question is whether these embeddings stereotype the professions as the normal embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-120",
"text": "This can be shown by the nearest neighbors of the female and male stereotyped professions, for example 'receptionist' and 'librarian' for female and 'architect' and 'philosopher' for male."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-121",
"text": "We applied the k nearest neighbors on the Professional List, to get the nearest k neighbor to each profession."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-122",
"text": "We used a random representation for each token of the profession list, after applying the k nearest neighbor algorithm on each profession, we computed the percentage of female and male stereotyped professions among the k nearest neighbor of each profession token."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-123",
"text": "Afterwards, we computed the Pearson correlation of this percentage with the original bias of each profession."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-124",
"text": "Once again, to assure randomization of tokens, we performed 10 experiments, each with different random sentences for each profession, therefore with different word representations."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-125",
"text": "The minimum Pearson correlation is 0.801 and the maximum is 0.961, with average of 0.89."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-126",
"text": "All these correlations are significant with p-values smaller than 1 \u00d7 10 \u221240 ."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-127",
"text": "This experiment showed the highest influence of bias compared to 0.606 for debiased embeddings and 0.774 for non-debiased."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-128",
"text": "Figure 3 demonstrates this influence of bias by showing that female biased words (e.g. nanny) has higher percent of female words than male ones and viceversa for male biased words (e.g. philosopher)."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-130",
"text": "**CONCLUSIONS AND FURTHER WORK**"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-131",
"text": "While our study can not draw clear conclusions on whether contextualized word embeddings augment or reduce the gender bias, our results show more insights of which aspects of the final contextualized word vectors get affected by such phenomena, with a tendency more towards reducing the gender bias rather than the contrary."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-132",
"text": "Contextualized word embeddings mitigate gender bias when measuring in the following aspects:"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-133",
"text": "1. Gender space, which is capturing the gender direction from word vectors, is reduced for gender specific contextualized word vectors compared to standard word vectors."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-134",
"text": "2. Direct bias, which is measuring how close set of words are to the gender vector, is lower for contextualized word embeddings than for standard ones."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-135",
"text": "3. Male/female clustering, which is produced between words with strong gender bias, is less strong than in debiased and non-debiased standard word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-136",
"text": "However, contextualized word embeddings preserve and even amplify gender bias when taking into account other aspects:"
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-137",
"text": "1. The implicit gender of words can be predicted with accuracies higher than 80% based on contextualized word vectors which is only a slightly lower accuracy than when using vectors from debiased and non-debiased standard word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-138",
"text": "2."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-139",
"text": "The stereotyped words group with implicitgender words of the same gender more than in the case of debiased and non-debiased standard word embeddings."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-140",
"text": "While all measures that we present exhibit certain gender bias, when evaluating future debiasing methods for contextualized word embeddings it would be worth it putting emphasis on the latter two evaluation measures that show higher bias than the first three."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-141",
"text": "Hopefully, our analysis will provide a grain of sand towards defining standard evaluation methods for gender bias, proposing effective debiasing methods or even directly designing equitable algorithms which automatically learn to ignore biased data."
},
{
"sent_id": "d70e69bb3eaa6b46ee3b7110126129-C001-142",
"text": "As further work, we plan to extend our study to multiple domains and multiple languages to analyze and measure the impact of gender bias present in contextualized embeddings in these different scenarios."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d70e69bb3eaa6b46ee3b7110126129-C001-14",
"d70e69bb3eaa6b46ee3b7110126129-C001-15"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-17",
"d70e69bb3eaa6b46ee3b7110126129-C001-18"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-19"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-37",
"d70e69bb3eaa6b46ee3b7110126129-C001-38"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-39",
"d70e69bb3eaa6b46ee3b7110126129-C001-40",
"d70e69bb3eaa6b46ee3b7110126129-C001-41",
"d70e69bb3eaa6b46ee3b7110126129-C001-42"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-45",
"d70e69bb3eaa6b46ee3b7110126129-C001-46",
"d70e69bb3eaa6b46ee3b7110126129-C001-47"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-82",
"d70e69bb3eaa6b46ee3b7110126129-C001-83"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-91",
"d70e69bb3eaa6b46ee3b7110126129-C001-92"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-100",
"d70e69bb3eaa6b46ee3b7110126129-C001-101",
"d70e69bb3eaa6b46ee3b7110126129-C001-102"
]
],
"cite_sentences": [
"d70e69bb3eaa6b46ee3b7110126129-C001-15",
"d70e69bb3eaa6b46ee3b7110126129-C001-18",
"d70e69bb3eaa6b46ee3b7110126129-C001-19",
"d70e69bb3eaa6b46ee3b7110126129-C001-38",
"d70e69bb3eaa6b46ee3b7110126129-C001-46",
"d70e69bb3eaa6b46ee3b7110126129-C001-82",
"d70e69bb3eaa6b46ee3b7110126129-C001-91",
"d70e69bb3eaa6b46ee3b7110126129-C001-100"
]
},
"@MOT@": {
"gold_contexts": [
[
"d70e69bb3eaa6b46ee3b7110126129-C001-17",
"d70e69bb3eaa6b46ee3b7110126129-C001-18"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-37",
"d70e69bb3eaa6b46ee3b7110126129-C001-38"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-45",
"d70e69bb3eaa6b46ee3b7110126129-C001-46",
"d70e69bb3eaa6b46ee3b7110126129-C001-47"
]
],
"cite_sentences": [
"d70e69bb3eaa6b46ee3b7110126129-C001-18",
"d70e69bb3eaa6b46ee3b7110126129-C001-38",
"d70e69bb3eaa6b46ee3b7110126129-C001-46"
]
},
"@DIF@": {
"gold_contexts": [
[
"d70e69bb3eaa6b46ee3b7110126129-C001-59"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-100",
"d70e69bb3eaa6b46ee3b7110126129-C001-101",
"d70e69bb3eaa6b46ee3b7110126129-C001-102"
]
],
"cite_sentences": [
"d70e69bb3eaa6b46ee3b7110126129-C001-59",
"d70e69bb3eaa6b46ee3b7110126129-C001-100"
]
},
"@USE@": {
"gold_contexts": [
[
"d70e69bb3eaa6b46ee3b7110126129-C001-66",
"d70e69bb3eaa6b46ee3b7110126129-C001-67",
"d70e69bb3eaa6b46ee3b7110126129-C001-68",
"d70e69bb3eaa6b46ee3b7110126129-C001-69",
"d70e69bb3eaa6b46ee3b7110126129-C001-70",
"d70e69bb3eaa6b46ee3b7110126129-C001-71"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-72"
]
],
"cite_sentences": [
"d70e69bb3eaa6b46ee3b7110126129-C001-66",
"d70e69bb3eaa6b46ee3b7110126129-C001-72"
]
},
"@EXT@": {
"gold_contexts": [
[
"d70e69bb3eaa6b46ee3b7110126129-C001-79"
]
],
"cite_sentences": [
"d70e69bb3eaa6b46ee3b7110126129-C001-79"
]
},
"@SIM@": {
"gold_contexts": [
[
"d70e69bb3eaa6b46ee3b7110126129-C001-91",
"d70e69bb3eaa6b46ee3b7110126129-C001-92"
],
[
"d70e69bb3eaa6b46ee3b7110126129-C001-98",
"d70e69bb3eaa6b46ee3b7110126129-C001-99"
]
],
"cite_sentences": [
"d70e69bb3eaa6b46ee3b7110126129-C001-91",
"d70e69bb3eaa6b46ee3b7110126129-C001-98"
]
}
}
},
"ABC_3ec2dc9530699f55b8a4c234532daf_6": {
"x": [
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-2",
"text": "In this paper, we study the problem of employing pre-trained language models for multi-turn response selection in retrieval-based chatbots."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-3",
"text": "A new model, named Speaker-Aware BERT (SA-BERT), is proposed in order to make the model aware of the speaker change information, which is an important and intrinsic property of multi-turn dialogues."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-4",
"text": "Furthermore, a speaker-aware disentanglement strategy is proposed to tackle the entangled dialogues."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-5",
"text": "This strategy selects a small number of most important utterances as the filtered context according to the speakers' information in them."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-6",
"text": "Finally, domain adaptation is performed in order to incorporate the indomain knowledge into pre-trained language models."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-7",
"text": "Experiments on five public datasets show that our proposed model outperforms the present models on all metrics by large margins and achieves new state-of-the-art performances for multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-10",
"text": "Chatbots aim to engage users in open-domain human-computer conversations and are currently receiving increasing attention."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-11",
"text": "The existing work on building chatbots includes generation-based methods and retrieval-based methods."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-12",
"text": "The first type of methods synthesize a response with a natural language generation model (Shang et al., 2015; Serban et al., 2016; ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-13",
"text": "In this paper, we focus on the second type and study the problem of multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-14",
"text": "This task aims to select the best-matched response from a set of candidates, given the context of a conversation which is composed of multiple utterances (Lowe et al., 2015; Lowe et al., 2017; Wu et al., 2017 )."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-15",
"text": "An example of this task is illustrated in Table 1 ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-17",
"text": "**RELATED WORK**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-18",
"text": "The existing methods used to build an open domain dialogue system can be generally categorized into generation-based methods and retrieval-based methods."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-19",
"text": "The generation-based methods synthesize a response with a natural language generation model by maximizing its generation probability given the previous conversation context."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-20",
"text": "This approach enables the incorporation of rich context when mapping between consecutive dialogue turns (Shang et al., 2015; Serban et al., 2016; ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-21",
"text": "Recently, some extended work has been made to incorporate external knowledge into generation with specific personas or emotions (Li et al., 2016; Zhou et al., 2018a) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-22",
"text": "Our work belongs to the retrieval-based methods, which learn a matching model for a pair of a conversational context and a response candidate."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-23",
"text": "This approach has the advantage of providing informative and fluent responses because they select a proper response for the current conversation from a repository by means of response selection algorithms (Lowe et al., 2015; Lowe et al., 2017 EA EA EA EA EA EA EA EA EA EA EA EA EA EA EA EA EA EA EA EB EB EB EB EB E0 E1 E2 E3 E4 E5 E18 E19 E6 E7 E8 E9 E10 E11 E17 E12 E13 E14 E15 E16 E20 E21 E22 E23 E24 E25 [ E0 E0 E0 E0 E0 E0 E0 E0 E1 E1 E1 E1 E1 E1 E0 E1 E1 E1 E1 E1 E0 E1 E1 E1 E1 E1 Speaker Embeddings + + + + + + + + + + + + + + + + + + + + + + + + + + Figure 1 : The input representation of SA-BERT."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-24",
"text": "The final input embeddings are the sum of the token embeddings, the segmentation embeddings, the position embeddings and the speaker embeddings."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-25",
"text": "Wu et al., 2017; Zhang et al., 2018b) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-26",
"text": "Previous work on retrieval-based chatbots focused on singleturn response selection (Wang et al., 2013; Ji et al., 2014) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-27",
"text": "Recently, researchers have extended the focus to the multi-turn conversation, which is more practical for real applications."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-28",
"text": "Some earlier work on multi-turn response selection matched a response with concatenating the context utterances literally into a single long sequence, and calculating its matching score with a response candidate (Lowe et al., 2015; Kadlec et al., 2015; Lowe et al., 2017) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-29",
"text": "Recent work has kept utterances separated and performed matching within a representation-interaction-aggregation framework, which improved the performance on this task."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-30",
"text": "For example, Zhou et al. (2016) proposed a multi-view model, including an utterance view and a word view."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-31",
"text": "Wu et al. (2017) proposed the sequential matching network (SMN) which first matched the response with each utterance and then accumulated the matching information by recurrent neural network."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-32",
"text": "Zhang et al. (2018b) proposed the deep utterance aggregation network (DUA) which refined utterances and employed self-matching attention to route the vital information in each utterance."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-33",
"text": "Zhou et al. (2018b) proposed the deep attention matching network (DAM) which constructed representations at different granularities with stacked self-attention and cross-attention."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-34",
"text": "Tao et al. (2019a) proposed the multi-representation fusion network (MRFN) with multiple types of representations."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-35",
"text": "Gu et al. (2019) proposed the interactive matching network (IMN) which performed the global and bidirectional interactions between the context and response."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-36",
"text": "Gu et al. (2020) proposed the utterance-to-utterance interactive matching network (U2U-IMN) which treated both contexts and responses as sequences of utterances when calculating the matching degrees between them."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-37",
"text": "Tao et al. (2019b) proposed the interaction over interaction (IOI) model which performed matching by stacking multiple interaction blocks."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-38",
"text": "Yuan et al. (2019) proposed the multi-hop selector network (MSN) which utilized a multi-hop selector to select the relevant utterances as context."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-39",
"text": "Henderson et al. (2019) made the first attempt to employ pre-trained language models for multi-turn response selection which concatenated the context utterances and the response literally and sent into the model for classification."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-41",
"text": "**TASK DEFINITION**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-42",
"text": "Given a dialogue dataset D, an example of the dataset is denoted as (c, r, y), where c = {u 1 , u 2 , ..., u n } represents a conversation context with {u k } n k=1 as the utterances, r is a response candidate, and y \u2208 {0, 1} denotes a label."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-43",
"text": "Specifically, y = 1 indicates that r is a proper response for c; otherwise y = 0."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-44",
"text": "Our goal is to learn a matching model g(c, r) by minimizing a cross-entropy loss function from D. For any context-response pair (c, r), g(c, r) measures the matching degree between c and r. Let \u0398 denote the parameters of model g(c, r)."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-45",
"text": "Then, the loss function L(D, \u0398) for learning can be formulated as"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-46",
"text": "(1)"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-48",
"text": "**METHODOLOGY**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-49",
"text": "We present here our proposed model, named Speaker-Aware BERT (SA-BERT), and a visual architecture of our input representation is illustrated in Figure 1 ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-50",
"text": "Due to limited space, we omit an exhaustive background description of BERT."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-51",
"text": "Readers can refer to (Devlin et al., 2019) for details."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-53",
"text": "**SPEAKER EMBEDDINGS & SEGMENTATION TOKENS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-54",
"text": "To represent a pair of sentence A and sentence B, the original BERT concatenates this pair of sentence with a [SEP] token."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-55",
"text": "For a given token, its input representation of the original BERT is constructed by summing the corresponding token, segment and position embeddings."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-56",
"text": "In order to distinguish utterances in a context and model the speaker change in turn as the conversation progresses, we use two strategies to construct the input sequence for multi-turn response selection as follows."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-57",
"text": "First, in order to model the speaker change, we propose to add additional speaker embeddings to token representations."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-58",
"text": "The embedding functions as indicating the speaker's identity for each utterance."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-59",
"text": "For conversations with two speakers, two speaker embedding vectors need to be estimated during the training process."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-60",
"text": "The first vector is added to each token of the utterances in the first conversation turn."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-61",
"text": "When the speaker changes, the second vector is employed."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-62",
"text": "This is performed alternatively and can be extended to conversations with more speakers."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-63",
"text": "Second, empirical results in Dong and Huang (2018) show that segmentation tokens play an important role for multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-64",
"text": "To model conversation, it is natural to extend that to further model turns and utterances."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-65",
"text": "In this work we propose and empirically show that using an [EOU] token at the end of an utterance and an [EOT] token at the end of a turn model interactions between utterances in a context implicitly and improve the performance consistently."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-67",
"text": "**SPEAKER-AWARE DISENTANGLEMENT STRATEGY**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-68",
"text": "When a group of people communicate in a common channel there are often multiple conversation topics occurring concurrently."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-69",
"text": "In terms of a specific conversation topic, utterances relevant to it are useful and other utterances could be considered as noise for them."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-70",
"text": "Note that BERT is not good at dealing with sequences which are composed of more tokens than the limit (i.e., maximum length of time steps is set to be 512)."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-71",
"text": "In order to select a small number of most important utterances, in this paper, we propose a heuristic speaker-aware disentanglement strategy as follows."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-72",
"text": "First, we define the speaker who is uttering an utterance as the spoken-from speaker, and define the speaker who is receiving an utterance as the spoken-to speaker."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-73",
"text": "Each utterance usually has the labels of both spoken-from and spoken-to speakers. But some utterances may have only the spoken-from speaker label while the spoken-to speaker is unknown which is set to None in our experiments."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-74",
"text": "Second, given the spoken-from speaker of the response, we select the utterances which have the same spoken-from or spoken-to speaker as the spoken-from speaker of the response."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-75",
"text": "Third, these selected utterances are then organized in their original chronological order and used to form the filtered context."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-76",
"text": "Finally, the utterances selected according to their spoken-from or spoken-to speaker labels are assigned with the two speaker embedding vectors respectively."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-78",
"text": "**MULTI-TASK LEARNING FOR DOMAIN ADAPTATION**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-79",
"text": "The original BERT is trained on a large text corpus to learn general language representations."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-80",
"text": "To incorporate specific in-domain knowledge, adaptation on in-domain corpora are designed."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-81",
"text": "In our experiments, we employ the training set of each dataset for domain adaptation without additional external knowledge."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-82",
"text": "Furthermore, domain adaptation is done by performing the multi-task learning that optimizing a combination of two loss functions: (i) a next sentence prediction (NSP) loss, and (ii) a masked language model (MLM) loss (Devlin et al., 2019) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-83",
"text": "MLM We follow the experimental settings in the original BERT by masking some percentage of the input tokens at random and then predicting only those masked tokens to train a deep bidirectional representation."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-84",
"text": "In more detail, we replace the word with the [MASK] token at 80% of the time, with a random word at 10% of the time, and with the original word at 10% of the time."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-85",
"text": "NSP Here, the sentence A and sentence B are constructed with the same method as that used in the fine-tuning process."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-86",
"text": "The positive responses are true responses that follow the context, and the negative responses are randomly sampled."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-87",
"text": "The embedding of the [CLS] token is used as the aggregated representation for classification."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-88",
"text": "Specifically, the speaker embeddings can be pre-trained in the task of NSP."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-89",
"text": "If there is no any adaptation processes, the speaker embeddings have to be initialized randomly at the beginning of the fine-tuning process."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-91",
"text": "**OUTPUT REPRESENTATION**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-92",
"text": "The first token of each concatenated sequence is the [CLS] token, with its embedding being used as the aggregated representation for a context-response pair classification."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-93",
"text": "This embedding captures the matching information between a context-response pair, which is sent into a classifier with a sigmoid output layer."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-94",
"text": "Parameters of this classifier need to be estimated during the fine-tuning process."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-95",
"text": "Finally, the classifier returns a score to denote the matching degree of this context-response pair."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-96",
"text": "We tested SA-BERT on five public multi-turn response selection datasets, Ubuntu Dialogue Corpus V1 (Lowe et al., 2015) , Ubuntu Dialogue Corpus V2 (Lowe et al., 2017) , Douban Conversation Corpus (Wu et al., 2017) , E-commerce Dialogue Corpus (Zhang et al., 2018b) and DSTC 8-Track 2-Subtask 2 Corpus (Seokhwan Kim, 2019)."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-97",
"text": "The first four datasets have been disentangled in advance and our proposed speaker-aware disentanglement strategy has been applied to only the last DSTC 8-Track 2-Subtask 2 Corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-98",
"text": "Ubuntu Dialogue Corpus V1, V2 and DSTC 8-Track 2-Subtask 2 Corpus contain multi-turn dialogues about Ubuntu system troubleshooting in English."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-99",
"text": "Here, we adopted the version of Ubuntu Dialogue Corpus V1 shared in , in which numbers, paths and URLs were replaced by placeholders."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-100",
"text": "Compared with Ubuntu Dialogue Corpus V1, the training, validation and test dialogues in the V2 dataset were generated in different periods without overlap."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-101",
"text": "In the DSTC 8-Track 2-Subtask 2 Corpus, the candidate pool may not contain the correct response, so we need to choose a threshold."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-102",
"text": "When the probability of positive labels was smaller than the threshold, we predicted that candidate pool did not contain the correct response."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-103",
"text": "The threshold was selected among [0.6, 0.65, .., 0.95] based on the validation set."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-104",
"text": "In all of the Ubuntu corpora, the positive responses are true responses from humans, and the negative responses are randomly sampled."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-105",
"text": "The Douban Conversation Corpus was crawled from a Chinese social network on open-domain topics."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-106",
"text": "It was constructed in a similar way to the Ubuntu corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-107",
"text": "The Douban Conversation Corpus collected responses via a small inverted-index system, and labels were manually annotated."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-108",
"text": "The Douban Conversation Corpus is different from the other three datasets in that it includes multiple correct candidates for a context in the test set, which leads to low R n @k, e.g., if there are 3 correct responses, the maximum R 10 @1 is 0.33."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-109",
"text": "Hence, MAP and MRR are recommended for reference."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-110",
"text": "The E-commerce Dialogue Corpus collected real-world conversations between customers and customer service staff from the largest e-commerce platform in China."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-111",
"text": "The DSTC 8-Track 2-Subtask 2 Corpus does not release the labels of the test set."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-112",
"text": "Participants should submit their results on the test set to the official and then be evaluated by them."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-113",
"text": "Thus, we submitted only one result to the official and we provide other results on the validation set for reference."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-114",
"text": "Some statistics of these datasets are provided in Table 2 ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-117",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-119",
"text": "**EVALUATION METRICS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-120",
"text": "We used the same evaluation metrics as those used in previous work (Lowe et al., 2015; Lowe et al., 2017; Wu et al., 2017; Zhang et al., 2018b; Seokhwan Kim, 2019) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-121",
"text": "Each model was tasked with selecting the k best-matched responses from n available candidates for the given conversation context c, and we calculated the recall of the true positive replies among the k selected responses, denoted as R n @k, as the main evaluation metric."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-122",
"text": "In addition to R n @k, we considered the mean average precision (MAP) (BaezaYates and Ribeiro-Neto, 1999), mean reciprocal rank (MRR) (Voorhees, 1999) and precision-at-one (P@1), especially for the Douban corpus, following the settings of previous work."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-124",
"text": "**TRAINING DETAILS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-125",
"text": "In our experiments, the base version of BERT was adopted."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-126",
"text": "Most hyper-parameters of the original BERT were followed except the following configurations."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-127",
"text": "The initial learning rate was set to 2e-5 and was linearly decayed by L2 weight decay."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-128",
"text": "The maximum sequence length of the concatenation of a context-response pair was set to 512."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-129",
"text": "The training batch size was set to 25."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-130",
"text": "The maximum number of training epochs was set to 3."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-131",
"text": "We used the validation set to set the stop condition in order to select the best model for testing."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-132",
"text": "All codes were implemented in the TensorFlow framework (Abadi et al., 2016) and will be published to help replicate our results after paper acceptance."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-135",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-137",
"text": "**UBUNTU CORPUS V1**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-138",
"text": "R 2 @1 R 10 @1 R 10 @2 R 10 @5 TF-IDF (Lowe et al., 2015) 0.659 0.410 0.545 0.708 RNN (Lowe et al., 2015) 0.768 0.403 0.547 0.819 LSTM (Lowe et al., 2015) 0.878 0.604 0.745 0.926 DL2R 0.899 0.626 0.783 0.944 Match-LSTM (Wang and Jiang, 2016b) (Zhou et al., 2016) 0.908 0.662 0.801 0.951 CompAgg (Wang and Jiang, 2016a) 0.884 0.631 0.753 0.927 BiMPM (Wang et al., 2017) 0.897 0.665 0.786 0.938 HRDE-LTC (Yoon et al., 2018) 0.916 0.684 0.822 0.960 SMN (Wu et al., 2017) 0.926 0.726 0.847 0.961 DUA (Zhang et al., 2018b) -0.752 0.868 0.962 DAM (Zhou et al., 2018b) 0.938 0.767 0.874 0.969 MRFN (Tao et al., 2019a) 0.945 0.786 0.886 0.976 IMN (Gu et al., 2019) 0.946 0.794 0.889 0.974 IOI (Tao et al., 2019b) 0 Table 3 : Evaluation results of SA-BERT and previous methods on Ubuntu Dialogue Corpus V1."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-139",
"text": "Table 3, Table 4 , Table 5 , Table 6 and Table 7 present the evaluation results of SA-BERT and previous methods on the five datasets respectively."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-140",
"text": "All the results except ours are from the existing literature."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-141",
"text": "Due to previous methods did not make use of pre-trained language models, we reproduced the results of BERT baseline by fine-tuning on the training set for reference, denoted as BERT for fair comparisons."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-142",
"text": "As we can see that, BERT has already outperformed the present models on most metrics, except R 10 @5 Ubuntu Corpus V2 R 2 @1 R 10 @1 R 10 @2 R 10 @5 TF-IDF (Lowe et al., 2017) 0.749 0.488 0.587 0.763 RNN (Lowe et al., 2017) 0.777 0.379 0.561 0.836 LSTM (Lowe et al., 2017) 0.869 0.552 0.721 0.924 RNN-CNN (Baudis and Sediv\u00fd, 2016) (Wang and Jiang, 2016a) 0.895 0.641 0.776 0.937 BiMPM (Wang et al., 2017) 0.877 0.611 0.747 0.921 HRDE-LTC (Yoon et al., 2018) 0.915 0.652 0.815 0.966 U2U-IMN (Gu et al., 2020) 0.943 0.762 0.877 0.975 IMN (Gu et al., 2019) 0 Table 4 : Evaluation results of SA-BERT and previous methods on Ubuntu Dialogue Corpus V2."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-144",
"text": "**DOUBAN CONVERSATION CORPUS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-145",
"text": "MAP MRR P@1 R 10 @1 R 10 @2 R 10 @5 TF-IDF (Lowe et al., 2015) 0.331 0.359 0.180 0.096 0.172 0.405 RNN (Lowe et al., 2015) 0.390 0.422 0.208 0.118 0.223 0.589 LSTM (Lowe et al., 2015) 0.485 0.527 0.320 0.187 0.343 0.720 Multi-View (Zhou et al., 2016) 0 Table 5 : Evaluation results of SA-BERT and previous methods on the Douban Corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-146",
"text": "on Ubuntu Dialogue Corpus V1 and R 10 @1 on E-commerce Corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-147",
"text": "Furthermore, our proposed SA-BERT outperforms the other models on all metrics and datasets, which demonstrates its ability to select the best-matched response and its compatibility across domains (system troubleshooting, social network and e-commerce)."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-148",
"text": "These results show that our proposed SA-BERT has achieved a new state-of-the-art performance for multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-149",
"text": "In more detail, SA-BERT outperformed the present state-of-the-art performance by large margins of 5.5% R 10 @1 on Ubuntu Dialogue Corpus V1, 5.9% R 10 @1 on Ubuntu Dialogue Corpus V2, 3.2% MAP and 2.7% MRR on Douban Conversation Corpus, 8.3% R 10 @1 on E-commerce Corpus, and 15.5% R 100 @1 on DSTC 8-Track 2-Subtask 2 Corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-150",
"text": "Compared with BERT, SA-BERT outperformed it by large margins of 4.7% R 10 @1 on Ubuntu Dialogue Corpus V1, 4.9% R 10 @1 on Ubuntu Dialogue Corpus V2, 2.8% MAP and 2.6% MRR on Douban Conversation Corpus, 9.4% R 10 @1 on E-commerce Corpus, and 21.9% on DSTC 8-Track 2-Subtask 2 Corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-151",
"text": "These results show that our proposed SA-BERT has achieved a new state-of-the-art performance on all datasets for multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-153",
"text": "**E-COMMERCE CORPUS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-154",
"text": "R 10 @1 R 10 @2 R 10 @5 TF-IDF (Lowe et al., 2015) 0.159 0.256 0.477 RNN (Lowe et al., 2015) 0.325 0.463 0.775 LSTM (Lowe et al., 2015) 0.365 0.536 0.828 Multi-View (Zhou et al., 2016) 0.421 0.601 0.861 DL2R 0.399 0.571 0.842 MV-LSTM (Wan et al., 2016) 0.412 0.591 0.857 Match-LSTM (Wang and Jiang, 2016b) 0.410 0.590 0.858 SMN (Wu et al., 2017) 0.453 0.654 0.886 DUA (Zhang et al., 2018b) 0.501 0.700 0.921 DAM (Zhou et al., 2018b) 0.526 0.727 0.933 IOI (Tao et al., 2019b) 0.563 0.768 0.950 MSN (Yuan et al., 2019) 0.606 0.770 0.937 IMN (Gu et al., 2019) 0 Table 7 : Evaluation results of SA-BERT and ablation tests of the speaker-aware disentanglement strategy (SDS) on the DSTC 8-Track 2-Subtask 2 Corpus."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-156",
"text": "**ANALYSIS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-157",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-158",
"text": "**ADAPTATION CORPUS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-159",
"text": "We make some further analysis on the effect of adaptation corpus to the performance of multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-160",
"text": "We performed the adaptation process with the same domain but different sets."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-161",
"text": "Here, three different sets of Ubuntu were employed: DSTC 8-Track 2, Ubuntu Dialogue Corpus V1, and Ubuntu Dialogue Corpus V2."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-162",
"text": "And then the fine-tuning process was all performed on the training set of Ubuntu Dialogue Corpus V2."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-163",
"text": "The results on the test set of Ubuntu Dialogue Corpus V2 were shown in Table 8 ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-164",
"text": "As we can see that, the adaptation process can help to improve the performance no matter which adaptation corpus was used."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-165",
"text": "Furthermore, adaptation and fine-tuning on the same corpus achieved the best performance."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-166",
"text": "One explanation may be that although pre-trained language models are designed to provide general linguistic knowledge, some domain-specific knowledge is also necessary for a specific task."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-167",
"text": "Thus, adaptation on a domain-specific corpus can help to incorporate more domain-specific knowledge, and the more similar to the task this adaptation corpus is, the more improvement it can help to achieve."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-169",
"text": "**SPEAKER EMBEDDINGS**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-170",
"text": "Adaptation corpus R 2 @1 R 10 @1 R 10 @2 R 10 @5 The speaker embeddings were ablated and the results were reported in Table 9 ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-171",
"text": "The first two lines discussed the situation in which the adaptation process were omitted, and the last two lines discussed the adaptation process were equipped with."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-172",
"text": "The performance drop verified the effectiveness of speaker embeddings."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-173",
"text": "First, without the pre-training process, the speaker embeddings were initialized at random which would be updated during the fine-tuning process."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-174",
"text": "It can be seen that adding the speaker embeddings only during the fine-tuning process can provide an improvement of 0.5% in terms of R 10 @1, which shows its effectiveness for modelling the speaker change during the conversation."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-175",
"text": "Furthermore, we could observe the similar results with the pre-training process included, which verified the effectiveness of our method again."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-177",
"text": "**SPEAKER-AWARE DISENTANGLEMENT STRATEGY**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-178",
"text": "To show the effectiveness of the speaker-aware disentanglement strategy, we also applied it to the existing model, such as IMN (Gu et al., 2019) ."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-179",
"text": "The original IMN did not employ any disentanglement strategy and selected the last 70 utterances as the context, which achieved a performance of 32.2% R 100 @1."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-180",
"text": "After employing the strategy, about 25 utterances were selected to form the context, which achieved a performance of 37.5% R 100 @1."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-181",
"text": "Similar results can also be observed by employing this strategy to BERT and ablating this strategy in SA-BERT, as shown in Table 7 , which verified the effectiveness of the speaker-aware disentanglement strategy again."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-183",
"text": "**CONCLUSION**"
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-184",
"text": "In this paper, we study the problem of employing pre-trained language models for multi-turn response selection in retrieval-based chatbots."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-185",
"text": "A speaker-aware BERT model is proposed to improve BERT by adding speaker embeddings, introducing a speaker-aware disentanglement strategy and adapting to the specific domain."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-186",
"text": "Experiments on five public datasets show that our proposed method achieves new stateof-the-art performances for multi-turn response selection."
},
{
"sent_id": "3ec2dc9530699f55b8a4c234532daf-C001-187",
"text": "Adjusting pre-trained language models to fit multi-turn response selection and designing new disentanglement strategies will be a part of our future work."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"3ec2dc9530699f55b8a4c234532daf-C001-10",
"3ec2dc9530699f55b8a4c234532daf-C001-11",
"3ec2dc9530699f55b8a4c234532daf-C001-13",
"3ec2dc9530699f55b8a4c234532daf-C001-14"
],
[
"3ec2dc9530699f55b8a4c234532daf-C001-22",
"3ec2dc9530699f55b8a4c234532daf-C001-23"
],
[
"3ec2dc9530699f55b8a4c234532daf-C001-96"
]
],
"cite_sentences": [
"3ec2dc9530699f55b8a4c234532daf-C001-14",
"3ec2dc9530699f55b8a4c234532daf-C001-23",
"3ec2dc9530699f55b8a4c234532daf-C001-96"
]
},
"@SIM@": {
"gold_contexts": [
[
"3ec2dc9530699f55b8a4c234532daf-C001-22",
"3ec2dc9530699f55b8a4c234532daf-C001-23"
]
],
"cite_sentences": [
"3ec2dc9530699f55b8a4c234532daf-C001-23"
]
},
"@USE@": {
"gold_contexts": [
[
"3ec2dc9530699f55b8a4c234532daf-C001-96"
],
[
"3ec2dc9530699f55b8a4c234532daf-C001-120"
]
],
"cite_sentences": [
"3ec2dc9530699f55b8a4c234532daf-C001-96",
"3ec2dc9530699f55b8a4c234532daf-C001-120"
]
}
}
},
"ABC_c2a956a6ae0fb1ab338da01e5a5645_6": {
"x": [
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-2",
"text": "Previous studies have highlighted the necessity for entity linking systems to capture the local entity-mention similarities and the global topical coherence."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-3",
"text": "We introduce a novel framework based on convolutional neural networks and recurrent neural networks to simultaneously model the local and global features for entity linking."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-4",
"text": "The proposed model benefits from the capacity of convolutional neural networks to induce the underlying representations for local contexts and the advantage of recurrent neural networks to adaptively compress variable length sequences of predictions for global constraints."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-5",
"text": "Our evaluation on multiple datasets demonstrates the effectiveness of the model and yields the state-of-the-art performance on such datasets."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-6",
"text": "In addition, we examine the entity linking systems on the domain adaptation setting that further demonstrates the cross-domain robustness of the proposed model."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-9",
"text": "We address the problem of entity linking (EL): mapping entity mentions in documents to their correct entries (called target entities) in some existing knowledge bases (KB) like Wikipedia."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-10",
"text": "For instance, in the sentence \"Liverpool suffered an upset first home league defeat of the season.\", an entity linking system should be able to identify the entity mention \"Liverpool\" as a football club rather than a city in England in the knowledge bases."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-11",
"text": "This is a challenging problem of natural language processing, as the same entity might be presented in various names, and the same entity mention string might refer to different entities in different contexts."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-12",
"text": "Entity linking is a fundamental task for other applications such as information extraction, knowledge base construction etc."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-13",
"text": "In order to tackle the ambiguity in EL, previous studies have first generated a set of target entities in the knowledge bases as the referent candidates for each entity mention in the documents, and then solved a ranking problem to disambiguate the entity mention."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-14",
"text": "The key challenge in this paradigm is the ranking model that computes the relevance of each target entity candidate to the corresponding entity mention using the available context information in both the documents and the knowledge bases."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-15",
"text": "The early approach for the ranking problem in EL has resolved the entity mentions in documents independently (the local approach), utilizing various discrete and hand-designed features/heuristics to measure the local mention-to-entity relatedness for ranking."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-16",
"text": "These features are often specific to each entity mention and candidate entity, covering a wide range of linguistic and/or structured representations such as lexical and part-of-speech tags of context words, dependency paths, topical features, KB infoboxes (Bunescu and Pasca, 2006; Mendes et al., 2011; Cassidy et al., 2011; Ji and Grishman, 2011; Shen et al., 2014) etc."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-17",
"text": "Although the local approach can exploit a rich set of discrete structures for EL, its limitation is twofold:"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-18",
"text": "(i) The independent ranking mechanism in the local approach overlooks the topical coherence among the target entities referred by the entity mentions within the same document."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-19",
"text": "This is undesirable as the topical coherence has been shown to be effective for EL in the previous work (Han et al., This work is licensed under a Creative Commons Attribution 4.0 International License."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-20",
"text": "License details: http:// creativecommons.org/licenses/by/4.0/ 2011; Hoffart et al., 2011; He et al., 2013b; Alhelbawy and Gaizauskas, 2014; Pershina et al., 2015) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-38",
"text": "The role of the RNNs is to accumulate information about the previous entity mentions and target entities, and provide them as the global constraints for the linking process of the current entity mention."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-21",
"text": "(ii) The local approach might suffer from the data sparseness issue of unseen words/features, the difficulty of calibrating, and the failure to induce the underlying similarity structures at high levels of abstraction for EL (due to the extensive reliance on the hand-designed coarse features) (Sun et al., 2015; Francis-Landau et al., 2016) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-22",
"text": "The first drawback of the local approach has been overcome by the global models in which all entity mentions (or a group of entity mentions) within a document are disambiguated simultaneously to obtain a coherent set of target entities."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-23",
"text": "The central idea is that the referent entities of some mentions in a document might in turn introduce useful information to link other mentions in that document due to the semantic relatedness among them."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-24",
"text": "For example, the appearances of \"Manchester\" and \"Chelsea\" as the football clubs in a document would make it more likely that the entity mention \"Liverpool\" in the same document is also a football club."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-25",
"text": "Unfortunately, the coherence assumption of the global approach does not hold in some situations, necessitating the discrete/coarse features in the local approach as a mechanism to compensate for the potential exceptions of the coherence assumption Hoffart et al., 2011; Sil et al., 2012; Durrett and Klein, 2014; Pershina et al., 2015) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-26",
"text": "Consequently, the global approach is still subject to the second limitation of data sparseness of the local approach due to their use of discrete features."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-27",
"text": "Recently, the surge of neural network (NN) models has presented an effective mechanism to mitigate the second limitation of the local approach."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-28",
"text": "In such models, words are represented by continuous representations (Bengio et al., 2003; Turian et al., 2010; Mikolov et al., 2013) and features for the entity mentions and candidate entities are automatically learnt from data."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-29",
"text": "This essentially alleviates the data sparseness problem of unseen words/features and helps to extract more effective features for EL in a given dataset (Kalchbrenner et al., 2014; Nguyen et al., 2016a) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-30",
"text": "In practice, the features automatically induced by NN are combined with the discrete features in the local approach to extend their coverage for EL (Sun et al., 2015; Francis-Landau et al., 2016) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-31",
"text": "However, as the previous NN models for EL are local, they cannot capture the global interdependence among the target entities in the same document (the first limitation of the local approach)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-32",
"text": "Guided by these analyses, in this paper, we propose to use neural networks to model both the local mention-to-entity similarities and the global relatedness among target entities in an unified architecture."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-33",
"text": "This allows us to inherit all the benefits from the previous systems as well as overcome their inherent issues."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-34",
"text": "Our work is an extension of (Francis-Landau et al., 2016) which only considers the local similarities."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-35",
"text": "Given a document, we simultaneously perform linking for every entity mention from the beginning to the end of the document."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-36",
"text": "For each entity mention, we utilize convolutional neural networks (CNN) to obtain the distributed representations for the entity mention as well as its target candidates."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-37",
"text": "These distributed representations are then used for two purposes: (i) computing the local similarities for the entity mention and target candidates, and (ii) functioning as the input for the recurrent neural networks (RNN) that runs over the entity mentions in the documents."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-39",
"text": "We systematically evaluate the proposed model on multiple datasets in both the general setting and the domain adaptation setting."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-40",
"text": "The experiment results show that the proposed model outperforms the current state-of-the-art models on the evaluated datasets."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-41",
"text": "To our knowledge, this is also the first work investigating the EL problem in the domain adaptation setting."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-43",
"text": "**MODEL**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-44",
"text": "The entity linking problem in this work can be formalized as follows."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-45",
"text": "Let D be the input document and M = {m 1 , m 2 , . . . , m k } be the entity mentions in D. The goal is to map each entity mention m i to its corresponding Wikipedia page (entity) or return \"NIL\" if m i is not present in Wikipedia."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-46",
"text": "For each entity"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-47",
"text": "for the correct entity pages."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-48",
"text": "Again, t ij , b ij , t * i and b * i are also sequences of words."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-49",
"text": "In order to link the entity mentions, the strategy is to assign a relevance score \u03c6(m i , p ij ) for each target candidate p ij of m i , and then use these scores to rank the candidates for each mention."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-50",
"text": "In this work, we decompose \u03c6(m i , p ij ) as the sum of the two following factors:"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-51",
"text": "In this formula, \u03c6 local (m i , p ij ) represents the local similarities between m i and p ij , i.e, only using the information related to m i and p ij ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-52",
"text": "\u03c6 global (m 1 , m 2 , . . . , m i , P 1 , P 2 , . . . , P i ), on the other hand, additionally considers the other mentions and candidates in the document, attempting to model the interdependence among these objects."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-53",
"text": "The denotation \u03c6 global (m 1 , m 2 , . . . , m i , P 1 , P 2 , . . . , P i ) implies that we are computing the ranking scores for all the target candidates of all the entity mentions in each document simultaneously, preserving the order of the entity mentions from the beginning to the end of the document."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-54",
"text": "The model in this work consists of three main components: (i) the encoding component that applies convolutional neural networks to induce the distributed representations for the input sequences s i , c i , d i , t ij , and b ij , (ii) the local component that computes the local similarities \u03c6 local (m i , p ij ) for each entity mention m i , and (iii) the global component that runs recurrent neural networks on the entity mentions {m 1 , m 2 , . . . , m k } to generate the global features \u03c6 global (m 1 , m 2 , . . . , m i , P 1 , P 2 , . . . , P i )."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-56",
"text": "**ENCODING**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-57",
"text": "Let x be some context word sequence of the entity mentions or target candidates (i.e, x \u2208 {s i , c i ,"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-58",
"text": "."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-59",
"text": "In order to obtain the distributed representation for x, we first transform each word x i \u2208 x into a real-valued, h-dimensional vector w i using the word embedding table E (Mikolov et al., 2013) :"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-60",
"text": "."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-61",
"text": "This essentially converts the word sequence x into a sequence of vectors that is padded with zero vectors to form a fixed-length sequence of vectors w = (w 1 , w 2 , . . . , w n ) of length n."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-62",
"text": "In the next step, we apply the convolution operation over w to generate the hidden vector sequence, that is then transformed by a non-linear function G and pooled by the sum function (Francis-Landau et al., 2016) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-63",
"text": "Following the previous work on CNN (Nguyen and Grishman, (2015a; 2015b) ), we utilize the set L of multiple window sizes to parameterize the convolution operation."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-64",
"text": "Each window size l \u2208 L corresponds to a convolution matrix M l \u2208 R v\u00d7lh of dimensionality v. Eventually, the concatenation vectorx of the resulting vectors for each window size in L would be used as the distributed representation for"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-65",
"text": "where is the concatenation operation over the window set L and w i:(i+l\u22121) is the concatenation vector of the given word vectors."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-66",
"text": "For convenience, lets i ,c i ,"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-67",
"text": "i and b * i obtained by the convolution procedure above, respectively."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-68",
"text": "Note that we apply the same set of convolution parameters for each type of text granularity in the source document D as well as in the target entity side."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-69",
"text": "The vector representations of the context would then be fed into the next components to compute the features for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-71",
"text": "**LOCAL SIMILARITIES**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-72",
"text": "We employ the local similarities \u03c6 local (m i , p ij ) from (Francis-Landau et al., 2016) , the state-of-the-art neural network model for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-73",
"text": "In particular:"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-74",
"text": "In this formula, W sparse and W CN N are the weights for the feature vectors F sparse and W CN N respectively."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-75",
"text": "F sparse (m i , p ij ) is the sparse feature vector obtained from (Durrett and Klein, 2014) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-76",
"text": "This vector captures various linguistic properties and statistics that have been discovered in the previous studies for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-77",
"text": "The representative features include the anchor text counts from Wikipedia, the string match indications with the title of the Wikipedia candidate pages, or the information about the shape of the queries for candidate generations (Francis-Landau et al., 2016) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-78",
"text": ", on the other hand, involves the cosine similarities between the representation vectors at multiple granularities of m i and p ij ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-79",
"text": "In particular:"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-80",
"text": "The intuition for this computation is that the similarities at different levels of contexts might help to enforce the potential topic compatibility between the contexts of the entity mentions and target candidates for EL (Francis-Landau et al., 2016) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-82",
"text": "**GLOBAL SIMILARITIES**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-104",
"text": "iv) WIKI : This dataset contains 10,000 randomly sampled Wikipedia articles."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-83",
"text": "In order to encapsulate the coherence among the entity mentions and their target entities, we run recurrent neural networks over the sequences of the representation vectors for the entity mentions (i.e, the vector sequences for the surface strings (s 1 ,s 2 , . . . ,s k ) and for the immediate contexts (c 1 ,c 2 , . . . ,c k )) and the target entities (i.e, the vector sequences for the page titles (t * 1 ,t * 2 , . . . ,t * k ) and for the body contents Given the hidden vector sequence, when predicting the target entity for the entity mention m i , we ensure that the target entity is consistent with the global information stored in h b i\u22121 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-84",
"text": "This is achieved by using the cosine similarities between h b i\u22121 and the representation vectors of each target candidate p ij of m i , (i.e, cos(h b i\u22121 ,t ij ) and cos(h b i\u22121 ,b ij )) as the global features for the ranking score."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-85",
"text": "We can repeat this process for the other representation vector sequences in both the entity mention side and the target entity side."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-86",
"text": "The resulting global features would then be grouped into a single feature vector to compute the global similarity score \u03c6 global (m 1 , m 2 , . . . , m i , P 1 , P 2 , . . . , P i ) as in the local similarity section."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-87",
"text": "An overview of the whole model is presented in Figure 1 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-88",
"text": "Regarding the reccurent function \u03a6, we employ the gated recurrent units (GRU) (Cho et al., 2014 ) to alleviate the \"vanishing gradient problem\" of RNN."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-89",
"text": "GRU is a simplified version of long-short term memory units (LSTM) that has been shown to achieve comparable performance (J\u00f3zefowicz et al., 2015) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-90",
"text": "Finally, for training, we jointly optimize the parameters for the CNNs, RNNs and weight vectors by maximizing the log-likelihood of a labeled training corpus."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-91",
"text": "We utilize the stochastic gradient descent algorithm and the AdaDelta update rule (Zeiler, 2012 Liverpool."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-92",
"text": "Each of the entity mentions has two entity candidate pages (either a football club or a city).The orange rectangles denote the CNN-induced representation vectorssi,ci,di,tij andbij."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-93",
"text": "The circles in red and green are the ranking scores for the target candidates, in which the green circles correspond to the correct target entities."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-94",
"text": "Finally, the circles in grey are the hidden vectors (i.e, the global vectors) of the RNNs running over the entity mentions."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-95",
"text": "We only show the global entity vectors in this figure to improve the visualization."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-97",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-99",
"text": "**DATASETS**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-100",
"text": "Following (Francis-Landau et al., 2016), we evaluate the models on 4 different entity linking datasets: i) ACE (Bentivogli et al., 2010 ): This corpus is from the 2005 evaluation of NIST."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-101",
"text": "It is also used in (Fahrni and Strube, 2014) and (Durrett and Klein, 2014) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-102",
"text": "ii) CoNLL-YAGO (Hoffart et al., 2011 ): This corpus is originally from the CoNLL 2003 shared task of named entity recognition for English."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-103",
"text": "iii) WP (Heath and Bizer, 2011) : This dataset consists of short snippets from Wikipedia."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-105",
"text": "The task is to disambiguate the links in each article 4 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-106",
"text": "For all the datasets, we use the standard data splits (for training data, test data and development data) as the previous works for comparable comparison (Francis-Landau et al., 2016)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-108",
"text": "**PARAMETERS AND RESOURCES**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-109",
"text": "For all the experiments below, in the CNN models to learn the distributed representations for the inputs, we use window sizes in the set L = {2, 3, 4, 5} for the convolution operation with the dimensionality v = 200 for each window size 5 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-110",
"text": "The non-linear function for transformation is G = tanh."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-111",
"text": "We employ the English Wikipedia dump from June 2016 as our reference knowledge base."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-112",
"text": "Regarding the input contexts for the entity mentions and the target candidates, we utilize the window size of 10 for the immediate context c i , and only extract the first 100 words in the documents for d i and b ij ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-113",
"text": "Finally, we pre-train the word embedings on the whole English Wikipedia dump using the word2vec toolkit (Mikolov et al., 2013) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-114",
"text": "The training parameters are set to the default values in this toolkit."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-115",
"text": "The dimensionality of the word embeddings is 300."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-116",
"text": "Note that every parameter and resource in this work is either taken from the previous work (Nguyen and Grishman, 2016b; Francis-Landau et al., 2016) or selected by the development data."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-118",
"text": "**EVALUATING THE GLOBAL FEATURES**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-119",
"text": "In this section, we evaluate the effectiveness of the global features for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-120",
"text": "In particular, we differentiate two types of global features based on the side of information we expect to enforce the coherence."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-121",
"text": "The first type of global features (global-mention) concerns the entity mention side and involves applying the global RNN models on the CNN-induced representation vectors of the entity mentions (i.e, the surface vectors (s 1 ,s 2 , . . . ,s k ) and the immediate context vectors (c 1 ,c 2 , . . . ,c k ) )."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-122",
"text": "The second type of global features (global-entity), on the other hand, focuses on the target entity side and models the coherence with the representation vectors of the target entities (i.e, the page title vectors (t * 1 ,t * 2 , . . . ,t * k ) and the body content vectors (b * 1 ,b * 2 , . . . ,b * k ))."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-123",
"text": "The most important observation from the table is that the global features, in general, help to improve the performance of the model on different datasets."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-124",
"text": "This is substantial on the ACE and CoNLL datasets when only one type of the global features (either global-mention or global-entity) is integrated into the model."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-125",
"text": "The combination of global-mention and global-entity is not very effective as it is actually worse than the performance of the individual global feature types."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-126",
"text": "This suggests that global-mention and global-entity might cover overlapping information and their combination would inject redundancy into the model."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-127",
"text": "The best performance is achieved by the global-entity features that would be used in all the evaluations below."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-129",
"text": "**COMPARING TO THE PREVIOUS WORK**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-130",
"text": "This section compares the proposed system (called Global-RNN) with the state-of-the-art models on our four datasets."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-131",
"text": "These systems include the neural network model in (Francis-Landau et al., 2016) , the joint model for entity analysis in (Durrett and Klein, 2014) and the AIDA-light system with two-stage mapping in (Nguyen et al., 2014b) 6 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-132",
"text": "Table 2 shows the performance of the systems on the test sets with the reference knowledge base of the June 2016 Wikipedia dump."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-133",
"text": "We also include the performance of the systems on the December 2014 Wikipedia dump that was used and provided by (Francis-Landau et al., 2016) for further and compatible comparison."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-135",
"text": "**SYSTEMS**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-136",
"text": "Wikipedia 2014 Wikipedia 2016 ACE CoNLL WP WIKI ACE CoNLL WP WIKI DK2014 (Durrett and Klein, 2014) 79 First, we see that the performance of the systems drop significantly when we switch from Wikipedia 2014 to Wikipedia 2016 (especially for the datasets ACE and CoNLL)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-137",
"text": "This is can be partly explained by the inclusion of new entities (pages) into Wikipedia from 2014 to 2016 that has made the entity mentions in the datasets more ambiguous 7 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-138",
"text": "Second and more importantly, Global-RNN significantly outperforms the all the compared models (except for the ACE dataset on Wikipedia 2014 and the WIKI dataset on Wikipedia 2016), thereby demonstrating the benefits of the joint modeling for local and global features via neural networks for EL in this work."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-140",
"text": "**DOMAIN ADAPTATION EXPERIMENTS**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-141",
"text": "The purpose of this section is to further evaluate the models in the domain adaptation setting to investigate their cross-domain robustness for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-142",
"text": "It is often observed in many natural language processing tasks that the performance of a model trained on a source domain would degrade significantly when it is applied to a different target domain (Blitzer et al., 2006; Daume, 2007; McClosky et al., 2010; Plank and Moschitti, 2013; Nguyen and Grishman, 2014a) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-143",
"text": "Such a performance loss originates from a variety of mismatches between the source and the target domains, including the differences in vocabulary, data distributions, styles etc."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-144",
"text": "This has motivated the domain adaptation research that aims to improve the cross-domain performance of the models by adaptation techniques."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-145",
"text": "One of the key strategies of the domain adaptation techniques is the search for the domain-independent features that are discriminative across different domains (Blitzer et al., 2006; Jiang, 2009; Plank and Moschitti, 2013; Nguyen and Grishman, 2014a) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-146",
"text": "These invariants serve as the connectors between different domains and help to transfer the knowledge from one domain to the others."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-147",
"text": "For EL, we hypothesize that the global coherence is an effective domain-independent feature that would help to improve the crossdomain performance of the models."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-148",
"text": "The intuition is that the entities mentioned in a document of any domains should be related to each other."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-149",
"text": "Eventually, we expect that the proposed model with global coherence features would be more robust to domain shifts than the local approach (Francis-Landau et al., 2016) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-151",
"text": "**DATASET**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-152",
"text": "We use the ACE dataset to evaluate the cross-domain performance of the models."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-153",
"text": "ACE involves documents in 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and webblogs (wl)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-154",
"text": "Following the common practice of domain adaptation research on this dataset (Plank and Moschitti, 2013; Nguyen et al., 2015c; Gormley et al., 2015) , we use news (the union of bn and nw) as the source domain and bc, cts, wl, un as four different target domains."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-155",
"text": "We take half of bc as the development set and use the remaining data for testing."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-156",
"text": "We note that news consists of formally written documents while a majority of the other domains is informal text, making the source and target domains very divergent in terms of vocabulary and styles (Plank and Moschitti, 2013) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-157",
"text": "Table 3 compares Global-RNN with the neural network EL model in (Francis-Landau et al., 2016) , the best reported model on the ACE dataset in the literature 8 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-158",
"text": "In this table, the models are trained on the source domain news, and evaluated on news itself (in-domain performance) (via 5-fold cross validation) as well as on the 4 target domains bc, cts, wl, un (out-of-domain performance The first observation from the table is that the performance of all the compared systems on the target domains is much worse than the corresponding in-domain performance."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-159",
"text": "In particular, the performance gap between the in-domain performance and the the worst out-of-domain performance (on the domain wl) is up to 10%, thus indicating the mismatches between the source and the target domains for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-160",
"text": "Second and most importantly, Global-RNN is consistently better than the model with only local features in (Francis-Landau et al., 2016) over all the target domains (although it is less pronounced in the cts domain)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-161",
"text": "This demonstrates the cross-domain robustness of the proposed model and confirms our hypothesis about the domain-independence of the global coherence features for EL."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-163",
"text": "**EVALUATION**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-165",
"text": "**ANALYSIS**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-166",
"text": "In order to better understand the performance gap in the domain adaptation experiments for EL, we visualize the representation vectors of the entity mentions in different domains."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-167",
"text": "In particular, after Global-RNN is trained, we retrieve the representation vectorsc i for the immediate contexts of the entity mentions in the source and target domains, project them into the 2-dimension space via the t-SNE algorithm and plot them."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-168",
"text": "Figure 2 shows the plot."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-169",
"text": "As we can see from the figure, the entity mentions in the target domains bc, cts, wl and un are quite separated from those of the source domain news, thereby explaining the performance loss in the domain adaption experiments."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-170",
"text": "It is not clear in Figure 2 why the models perform much worse on the target domains wl and un than the other domains (i.e, bc and cts)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-171",
"text": "We further investigate this problem by computing the similarities between the target domains and the source domain."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-172",
"text": "While there are several methods to estimate domain similarities (Plank and van Noord, 2011) , in this work, we employ the mean of the cosine similarities of every mention pairs in the two domains of interest."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-173",
"text": "Specifically, let E and F be the two domains of interest, and E = {e 1 , e 2 , . . . , e g } and F = {f 1 , f 2 , . . . , f w } be the sets of the representation vectors for the entity mentions in E and F respectively (g = |E|, w = |F |)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-174",
"text": "The similarity between E and F is then given by:"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-175",
"text": "cos(e i , f j ) gw Table 4 shows the similarities between the source domain news and each target domains bc, cts, wl and un with respect to the representation vectors of the immediate contextc i (context) and the target entity titlest * i (title) for the entity mentions m i ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-176",
"text": "We also include the similarities in which the representation vectors are the local feature vectors F CN N (m i , t * i ) in Equation 1 (interaction)."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-177",
"text": "The goal of the local feature similarities is to characterize how the entity mentions in different domains interact with their target entities."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-178",
"text": "It is clear from the table that wl is the most dissimilar domain from the source domain."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-179",
"text": "This is followed by un and partly explains the performance in Table 3 ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-181",
"text": "**RELATED WORK**"
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-182",
"text": "Entity linking or disambiguation has been studied extensively in NLP research, falling broadly into two major approaches: local and global disambiguation."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-183",
"text": "Both approaches share the goal of measuring the similarities between the entity mentions and the target candidates in the reference KB."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-184",
"text": "The local paradigm focuses on the internal structures of each separate mention-entity pair, covering the name string comparisons between the surfaces of the entity mentions and target candidates, entity popularity or entity type and so on (Bunescu and Pasca, 2006; Milne and Witten, 2008; Zheng et al., 2010; Ji and Grishman, 2011; Mendes et al., 2011; Cassidy et al., 2011; Shen et al., 2014) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-185",
"text": "In contrast, the global approach jointly maps all the entity mentions within documents to model the topical coherence."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-186",
"text": "Various techniques have been exploited for capturing such semantic consistency, including Wikipedia category agreement (Cucerzan, 2007) , Wikipedia link-based measures (Kulkarni et al., 2009; Hoffart et al., 2011; Shen et al., 2012) , Point-wise Mutual Information measures , integer linear programming (Cheng and Roth, 2013) , PageRank (Alhelbawy and Gaizauskas, 2014; Pershina et al., 2015) , stacked generalization (He et al., 2013a) , to name a few."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-187",
"text": "The entity linking techniques and systems have been actively evaluated at the NIST-organized Text Analysis Conference (Ji et al., 2014) ."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-188",
"text": "Neural networks are applied to entity linking very recently."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-189",
"text": "He et al. (2013b) learn enttiy representation via Stacked Denoising Auto-encoders."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-190",
"text": "Sun et al. (2015) employ convolutional neural networks and neural tensor networks to model mentions, entities and contexts while Francis-Landau et al. (2016) combine CNN-based representations with sparse features to improve the performance."
},
{
"sent_id": "c2a956a6ae0fb1ab338da01e5a5645-C001-191",
"text": "However, none of these work utilize recurrent neural networks to capture the coherence features as we do in this work."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-21"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-30",
"c2a956a6ae0fb1ab338da01e5a5645-C001-31",
"c2a956a6ae0fb1ab338da01e5a5645-C001-32",
"c2a956a6ae0fb1ab338da01e5a5645-C001-33"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-34",
"c2a956a6ae0fb1ab338da01e5a5645-C001-35",
"c2a956a6ae0fb1ab338da01e5a5645-C001-36",
"c2a956a6ae0fb1ab338da01e5a5645-C001-37"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-62"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-72"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-77"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-80"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-116"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-130",
"c2a956a6ae0fb1ab338da01e5a5645-C001-131",
"c2a956a6ae0fb1ab338da01e5a5645-C001-132"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-133"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-148",
"c2a956a6ae0fb1ab338da01e5a5645-C001-149"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-157"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-160",
"c2a956a6ae0fb1ab338da01e5a5645-C001-161"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-190",
"c2a956a6ae0fb1ab338da01e5a5645-C001-191"
]
],
"cite_sentences": [
"c2a956a6ae0fb1ab338da01e5a5645-C001-21",
"c2a956a6ae0fb1ab338da01e5a5645-C001-30",
"c2a956a6ae0fb1ab338da01e5a5645-C001-34",
"c2a956a6ae0fb1ab338da01e5a5645-C001-62",
"c2a956a6ae0fb1ab338da01e5a5645-C001-72",
"c2a956a6ae0fb1ab338da01e5a5645-C001-77",
"c2a956a6ae0fb1ab338da01e5a5645-C001-80",
"c2a956a6ae0fb1ab338da01e5a5645-C001-116",
"c2a956a6ae0fb1ab338da01e5a5645-C001-131",
"c2a956a6ae0fb1ab338da01e5a5645-C001-133",
"c2a956a6ae0fb1ab338da01e5a5645-C001-149",
"c2a956a6ae0fb1ab338da01e5a5645-C001-157",
"c2a956a6ae0fb1ab338da01e5a5645-C001-160",
"c2a956a6ae0fb1ab338da01e5a5645-C001-190"
]
},
"@MOT@": {
"gold_contexts": [
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-30",
"c2a956a6ae0fb1ab338da01e5a5645-C001-31",
"c2a956a6ae0fb1ab338da01e5a5645-C001-32",
"c2a956a6ae0fb1ab338da01e5a5645-C001-33"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-148",
"c2a956a6ae0fb1ab338da01e5a5645-C001-149"
]
],
"cite_sentences": [
"c2a956a6ae0fb1ab338da01e5a5645-C001-30",
"c2a956a6ae0fb1ab338da01e5a5645-C001-149"
]
},
"@SIM@": {
"gold_contexts": [
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-34",
"c2a956a6ae0fb1ab338da01e5a5645-C001-35",
"c2a956a6ae0fb1ab338da01e5a5645-C001-36",
"c2a956a6ae0fb1ab338da01e5a5645-C001-37"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-116"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-133"
]
],
"cite_sentences": [
"c2a956a6ae0fb1ab338da01e5a5645-C001-34",
"c2a956a6ae0fb1ab338da01e5a5645-C001-116",
"c2a956a6ae0fb1ab338da01e5a5645-C001-133"
]
},
"@USE@": {
"gold_contexts": [
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-62"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-72"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-100"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-106"
]
],
"cite_sentences": [
"c2a956a6ae0fb1ab338da01e5a5645-C001-62",
"c2a956a6ae0fb1ab338da01e5a5645-C001-72",
"c2a956a6ae0fb1ab338da01e5a5645-C001-100",
"c2a956a6ae0fb1ab338da01e5a5645-C001-106"
]
},
"@DIF@": {
"gold_contexts": [
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-160",
"c2a956a6ae0fb1ab338da01e5a5645-C001-161"
],
[
"c2a956a6ae0fb1ab338da01e5a5645-C001-190",
"c2a956a6ae0fb1ab338da01e5a5645-C001-191"
]
],
"cite_sentences": [
"c2a956a6ae0fb1ab338da01e5a5645-C001-160",
"c2a956a6ae0fb1ab338da01e5a5645-C001-190"
]
}
}
},
"ABC_2b836473cf682ed474b7cda1800f84_6": {
"x": [
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-2",
"text": "We propose a non-parametric Bayesian model for unsupervised semantic parsing."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-3",
"text": "Following Poon and Domingos (2009), we consider a semantic parsing setting where the goal is to (1) decompose the syntactic dependency tree of a sentence into fragments, (2) assign each of these fragments to a cluster of semantically equivalent syntactic structures, and (3) predict predicate-argument relations between the fragments."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-4",
"text": "We use hierarchical PitmanYor processes to model statistical dependencies between meaning representations of predicates and those of their arguments, as well as the clusters of their syntactic realizations."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-5",
"text": "We develop a modification of the MetropolisHastings split-merge sampler, resulting in an efficient inference algorithm for the model."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-6",
"text": "The method is experimentally evaluated by using the induced semantic representation for the question answering task in the biomedical domain."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-9",
"text": "Statistical approaches to semantic parsing have recently received considerable attention."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-10",
"text": "While some methods focus on predicting a complete formal representation of meaning (Zettlemoyer and Collins, 2005; Ge and Mooney, 2005; , others consider more shallow forms of representation (Carreras and M\u00e0rquez, 2005; Liang et al., 2009) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-11",
"text": "However, most of this research has concentrated on supervised methods requiring large amounts of labeled data."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-12",
"text": "Such annotated resources are scarce, expensive to create and even the largest of them tend to have low coverage (Palmer and Sporleder, 2010) , motivating the need for unsupervised or semi-supervised techniques."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-13",
"text": "Conversely, research in the closely related task of relation extraction has focused on unsupervised or minimally supervised methods (see, for example, (Lin and Pantel, 2001; Yates and Etzioni, 2009) )."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-14",
"text": "These approaches cluster semantically equivalent verbalizations of relations, often relying on syntactic fragments as features for relation extraction and clustering (Lin and Pantel, 2001; Banko et al., 2007) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-15",
"text": "The success of these methods suggests that semantic parsing can also be tackled as clustering of syntactic realizations of predicate-argument relations."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-16",
"text": "While a similar direction has been previously explored in (Swier and Stevenson, 2004; Abend et al., 2009; Lang and Lapata, 2010) , the recent work of (Poon and Domingos, 2009 ) takes it one step further by not only predicting predicate-argument structure of a sentence but also assigning sentence fragments to clusters of semantically similar expressions."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-17",
"text": "For example, for a pair of sentences on Figure 1, in addition to inducing predicate-argument structure, they aim to assign expressions \"Steelers\" and \"the Pittsburgh team\" to the same semantic class Steelers, and group expressions \"defeated\" and \"secured the victory over\"."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-18",
"text": "Such semantic representation can be useful for entailment or question answering tasks, as an entailment model can abstract away from specifics of syntactic and lexical realization relying instead on the induced semantic representation."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-19",
"text": "For example, the two sentences in Figure 1 have identical semantic representation, and therefore can be hypothesized to be equivalent."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-20",
"text": "From the statistical modeling point of view, joint learning of predicate-argument structure and discovery of semantic clusters of expressions can also be beneficial, because it results in a more compact model of selectional preference, less prone to the data-sparsity problem (Zapirain et al., 2010) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-21",
"text": "In this respect our model is similar to recent LDA-based models of selectional preference (Ritter et al., 2010; S\u00e9aghdha, 2010) , and can even be regarded as their recursive and non-parametric extension."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-22",
"text": "In this paper, we adopt the above definition of unsupervised semantic parsing and propose a Bayesian non-parametric approach which uses hierarchical Pitman-Yor (PY) processes (Pitman, 2002) to model statistical dependencies between predicate and argument clusters, as well as distributions over syntactic and lexical realizations of each cluster."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-23",
"text": "Our non-parametric model automatically discovers granularity of clustering appropriate for the dataset, unlike the parametric method of (Poon and Domingos, 2009) which have to perform model selection and use heuristics to penalize more complex models of semantics."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-24",
"text": "Additional benefits generally expected from Bayesian modeling include the ability to encode prior linguistic knowledge in the form of hyperpriors and the potential for more reliable modeling of smaller datasets."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-25",
"text": "More detailed discussion of relation between the Markov Logic Network (MLN) approach of (Poon and Domingos, 2009 ) and our non-parametric method is presented in Section 3."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-26",
"text": "Hierarchical Pitman-Yor processes (or their special case, hierarchical Dirichlet processes) have previously been used in NLP, for example, in the context of syntactic parsing (Liang et al., 2007; Johnson et al., 2007) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-27",
"text": "However, in all these cases the effective size of the state space (i.e., the number of sub-symbols in the infinite PCFG (Liang et al., 2007) , or the number of adapted productions in the adaptor grammar (Johnson et al., 2007) ) was not very large."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-28",
"text": "In our case, the state space size equals the total number of distinct semantic clusters, and, thus, is expected to be exceedingly large even for moderate datasets: for example, the MLN model induces 18,543 distinct clusters from 18,471 sentences of the GENIA corpus (Poon and Domingos, 2009 )."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-29",
"text": "This suggests that standard inference methods for hierarchical PY processes, such as Gibbs sampling, Metropolis-Hastings (MH) sampling with uniform proposals, or the structured mean-field algorithm, are unlikely to result in efficient inference: for example in standard Gibbs sampling all thousands of alternatives should be considered at each sampling move."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-30",
"text": "Instead, we use a split-merge MH sampling algorithm, which is a standard and efficient inference tool for non-hierarchical PY processes (Jain and Neal, 2000; Dahl, 2003) but has not previously been used in hierarchical setting."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-31",
"text": "We extend the sampler to include composition-decomposition of syntactic fragments in order to cluster fragments of variables size, as in the example Figure 1 , and also include the argument role-syntax alignment move which attempts to improve mapping between semantic roles and syntactic paths for some fixed predicate."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-32",
"text": "Evaluating unsupervised models is a challenging task."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-33",
"text": "We evaluate our model both qualitatively, examining the revealed clustering of syntactic structures, and quantitatively, on a question answering task."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-34",
"text": "In both cases, we follow (Poon and Domingos, 2009 ) in using the corpus of biomedical abstracts."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-35",
"text": "Our model achieves favorable results significantly outperforming the baselines, including state-of-theart methods for relation extraction, and achieves scores comparable to those of the MLN model."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-36",
"text": "The rest of the paper is structured as follows."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-37",
"text": "Section 2 begins with a definition of the semantic parsing task."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-38",
"text": "Sections 3 and 4 give background on the MLN model and the Pitman-Yor processes, respectively."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-39",
"text": "In Sections 5 and 6, we describe our model and the inference method."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-40",
"text": "Section 7 provides both qualitative and quantitative evaluation."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-41",
"text": "Finally, ad-ditional related work is presented in Section 8."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-43",
"text": "**SEMANTIC PARSING**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-44",
"text": "In this section, we briefly define the unsupervised semantic parsing task and underlying aspects and assumptions relevant to our model."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-45",
"text": "Unlike (Poon and Domingos, 2009 ), we do not use the lambda calculus formalism to define our task but rather treat it as an instance of frame-semantic parsing, or a specific type of semantic role labeling (Gildea and Jurafsky, 2002) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-46",
"text": "The reason for this is two-fold: first, the frame semantics view is more standard in computational linguistics, sufficient to describe induced semantic representation and convenient to relate our method to the previous work."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-47",
"text": "Second, lambda calculus is a considerably more powerful formalism than the predicate-argument structure used in frame semantics, normally supporting quantification and logical connectors (for example, negation and disjunction), neither of which is modeled by our model or in (Poon and Domingos, 2009) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-48",
"text": "In frame semantics, the meaning of a predicate is conveyed by a frame, a structure of related concepts that describes a situation, its participants and properties (Fillmore et al., 2003) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-49",
"text": "Each frame is characterized by a set of semantic roles (frame elements) corresponding to the arguments of the predicate."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-50",
"text": "It is evoked by a frame evoking element (a predicate)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-51",
"text": "The same frame can be evoked by different but semantically similar predicates: for example, both verbs \"buy\" and \"purchase\" evoke frame Commerce buy in FrameNet (Fillmore et al., 2003) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-52",
"text": "The aim of the semantic role labeling task is to identify all of the frames evoked in a sentence and label their semantic role fillers."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-53",
"text": "We extend this task and treat semantic parsing as recursive prediction of predicate-argument structure and clustering of argument fillers."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-54",
"text": "Thus, parsing a sentence into this representation involves (1) decomposing the sentence into lexical items (one or more words), (2) assigning a cluster label (a semantic frame or a cluster of argument fillers) to every lexical item, and (3) predicting argument-predicate relations between the lexical items."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-55",
"text": "This process is illustrated in Figure 1 ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-56",
"text": "For the leftmost example, the sentence is decomposed into three lexical items: \"Ravens\", \"defeated\" and \"Steelers\", and they are assigned to clusters In this work, we define a joint model for the labeling and argument identification stages."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-57",
"text": "Similarly to core semantic roles in FrameNet, semantic roles are treated as frame-specific in our model, as our model does not try to discover any correspondences between roles in different frames."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-58",
"text": "As you can see from the above description, frames (which groups predicates with similar meaning such as the WinPrize frame in our example) and clusters of argument fillers (Ravens and Steelers) are treated in our definition in a similar way."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-59",
"text": "For convenience, we will refer to both types of clusters as semantic classes."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-60",
"text": "1 This definition of semantic parsing is closely related to a realistic relation extraction setting, as both clustering of syntactic forms of relations (or extraction patterns) and clustering of argument fillers for these relations is crucial for automatic construction of knowledge bases (Yates and Etzioni, 2009) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-61",
"text": "In this paper, we make three assumptions."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-62",
"text": "First, we assume that each lexical item corresponds to a subtree of the syntactic dependency graph of the sentence."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-63",
"text": "This assumption is similar to the adjacency assumption in (Zettlemoyer and Collins, 2005) , though ours may be more appropriate for languages with free or semi-free word order, where syntactic structures are inherently non-projective."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-64",
"text": "Second, we assume that the semantic arguments are local in the dependency tree; that is, one lexical item can be a semantic argument of another one only if they are connected by an arc in the dependency tree."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-65",
"text": "This is a slight simplification of the semantic role labeling problem but one often made."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-66",
"text": "Thus, the argument identification and labeling stages consist of labeling each syntactic arc with a semantic role label."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-67",
"text": "In comparison, the MLN model does not explicitly assume contiguity of lexical items and does not make this directionality assumption but their clustering algorithm uses initialization and clusterization moves such that the resulting model also obeys both of these constraints."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-68",
"text": "Third, as in (Poon and Domingos, 2009 ), we do not model polysemy as we assume that each syntactic fragment corresponds to a single semantic class."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-69",
"text": "This is not a model assumption and is only used at inference as it reduces mixing time of the Markov chain."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-70",
"text": "It is not likely to be restrictive for the biomedical domain studied in our experiments."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-71",
"text": "As in some of the recent work on learning semantic representations (Eisenstein et al., 2009; Poon and Domingos, 2009 ), we assume that dependency structures are provided for every sentence."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-72",
"text": "This assumption allows us to construct models of semantics not Markovian within a sequence of words (see for an example a model described in (Liang et al., 2009) ), but rather Markovian within a dependency tree."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-73",
"text": "Though we include generation of the syntactic structure in our model, we would not expect this syntactic component to result in an accurate syntactic model, even if trained in a supervised way, as the chosen independence assumptions are overly simplistic."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-74",
"text": "In this way, we can use a simple generative story and build on top of the recent success in syntactic parsing."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-76",
"text": "**RELATION TO THE MLN APPROACH**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-77",
"text": "The work of (Poon and Domingos, 2009) models the joint probability of the dependency tree and its latent semantic representation using Markov Logic Networks (MLNs) (Richardson and Domingos, 2006), selecting parameters (weights of first-order clauses) to maximize the probability of the observed dependency structures."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-78",
"text": "For each sentence, the MLN induces a Markov network, an undirected graphical model with nodes corresponding to ground atoms and cliques corresponding to ground clauses."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-79",
"text": "The MLN is a powerful formalism and allows for modeling complex interactions between features of the input (syntactic trees) and the latent output (semantic representation); however, unsupervised learning of semantics with general MLNs can be prohibitively expensive."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-80",
"text": "The reason for this is that MLNs are undirected models: when trained to maximize the likelihood of syntactically annotated sentences, they require marginalization not only over semantic representations but also over the entire space of syntactic structures and lexical units."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-81",
"text": "Given the complexity of the semantic parsing task and the need to tackle large datasets, even approximate methods are likely to be infeasible."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-82",
"text": "In order to overcome this problem, (Poon and Domingos, 2009 ) group parameters and impose local normalization constraints within each group."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-83",
"text": "Given these normalization constraints, and additional structural constraints satisfied by the model, namely that the clauses should be engineered in such a way that they induce tree-structured graphs for every sentence, the parameters can be estimated by a variant of the EM algorithm."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-84",
"text": "The class of such restricted MLNs is equivalent to the class of directed graphical models over the same set of random variables corresponding to fragments of syntactic and semantic structure."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-85",
"text": "Given that the above constraints do not directly fit into the MLN methodology, we believe that it is more natural to regard their model as a directed model with an underlying generative story specifying how the semantic structure is generated and how the syntactic parse is drawn for this semantic structure."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-86",
"text": "This view would facilitate understanding what kind of features can easily be integrated into the model, simplify application of non-parametric Bayesian techniques and expedite the use of inference techniques designed specifically for directed models."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-87",
"text": "Our approach makes one step in this direction by proposing a non-parametric version of such a generative model."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-89",
"text": "**HIERARCHICAL PITMAN-YOR PROCESSES**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-90",
"text": "The central components of our non-parametric Bayesian model are Pitman-Yor (PY) processes, which are a generalization of Dirichlet processes (DPs) (Ferguson, 1973)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-91",
"text": "We use PY processes to model distributions of semantic classes appearing as arguments of other semantic classes."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-92",
"text": "We also use them to model distributions of syntactic realizations for each semantic class and distributions of syntactic dependency arcs for argument types."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-93",
"text": "In this section we present relevant background on PY processes."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-94",
"text": "For a more detailed consideration we refer the reader to ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-95",
"text": "The Pitman-Yor process over a set S, denoted PY(\u03b1, \u03b2, H), is a stochastic process whose samples G_0 constitute probability measures on partitions of S. In practice, we do not need to draw the measures, as they can be analytically marginalized out."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-96",
"text": "The conditional distribution of x_{j+1} given the previous j draws, with G_0 marginalized out, follows the rule of (Blackwell and MacQueen, 1973): x_{j+1} takes an existing value \u03c6_k with probability (j_k \u2212 \u03b2) / (j + \u03b1), and a new value drawn from H with probability (\u03b1 + K\u03b2) / (j + \u03b1),"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-97",
"text": "where \u03c6_1, . . . , \u03c6_K are the K distinct values assigned to x_1, x_2, . . . , x_j."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-98",
"text": "The number of times \u03c6_k was assigned is denoted j_k, so that j = \u2211_{k=1}^{K} j_k."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-99",
"text": "The parameter \u03b2 < 1 controls how heavy the tail of the distribution is: as it approaches 1, a new value is assigned to almost every draw, and when \u03b2 = 0 the PY process reduces to a DP."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-100",
"text": "The expected value of K scales as O(\u03b1 n^\u03b2) with the number of draws n, while it scales only logarithmically for DPs."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-101",
"text": "PY processes are expected to be more appropriate for many NLP problems, as they model the power-law distributions common in natural language (Teh, 2006)."
},
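The marginalized predictive rule above can be sketched concretely as follows (our own illustration, not code from the paper; `alpha` is the concentration and `beta` the discount parameter, matching the notation above):

```python
def py_predictive(counts, alpha, beta):
    """Predictive probabilities under PY(alpha, beta) with the measure
    marginalized out: an existing value k has probability
    (j_k - beta) / (j + alpha), and a new value has probability
    (alpha + K * beta) / (j + alpha)."""
    j = sum(counts)   # total number of previous draws
    K = len(counts)   # number of distinct values drawn so far
    probs = [(c - beta) / (j + alpha) for c in counts]
    probs.append((alpha + K * beta) / (j + alpha))  # probability of a new value
    return probs
```

With beta = 0 this reduces to the Dirichlet-process (Chinese restaurant) rule; increasing beta moves probability mass towards new values, producing the heavier power-law tail discussed above.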
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-102",
"text": "Hierarchical Dirichlet Processes (HDPs) or hierarchical PY processes are used if the goal is to draw several related probability measures for the same set S. For example, they can be used to generate the transition distributions of a Markov model (HDP-HMM; Beal et al., 2002)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-103",
"text": "For such an HMM, the top-level state proportions are drawn from the top-level stick-breaking construction \u03b3 ~ GEM(\u03b1, \u03b2), and then the individual transition distributions for every state z = 1, 2, . . ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-104",
"text": "\u03c6_z are drawn from PY(\u03b1', \u03b2', \u03b3)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-105",
"text": "The parameters \u03b1' and \u03b2' control how similar the individual transition distributions \u03c6_z are to the top-level state proportions \u03b3, or, equivalently, how similar the transition distributions are to each other."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-107",
"text": "**A MODEL FOR SEMANTIC PARSING**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-108",
"text": "Our model of semantics associates with each semantic class a set of distributions which govern the generation of the corresponding syntactic realizations and the selection of semantic classes for its arguments."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-109",
"text": "Each sentence is generated starting from the root of its dependency tree, recursively drawing a semantic class, its syntactic realization, arguments and semantic classes for the arguments."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-110",
"text": "Below we describe the model by first defining the set of the model parameters and then explaining the generation of individual sentences."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-111",
"text": "The generative story is formally presented in Figure 2 ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-112",
"text": "We associate with each semantic class c, c = 1, 2, . . . , a distribution \u03c6_c over its syntactic realizations."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-113",
"text": "For example, for the frame WinPrize illustrated in Figure 1 this distribution would concentrate at syntactic fragments corresponding to lexical items \"defeated\", \"secured the victory\" and \"won\"."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-114",
"text": "The distribution is drawn from DP(w^(C), H^(C)), where H^(C) is a base measure over syntactic subtrees."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-115",
"text": "We use a simple generative process to define the probability of a subtree; the underlying model is similar to the base measures used in Bayesian tree-substitution grammars (Cohn et al., 2009)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-116",
"text": "We start by generating a word w from the treebank distribution; then we decide on the number of dependents of w using the geometric distribution Geom(q^(C))."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-117",
"text": "For every dependent we generate a dependency relation r and a lexical form w' from P(r|w) P(w'|r), where the probabilities P are based on add-0.1 smoothed treebank counts."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-118",
"text": "The process is continued recursively."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-119",
"text": "The smaller the parameter q^(C), the lower the probability assigned to larger subtrees."
},
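The subtree base measure just described can be sketched as follows (a toy illustration of ours with made-up treebank statistics; the real model uses smoothed counts from a parsed corpus):

```python
import random

rng = random.Random(2)

# Hypothetical treebank statistics, for illustration only.
WORDS = ["win", "victory", "team"]                      # head-word distribution
REL_GIVEN_WORD = {"win": ["dobj", "nsubj"],             # stands in for P(r | w)
                  "victory": ["det"], "team": ["det"]}
WORD_GIVEN_REL = {"dobj": ["victory"], "nsubj": ["team"], "det": ["the"]}

def gen_subtree(q, w=None, depth=0, max_depth=3):
    """Generate a dependency subtree under the base-measure sketch:
    draw a head word, then a Geom(q) number of dependents, recursing.
    The smaller q is, the less probable larger subtrees become."""
    if w is None:
        w = rng.choice(WORDS)                           # head ~ treebank distribution
    node = {"word": w, "deps": []}
    while depth < max_depth and rng.random() < q:       # number of dependents ~ Geom(q)
        r = rng.choice(REL_GIVEN_WORD.get(w, ["dep"]))  # relation r ~ P(r | w)
        wd = rng.choice(WORD_GIVEN_REL.get(r, ["the"])) # dependent word w' ~ P(w' | r)
        node["deps"].append((r, gen_subtree(q, wd, depth + 1, max_depth)))
    return node
```

The `max_depth` cap is only there to keep the toy sampler finite; in the model the geometric parameter q^(C) alone controls the expected subtree size.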
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-120",
"text": "Parameters \u03c8_{c,t} and \u03c8^+_{c,t}, t = 1, . . . , T, define a distribution over vectors (m_1, m_2, . . . , m_T), where m_t is the number of times an argument of type t appears for a given semantic frame occurrence."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-121",
"text": "For the frame WinPrize these parameters would enforce that there exists exactly one Winner and exactly one Opponent for each occurrence of WinPrize."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-122",
"text": "The parameter \u03c8_{c,t} defines the probability of having at least one argument of type t. If 0 is drawn from \u03c8_{c,t} then m_t = 0; otherwise the number of additional arguments of type t (m_t \u2212 1) is drawn from the geometric distribution Geom(\u03c8^+_{c,t})."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-123",
"text": "This generative story is flexible enough to accommodate both argument types which appear at most once per semantic class occurrence (e.g., agents), and argument types which frequently appear multiple times per semantic class occurrence (e.g., arguments corresponding to descriptors)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-124",
"text": "Parameters \u03c6_{c,t}, t = 1, . . . , T, define the distributions over syntactic paths for the argument type t. In our example, for the argument type Opponent, this distribution would associate most of the probability mass with the relations pp_over, dobj and pp_against."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-125",
"text": "These distributions are drawn from DP(w^(A), H^(A))."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-126",
"text": "In this paper we only consider paths consisting of a single relation; therefore the base probability distribution H^(A) is just the normalized frequencies of dependency relations in the treebank."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-127",
"text": "The crucial part of the model is the set of selection-preference parameters \u03b8_{c,t}, the distributions over semantic classes c' for each argument type t of class c. For the arguments Winner and Opponent of the frame WinPrize, these distributions would assign most of the probability mass to semantic classes denoting teams or players."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-128",
"text": "Distributions \u03b8_{c,t} are drawn from a hierarchical PY process: first, top-level proportions of classes \u03b3 are drawn from GEM(\u03b1_0, \u03b2_0), and then the individual distributions \u03b8_{c,t} over c' are chosen from PY(\u03b1, \u03b2, \u03b3)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-129",
"text": "For each sentence, we first generate a class corresponding to the root of the dependency tree from the root-specific distribution of semantic classes \u03b8_{root}."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-130",
"text": "Then we recursively generate classes for the entire sentence."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-131",
"text": "For a class c, we generate the syntactic realization s and for each of the T types, decide how many arguments of that type to generate (see GenSemClass in Figure 2 )."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-132",
"text": "Then we generate each of the arguments (see GenArgument) by first generating a syntactic arc a_{c,t}, choosing a class c'_{c,t} as its filler and, finally, recursing."
},
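A minimal sketch of this recursive generative story (with hypothetical two-class toy parameters of our own; the frame and role names echo the WinPrize running example and are not taken from a trained model):

```python
import random

rng = random.Random(1)

# Toy parameters (illustrative values only).
REALIZATIONS = {0: ["won", "defeated"], 1: ["team", "player"]}  # phi_c
ARG_TYPES    = {0: ["Winner"], 1: []}
PSI          = {(0, "Winner"): 1.0}       # psi_{c,t}: P(at least one argument)
PSI_PLUS     = {(0, "Winner"): 0.0}       # psi+_{c,t}: Geom parameter for extras
ARC          = {(0, "Winner"): ["nsubj"]} # phi_{c,t}: syntactic arcs
FILLER       = {(0, "Winner"): [1]}       # theta_{c,t}: filler classes

def gen_sem_class(c):
    """GenSemClass: draw a syntactic realization, then the argument counts."""
    node = {"class": c, "realization": rng.choice(REALIZATIONS[c]), "args": []}
    for t in ARG_TYPES[c]:
        if rng.random() < PSI[(c, t)]:              # at least one argument of type t
            m = 1
            while rng.random() < PSI_PLUS[(c, t)]:  # extra arguments ~ Geom(psi+)
                m += 1
            for _ in range(m):
                node["args"].append(gen_argument(c, t))
    return node

def gen_argument(c, t):
    """GenArgument: draw the arc a_{c,t}, the filler class, and recurse."""
    arc = rng.choice(ARC[(c, t)])
    filler = rng.choice(FILLER[(c, t)])
    return {"type": t, "arc": arc, "child": gen_sem_class(filler)}
```

In the full model each `rng.choice` would instead sample from the DP- or PY-distributed parameters described above; the toy tables only make the control flow of the generative story explicit.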
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-134",
"text": "**INFERENCE**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-135",
"text": "In our model, latent states, modeled with hierarchical PY processes, correspond to distinct semantic classes and, therefore, their number is expected to be very large for any reasonable model of semantics."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-136",
"text": "As a result, many standard inference techniques, such as Gibbs sampling or the structured mean-field method, are unlikely to result in tractable inference."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-137",
"text": "Among the standard and most efficient samplers for non-hierarchical PY processes are split-merge Metropolis-Hastings (MH) samplers (Jain and Neal, 2000; Dahl, 2003)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-138",
"text": "In this section we explain how split-merge samplers can be applied to our model."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-140",
"text": "**SPLIT AND MERGE MOVES**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-141",
"text": "On each move, split-merge samplers decide either to merge two states into one (in our case, merge two semantic classes), or split one state into two."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-142",
"text": "These moves can be computed efficiently for our model of semantics."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-143",
"text": "Note that for any reasonable model of semantics only a small subset of the entire set of semantic classes can be used as an argument for some fixed semantic class due to selectional preferences exhibited by predicates."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-144",
"text": "For instance, only teams or players can fill arguments of the frame WinPrize in our running example."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-145",
"text": "As a result, only a small number of terms in the joint distribution have to be evaluated for every move we may consider."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-146",
"text": "When estimating the model, we start with assigning each distinct word (or, more precisely, a tuple of a word's stem and its part-of-speech tag) to an individual semantic class."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-147",
"text": "Then we iterate: we select a random pair of class occurrences and decide, at random, whether to attempt a split-merge move or a compose-decompose move."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-149",
"text": "**COMPOSE AND DECOMPOSE MOVES**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-150",
"text": "The compose-decompose operations modify syntactic fragments assigned to semantic classes, composing two neighboring dependency sub-trees or decomposing a dependency sub-tree."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-151",
"text": "If the two randomly-selected syntactic fragments s and s' correspond to different classes, c and c', we attempt to compose them into \u015d and create a new semantic class \u0109."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-152",
"text": "All occurrences of \u015d are assigned to this new class \u0109."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-153",
"text": "For example, if two randomly-selected occurrences have syntactic realizations \"secure\" and \"victory\", they can be composed to obtain the syntactic fragment \"secure \u2212dobj\u2192 victory\"."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-154",
"text": "This fragment will be assigned to a new semantic class which can later be merged with other classes, such as the ones containing syntactic realizations \"defeat\" or \"win\"."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-155",
"text": "Conversely, if both randomly-selected syntactic fragments are already composed in the corresponding class, we attempt to split them."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-157",
"text": "**ROLE-SYNTAX ALIGNMENT MOVE**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-158",
"text": "Merge, compose and decompose moves require recomputing the mapping between argument types (semantic roles) and syntactic fragments."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-159",
"text": "Computing the best statistical mapping is infeasible and proposing a random mapping will result in many attempted moves being rejected."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-160",
"text": "Instead we use a greedy randomized search method called Gibbs scan (Dahl, 2003) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-161",
"text": "Though it is part of the three moves above, this alignment move is also used on its own to induce semantic arguments for classes (frames) with a single syntactic realization."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-162",
"text": "The Gibbs scan procedure is also used during the split move to select one of the newly introduced classes for each considered syntactic fragment."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-164",
"text": "**INFORMED PROPOSALS**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-165",
"text": "Since the number of classes is very large, selecting examples at random would result in a relatively low proportion of moves getting accepted, and, consequently, in a slow-mixing Markov chain."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-166",
"text": "Instead of selecting both class occurrences uniformly, we select the first occurrence from a uniform distribution and then use a simple but effective proposal distribution for selecting the second class occurrence."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-167",
"text": "Let us denote the class corresponding to the first occurrence as c_1 and its syntactic realization as s_1, with head word w_1."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-168",
"text": "We begin by selecting uniformly at random whether to attempt a compose-decompose or a split-merge move."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-169",
"text": "If we chose a compose-decompose move, we look for words (children) which can be attached below the syntactic fragment s_1."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-170",
"text": "We use the normalized counts of these words conditioned on the parent s_1 to select the second word w_2."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-171",
"text": "We then select a random occurrence of w_2; if it is part of the syntactic realization of c_1, then a decompose move is attempted."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-172",
"text": "Otherwise, we try to compose the corresponding clusters together."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-173",
"text": "If we selected a split-merge move, we use a distribution based on the cosine similarity of lexical contexts of the words."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-174",
"text": "The context is represented as a vector of counts of all pairs of the form (head word, dependency type) and (dependent, dependency type)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-175",
"text": "So, instead of selecting a word occurrence uniformly, each occurrence of every word w_2 is weighted by its similarity to w_1, where the similarity is based on the cosine distance."
},
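The split-merge proposal just described can be sketched as follows (toy context counts of our own invention; a real implementation would collect the (head word, dependency type) and (dependent, dependency type) counts from the corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(cnt * v[k] for k, cnt in u.items() if k in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def proposal_weights(w1, contexts):
    """Weight every other word by the similarity of its syntactic context
    vector to that of w1; the second occurrence is then sampled in
    proportion to these weights instead of uniformly."""
    c1 = contexts[w1]
    return {w: cosine(c1, c) for w, c in contexts.items() if w != w1}
```

Because the weights depend only on the fixed syntactic representations, they can indeed be precomputed once at initialization, as the text notes.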
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-176",
"text": "As the moves depend only on syntactic representations, all the proposal distributions can be computed once, at the initialization stage."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-179",
"text": "**EMPIRICAL EVALUATION**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-180",
"text": "We induced a semantic representation over a collection of texts and evaluated it by answering questions about the knowledge contained in the corpus."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-181",
"text": "We used the GENIA corpus (Kim et al., 2003) , a dataset of 1999 biomedical abstracts, and a set of questions produced by (Poon and Domingos, 2009) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-182",
"text": "An example question is shown in Figure 3."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-183",
"text": "All model hyperpriors were set to maximize the posterior, except for w^(A) and w^(C), which were set to 1e\u221210 and 1e\u221235, respectively."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-184",
"text": "Inference was run for around 300,000 sampling iterations, until the percentage of accepted split-merge moves dropped below 0.05%."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-185",
"text": "Let us examine some of the induced semantic classes (Table 1): their syntactic realizations have a clear semantic connection."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-186",
"text": "Cluster 6, for example, clusters lymphocytes with the exception of thymocyte, a type of cell which generates T cells."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-187",
"text": "Cluster 8 contains verbs roughly corresponding to Cause change of position on a scale frame in FrameNet."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-188",
"text": "Verbs in class 9 are used in the context of providing support for a finding or an action, and many of them are listed as evoking elements for the Evidence frame in FrameNet."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-189",
"text": "Argument types of the induced classes also show a tendency to correspond to semantic roles."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-190",
"text": "For example, an argument type of class 2 is modeled as a distribution over two argument parts, prep of and prep from."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-191",
"text": "The corresponding arguments define the origin of the cells (transgenic mouse, smoker, volunteer, donor, . . . ) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-192",
"text": "We now turn to the QA task and compare our model (USP-BAYES) with the results of baselines considered in (Poon and Domingos, 2009 )."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-193",
"text": "The first set of baselines looks for answers by attempting to match a verb and its argument in the question with the input text."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-194",
"text": "The first version (KW) simply returns the rest of the sentence on the other side of the verb, while the second (KW-SYN) uses syntactic information to extract the subject or the object of the verb."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-195",
"text": "Other baselines are based on state-of-the-art relation extraction systems."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-196",
"text": "When the extracted relation and one of the arguments match those in a given question, the second argument is returned as an answer."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-197",
"text": "The systems include TextRunner (TR) (Banko et al., 2007) , RESOLVER (RS) (Yates and Etzioni, 2009) and DIRT (Lin and Pantel, 2001 )."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-198",
"text": "The EXACT versions of the methods return answers when they match the question argument exactly, and the SUB versions produce answers containing the question argument as a substring."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-199",
"text": "Similarly to the MLN system (USP-MLN), we generate answers as follows."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-200",
"text": "We use our trained model to parse a question, i.e. recursively decompose it into lexical items and assign them to semantic classes induced at training."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-201",
"text": "Using this semantic representation, we look for the type of an argument missing in the question, which, if found, is reported as an answer."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-202",
"text": "It is clear that overly coarse clusters of argument fillers or clustering of semantically related but not equivalent relations can hurt precision for this evaluation method."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-203",
"text": "Each system is evaluated by counting the answers it generates, and computing the accuracy of those answers."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-204",
"text": "Table 2 summarizes the results."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-205",
"text": "First, both USP models significantly outperform all the other baselines: even though the accuracies of KW-SYN and TR-EXACT are comparable to ours, the numbers of correct answers returned by KW-SYN and TR-EXACT are 4 and 11 times smaller, respectively, than that of USP-BAYES."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-206",
"text": "While we do not beat the MLN baseline, the difference is not significant."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-207",
"text": "The effective number of questions is relatively small (less than 80 different questions are answered by any of the models)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-208",
"text": "More than 50% of USP-BAYES mistakes were due to wrong interpretation of only 5 different questions."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-209",
"text": "From another point of view, most of the mistakes are explained [Figure 3, first example: Question: What does cyclosporin A suppress?]"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-210",
"text": "[Figure 3: Answer: expression of EGR-2. Sentence: As with EGR-3, expression of EGR-2 was blocked by cyclosporin A.]"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-211",
"text": "[Figure 3, second example: Question: What inhibits tnf-alpha?]"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-212",
"text": "[Figure 3: Answer: IL-10. Sentence: Our previous studies in human monocytes have demonstrated that interleukin (IL) -10 inhibits lipopolysaccharide (LPS) -stimulated production of inflammatory cytokines, IL-1 beta, IL-6, IL-8, and tumor necrosis factor (TNF) -alpha by blocking gene transcription.]"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-213",
"text": "by overly coarse clustering corresponding to just 3 classes: 30%, 25% and 20% of errors are due to clusters 6, 8 and 12 (Table 1), respectively."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-214",
"text": "Though all these clusters have a clear semantic interpretation (white blood cells, predicates corresponding to changes, and cytokines associated with cancer progression, respectively), they appear to be too coarse for the QA method we use in our experiments."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-215",
"text": "Though it is likely that tuning and different heuristics may result in better scores, we chose not to perform excessive tuning, as the evaluation dataset is fairly small."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-217",
"text": "**RELATED WORK**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-218",
"text": "There is a growing body of work on statistical learning for different versions of the semantic parsing problem (e.g., Gildea and Jurafsky, 2002; Zettlemoyer and Collins, 2005; Ge and Mooney, 2005); however, most of these methods rely on human annotation or some weaker form of supervision (Kate and Mooney, 2007; Liang et al., 2009; Titov and Kozhevnikov, 2010; Clarke et al., 2010), and very little research has considered the unsupervised setting."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-219",
"text": "In addition to the MLN model (Poon and Domingos, 2009 ), another unsupervised method has been proposed in (Goldwasser et al., 2011) ."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-220",
"text": "In that work, the task is to predict a logical formula, and the only supervision used is a lexicon providing a small number of examples for every logical symbol."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-221",
"text": "A form of self-training is then used to bootstrap the model."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-222",
"text": "Unsupervised semantic role labeling with a generative model has also been considered (Grenager and Manning, 2006); however, that work does not attempt to discover frames and deals only with isolated predicates."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-223",
"text": "Another generative model for SRL has been proposed in (Thompson et al., 2003) , but the parameters were estimated from fully annotated data."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-224",
"text": "The unsupervised setting has also been considered for the related problem of learning narrative schemas (Chambers and Jurafsky, 2009)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-225",
"text": "However, their approach is quite different from our Bayesian model as it relies on similarity functions."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-226",
"text": "Though in this work we focus solely on the unsupervised setting, there has been some successful work on semi-supervised semantic role labeling, including the FrameNet version of the problem (F\u00fcrstenau and Lapata, 2009)."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-227",
"text": "Their method exploits graph alignments between labeled and unlabeled examples, and, therefore, crucially relies on the availability of labeled examples."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-228",
"text": "----------------------------------"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-229",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-230",
"text": "In this work, we introduced a non-parametric Bayesian model for the semantic parsing problem based on the hierarchical Pitman-Yor process."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-231",
"text": "The model defines a generative story for recursive generation of lexical items, syntactic and semantic structures."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-232",
"text": "We extend the split-merge MH sampling algorithm to include composition-decomposition moves, and exploit the properties of our task to make it efficient in the hierarchical setting we consider."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-233",
"text": "We plan to explore at least two directions in our future work."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-234",
"text": "First, we would like to relax some of the unrealistic assumptions made in our model: for example, proper modeling of alternations requires joint generation of syntactic realizations for predicate-argument relations (Grenager and Manning, 2006; Lang and Lapata, 2010); similarly, proper modeling of nominalization implies support for arguments not immediately local in the syntactic structure."
},
{
"sent_id": "2b836473cf682ed474b7cda1800f84-C001-235",
"text": "The second general direction is the use of the unsupervised methods we propose to expand the coverage of existing semantic resources, which typically require substantial human effort to produce."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"2b836473cf682ed474b7cda1800f84-C001-11",
"2b836473cf682ed474b7cda1800f84-C001-14",
"2b836473cf682ed474b7cda1800f84-C001-15",
"2b836473cf682ed474b7cda1800f84-C001-16",
"2b836473cf682ed474b7cda1800f84-C001-3",
"2b836473cf682ed474b7cda1800f84-C001-9"
],
[
"2b836473cf682ed474b7cda1800f84-C001-25"
],
[
"2b836473cf682ed474b7cda1800f84-C001-28"
],
[
"2b836473cf682ed474b7cda1800f84-C001-45",
"2b836473cf682ed474b7cda1800f84-C001-46",
"2b836473cf682ed474b7cda1800f84-C001-47"
],
[
"2b836473cf682ed474b7cda1800f84-C001-77",
"2b836473cf682ed474b7cda1800f84-C001-78",
"2b836473cf682ed474b7cda1800f84-C001-79",
"2b836473cf682ed474b7cda1800f84-C001-80",
"2b836473cf682ed474b7cda1800f84-C001-81"
],
[
"2b836473cf682ed474b7cda1800f84-C001-82",
"2b836473cf682ed474b7cda1800f84-C001-83",
"2b836473cf682ed474b7cda1800f84-C001-84",
"2b836473cf682ed474b7cda1800f84-C001-85",
"2b836473cf682ed474b7cda1800f84-C001-86",
"2b836473cf682ed474b7cda1800f84-C001-87"
],
[
"2b836473cf682ed474b7cda1800f84-C001-219"
]
],
"cite_sentences": [
"2b836473cf682ed474b7cda1800f84-C001-16",
"2b836473cf682ed474b7cda1800f84-C001-25",
"2b836473cf682ed474b7cda1800f84-C001-28",
"2b836473cf682ed474b7cda1800f84-C001-45",
"2b836473cf682ed474b7cda1800f84-C001-47",
"2b836473cf682ed474b7cda1800f84-C001-77",
"2b836473cf682ed474b7cda1800f84-C001-82",
"2b836473cf682ed474b7cda1800f84-C001-219"
]
},
"@DIF@": {
"gold_contexts": [
[
"2b836473cf682ed474b7cda1800f84-C001-23"
],
[
"2b836473cf682ed474b7cda1800f84-C001-45",
"2b836473cf682ed474b7cda1800f84-C001-46",
"2b836473cf682ed474b7cda1800f84-C001-47"
],
[
"2b836473cf682ed474b7cda1800f84-C001-82",
"2b836473cf682ed474b7cda1800f84-C001-83",
"2b836473cf682ed474b7cda1800f84-C001-84",
"2b836473cf682ed474b7cda1800f84-C001-85",
"2b836473cf682ed474b7cda1800f84-C001-86",
"2b836473cf682ed474b7cda1800f84-C001-87"
]
],
"cite_sentences": [
"2b836473cf682ed474b7cda1800f84-C001-23",
"2b836473cf682ed474b7cda1800f84-C001-45",
"2b836473cf682ed474b7cda1800f84-C001-47",
"2b836473cf682ed474b7cda1800f84-C001-82"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"2b836473cf682ed474b7cda1800f84-C001-28"
]
],
"cite_sentences": [
"2b836473cf682ed474b7cda1800f84-C001-28"
]
},
"@USE@": {
"gold_contexts": [
[
"2b836473cf682ed474b7cda1800f84-C001-33",
"2b836473cf682ed474b7cda1800f84-C001-34"
],
[
"2b836473cf682ed474b7cda1800f84-C001-181"
],
[
"2b836473cf682ed474b7cda1800f84-C001-192"
]
],
"cite_sentences": [
"2b836473cf682ed474b7cda1800f84-C001-34",
"2b836473cf682ed474b7cda1800f84-C001-181",
"2b836473cf682ed474b7cda1800f84-C001-192"
]
},
"@SIM@": {
"gold_contexts": [
[
"2b836473cf682ed474b7cda1800f84-C001-68"
],
[
"2b836473cf682ed474b7cda1800f84-C001-71"
]
],
"cite_sentences": [
"2b836473cf682ed474b7cda1800f84-C001-68",
"2b836473cf682ed474b7cda1800f84-C001-71"
]
},
"@MOT@": {
"gold_contexts": [
[
"2b836473cf682ed474b7cda1800f84-C001-77",
"2b836473cf682ed474b7cda1800f84-C001-78",
"2b836473cf682ed474b7cda1800f84-C001-79",
"2b836473cf682ed474b7cda1800f84-C001-80",
"2b836473cf682ed474b7cda1800f84-C001-81"
],
[
"2b836473cf682ed474b7cda1800f84-C001-82",
"2b836473cf682ed474b7cda1800f84-C001-83",
"2b836473cf682ed474b7cda1800f84-C001-84",
"2b836473cf682ed474b7cda1800f84-C001-85",
"2b836473cf682ed474b7cda1800f84-C001-86",
"2b836473cf682ed474b7cda1800f84-C001-87"
]
],
"cite_sentences": [
"2b836473cf682ed474b7cda1800f84-C001-77",
"2b836473cf682ed474b7cda1800f84-C001-82"
]
}
}
},
"ABC_7895613ddd09696bbee4143c4359b0_6": {
"x": [
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-70",
"text": "**MODEL**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-2",
"text": "Recent work has highlighted the advantage of jointly learning grounded sentence representations from multiple languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-3",
"text": "However, the data used in these studies has been limited to an aligned scenario: the same images annotated with sentences in multiple languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-4",
"text": "We focus on the more realistic disjoint scenario in which there is no overlap between the images in multilingual image-caption datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-5",
"text": "We confirm that training with aligned data results in better grounded sentence representations than training with disjoint data, as measured by image-sentence retrieval performance."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-6",
"text": "In order to close this gap in performance, we propose a pseudopairing method to generate synthetically aligned English-German-image triplets from the disjoint sets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-68",
"text": "**EXPERIMENTAL PROTOCOL**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-7",
"text": "The method works by first training a model on the disjoint data, and then creating new triples across datasets using sentence similarity under the learned model."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-8",
"text": "Experiments show that pseudopairs improve image-sentence retrieval performance compared to disjoint training, despite requiring no external data or models."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-9",
"text": "However, we do find that using an external machine translation model to generate the synthetic data sets results in better performance."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-12",
"text": "The perceptual-motor system plays an important role in concept acquisition and representation, and in learning the meaning of linguistic expressions (Pulverm\u00fcller, 2005) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-13",
"text": "In natural language processing, many approaches have been proposed that integrate visual information in the learning of word and sentence representations, highlighting the benefits of visually grounded representations (Lazaridou et al., 2015; Baroni, 2016; Elliott and K\u00e1d\u00e1r, 2017) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-14",
"text": "In these approaches the visual world is taken as a naturally occurring meaning representation for linguistic utterances, grounding language in perceptual reality."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-15",
"text": "Recent work has shown that we can learn better visually grounded representations of sentences by training image-sentence ranking models on multiple languages (Gella et al., 2017; K\u00e1d\u00e1r et al., 2018) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-16",
"text": "This line of research has focused on training models on datasets where the same images are annotated with sentences in multiple languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-17",
"text": "This alignment has either been in the form of translation pairs (e.g. German, English, French, and Czech in Multi30K) or independently collected sentences (English and Japanese in STAIR (Yoshikawa et al., 2017))."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-18",
"text": "In this paper, we consider the problem of training an image-sentence ranking model using image-caption collections in different languages with non-overlapping images drawn from different sources."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-19",
"text": "We call these collections disjoint datasets and argue that it is easier to find disjoint datasets than aligned datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-20",
"text": "This is especially the case for datasets in different languages, e.g. digital museum collections, newspaper collections (Ramisa et al., 2017), or the images used in Wikipedia articles (Schamoni et al., 2018)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-21",
"text": "Multilingual aligned datasets, by contrast, are small and expensive to collect: there is a need for methods that can train image-sentence ranking models on disjoint multilingual image datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-22",
"text": "K\u00e1d\u00e1r et al. (2018) claim that a multilingual image-sentence ranking model trained on disjoint datasets performs on-par with a model trained on aligned data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-23",
"text": "However, the disjoint datasets in their paper are artificial because they were formed by randomly splitting the Multi30K dataset into two halves."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-24",
"text": "We examine whether the ranking model can benefit from multilingual supervision when it is trained using disjoint datasets drawn from different sources."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-25",
"text": "In experiments with the Multi30K and COCO datasets, we find substantial benefits from training with these disjoint sources, but the best performance comes from training on aligned datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-26",
"text": "Given the empirical benefits of training on aligned datasets, we explore two approaches to creating synthetically aligned training data in the disjoint scenario."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-27",
"text": "One approach to creating synthetically aligned data is to use an off-the-shelf machine translation system to generate new image-caption pairs by translating the original captions."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-28",
"text": "This approach is very simple, but has the limitation that an external system needs to be trained, which requires additional data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-29",
"text": "The second approach is to generate synthetically aligned data that are pseudopairs."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-30",
"text": "We assume the existence of image-caption datasets in different languages where the images do not overlap between the datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-31",
"text": "Pseudopairs are created by annotating the images of one dataset with the captions from another dataset."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-32",
"text": "This can be achieved by leveraging the sentence similarities predicted by an image-sentence ranking model trained on the original image-caption datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-33",
"text": "One advantage of this approach is that it does not require additional models or datasets because it uses the trained model to create new pairs."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-34",
"text": "The resulting pseudopairs can then be used to re-train or fine-tune the original model."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-35",
"text": "In experiments on the Multi30K and COCO datasets, we find that using an external machine translation system to create the synthetic data improves image-sentence ranking performance by 26.1% compared to training on only the disjoint data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-36",
"text": "The proposed pseudopair approach consistently improves performance compared to the disjoint baseline by 6.4%, and, crucially, this improvement is achieved without using any external datasets or pre-trained models."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-37",
"text": "We expect that there is a broad scope for more complex pseudopairing methods in future work in this direction."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-39",
"text": "**METHOD**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-40",
"text": "We adopt the model architecture and training procedure of K\u00e1d\u00e1r et al. (2018) for the task of matching images with sentences."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-41",
"text": "This task is defined as learning to rank the sentences associated with an image higher than other sentences in the data set, and vice-versa (Hodosh et al., 2013) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-42",
"text": "The model is comprised of a recurrent neural network language model and a convolutional neural network image encoder."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-43",
"text": "The parameters of the language encoder are randomly initialized, while the image encoder is pre-trained, frozen during training and followed by a linear layer which is tuned for the task."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-44",
"text": "The model is trained to make true pairs <a, b> similar to each other, and contrastive pairs <\u00e2, b> and <a, b\u0302> dissimilar from each other in a joint embedding space, by minimizing the max-violation loss function (Faghri et al., 2017):"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-45",
"text": "In our experiments, the <a, b> pairs are either image-caption pairs <i, c> or caption-caption pairs <c_a, c_b> (following Gella et al. (2017); K\u00e1d\u00e1r et al. (2018))."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-46",
"text": "When we train on <i, c> pairs, we sample a batch from an image-caption data set with uniform probability, encode the images and the sentences, and perform an update of the model parameters."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-47",
"text": "For the caption-caption objective, we follow K\u00e1d\u00e1r et al. (2018) and generate a sentence pair data set by taking all pairs of sentences that belong to the same image and are written in different languages: 5 English and 5 German captions result in 25 English-German pairs."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-48",
"text": "The sentences are encoded and we perform an update of the model parameters using the same loss."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-49",
"text": "When training with both the image-caption and caption-caption (c2c) ranking tasks, we randomly select the task to perform with probability p=0.5."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-51",
"text": "**GENERATING SYNTHETIC PAIRS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-52",
"text": "We propose two approaches to creating synthetic image-caption pairs to improve image-sentence ranking models when training with disjoint data sets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-53",
"text": "We assume the existence of datasets D_1: <I_1, C_1> and D_2: <I_2, C_2> consisting of image-caption pairs <i^1_i, c^1_i> and <i^2_i, c^2_i> in languages 1 and 2, where the image sets do not overlap: I_1 \u2229 I_2 = \u2205. We seek to extend <I_2, C_2> to a bilingual dataset with synthetic captions \u0109^1_i \u2208 \u0108^1 in language 1, resulting in a triplet data set <I_2, \u0108^1, C_2> consisting of triplets <i^2_i, \u0109^1_i, c^2_i>."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-54",
"text": "We hypothesize that the new dataset will improve model performance because it will be trained to map the images to captions in both languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-56",
"text": "**PSEUDOPAIRS APPROACH**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-57",
"text": "Given two image-caption corpora <I_1, C_1> and <I_2, C_2> with pairs <i^1_i, c^1_i> and <i^2_i, c^2_i>, we generate a pseudopair corpus by labeling each image in I_2 with a caption from C_1."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-58",
"text": "We create pseudopairs only in one direction, leading to new image-caption pairs <i^2, \u0109^1>."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-59",
"text": "The pseudopairs are generated using the sentence representations of the model trained on both corpora <I_1, C_1> and <I_2, C_2> jointly."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-60",
"text": "We encode all captions c^1_i \u2208 C_1 and c^2_i \u2208 C_2, and for each c^2_i we find the most similar caption \u0109^1_i using the cosine similarity between the sentence representations."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-61",
"text": "This leads to pairs <c^2_i, \u0109^1_i> and, as a result, to triplets <i^2_i, c^2_i, \u0109^1_i>."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-62",
"text": "Filtering Optionally, we filter the resulting pseudopair set \u0108^1 with three filtering strategies, in an attempt to avoid misleading samples. Fine-tuning vs. restart After the pseudopairs are generated, we consider two options: re-training the model from scratch on all previous data sets plus the generated pseudopairs, or fine-tuning on the same data sets plus the additional pseudopairs."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-64",
"text": "**TRANSLATION APPROACH**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-65",
"text": "Given a corpus <I_2, C_2> with pairs <i^2_i, c^2_i>, we use a machine translation system to translate each caption c^2_i to language 1, leading to new image-caption pairs <i^2_i, \u0109^1_i>."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-66",
"text": "Any off-the-shelf translation system could be used to create the translated captions, e.g. an online service or a pre-trained translation model."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-71",
"text": "Our implementation, training protocol and parameter settings are based on the existing codebase of K\u00e1d\u00e1r et al. (2018) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-73",
"text": "In all experiments, we use the 2048-dimensional image features extracted from the last average-pooling layer of a pre-trained ResNet50 CNN (He et al., 2016)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-74",
"text": "The image representation used in our model is obtained by a single affine transformation W_I \u2208 R^{2048\u00d71024} that we train from scratch."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-75",
"text": "For the sentence encoder we use a uni-directional Gated Recurrent Unit (GRU) network (Cho et al., 2014) with a single hidden layer with 1024 hidden units and 300 dimensional word embeddings."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-76",
"text": "When training bilingual models we use a single word embedding for the same word-forms, making no distinction if they come from different languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-77",
"text": "Each sentence is represented by the final hidden state of the GRU."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-78",
"text": "For the similarity function in the loss function (Eq. 1) we use cosine similarity with a margin parameter of \u03b1 = 0.2."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-79",
"text": "In all experiments, we early stop on the validation set when no improvement is observed for 10 inspections, which are performed every 500 updates."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-80",
"text": "The stopping criterion is the sum of text-to-image (T\u2192I) and image-to-text (I\u2192T) recall scores at ranks 1, 5 and 10 across all languages in the training data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-81",
"text": "The models are trained with a batch size of 128 with the Adam optimizer (Kingma and Ba, 2014) using default parameters and an initial learning rate of 2e-4, without applying any learning-rate decay schedule."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-82",
"text": "We apply gradient norm clipping with a value of 2.0."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-83",
"text": "We use a pre-trained OpenNMT (Klein et al., 2018) English-German machine translation model to create the data for the translation approach described in Section 2.3."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-85",
"text": "**DATASETS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-86",
"text": "The models are trained and evaluated on the bilingual English-German Multi30K dataset (M30K), and we optionally train on the English COCO dataset (Chen et al., 2015) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-87",
"text": "In monolingual experiments, the model is trained on a single language from M30K or COCO."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-88",
"text": "This gives 82,783 training, 5,000 validation, and 5,000 test images; each image is paired with five captions."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-89",
"text": "The data set has an additional split containing the 30,504 images from the original validation set of MS-COCO (\"restval\"), which we add to the training set as in previous work (Karpathy and Fei-Fei, 2015; Vendrov et al., 2016; Faghri et al., 2017) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-91",
"text": "**EVALUATION**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-92",
"text": "We report results on Multimodal Translation Shared Task 2016 test split of M30K."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-180",
"text": "In the second example, both captions imply that the man sits on the tree not beside it."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-93",
"text": "Due to space constraints, we only report recall at 1 (R@1) for Image-to-Text (I\u2192T) and Text-to-Image (T\u2192I) retrieval, and the sum of R@1, R@5, and R@10 recall scores across both tasks and languages (Sum)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-96",
"text": "**BASELINE RESULTS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-97",
"text": "The experiments presented here set the baseline performance for the visually grounded bilingual models and introduce the data settings that we will use in the later sections."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-98",
"text": "Aligned In these experiments we only use the aligned English-German data from M30K."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-99",
"text": "Tables 1 and 2 present the result for English and German, respectively."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-100",
"text": "The Sum-of-recall scores for both languages show that the best approach is the bilingual model with the c2c loss (En+De+c2c, and De+En+c2c)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-101",
"text": "These results reproduce the findings of K\u00e1d\u00e1r et al. (2018) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-102",
"text": "Disjoint We now determine the performance of the model when it is trained on data drawn from different data sets with no overlapping images."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-103",
"text": "This is the criterion we use for early-stopping."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-104",
"text": "First we train two English monolingual models: one on the M30K English dataset and one on the English COCO dataset."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-105",
"text": "Both models are evaluated on image-sentence ranking performance on the M30K English test 2016 set."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-106",
"text": "The results in Table 1 show that there is a substantial difference in performance in both text-to-image and image-to-text retrieval, depending on whether the model is trained on the M30K or the COCO dataset."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-107",
"text": "The final row of Table 1 shows, however, that jointly training on both data sets improves over only using the M30K English training data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-108",
"text": "We also conduct experiments in the bilingual disjoint setting, where we study whether it is possible to improve the performance of a German model using the out-of-domain English COCO data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-109",
"text": "Table 2 shows that there is an increase in performance when the model is trained on the disjoint sets, as opposed to only the in-domain M30K German (compare De against De+COCO)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-110",
"text": "This result is not too surprising as we have observed both the advantage of joint training on both languages in the aligned setting and the overlap between the different datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-111",
"text": "Finally, we compare the performance of a German model trained in the aligned and disjoint settings."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-112",
"text": "We find that a model trained in the aligned setting (De+En) is better than a model trained in the disjoint setting (De+COCO), as shown in Table 2 ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-113",
"text": "This finding contradicts the conclusion of K\u00e1d\u00e1r et al. (2018) , who claimed that the aligned and disjoint conditions lead to comparable performance."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-114",
"text": "This is most likely because the disjoint setting in K\u00e1d\u00e1r et al. (2018) is artificial, in the sense that they used different 50% subsets of M30K."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-115",
"text": "In our experiments the disjoint image-caption sets are real, in the sense that we trained the models on the two different datasets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-116",
"text": "Aligned plus disjoint Our final baseline experiments explore the combination of disjoint and aligned data settings."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-117",
"text": "We train an English-German bilingual model with the c2c objective on M30K, and we also train on the English COCO data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-118",
"text": "Table 3 shows that adding the disjoint data improves performance for both English and German compared to training solely on the aligned data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-119",
"text": "Summary First, we reproduced the findings of K\u00e1d\u00e1r et al. (2018), showing that bilingual joint training improves over monolingual training and that using the c2c loss further improves performance."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-120",
"text": "Furthermore, we have found that adding COCO as additional training data improves performance, both when training only on German and when training on both German and English from M30K, even though the model is then trained on data drawn from a different dataset."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-122",
"text": "**TRAINING WITH PSEUDOPAIRS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-123",
"text": "In this section we turn our attention to creating a synthetic English-German aligned data set from the English COCO using the pseudopair method (Section 2.2)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-124",
"text": "The synthetic data set is used to train an image-sentence ranking model either from scratch or by fine-tuning the original model; in addition, we explore the effect of using all of the pseudopairs or filtering them."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-125",
"text": "We hypothesise that training a model with the additional pseudopairs will improve over the aligned plus disjoint baseline."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-126",
"text": "Disjoint We generate pseudopairs using the disjoint bilingual model trained on the German M30K and the English COCO."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-127",
"text": "Table 4 reports the results when evaluating on the M30K German data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-128",
"text": "Line 2 shows that using the full pseudopair set and re-training the model does not lead to noticeable improvements."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-129",
"text": "However, line 3 shows that performance increases when we train with all pseudopairs and fine-tune the original disjoint bilingual model."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-130",
"text": "Filtering the pseudopairs at either the 25th or 75th percentile is detrimental to the final performance."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-132",
"text": "Aligned plus disjoint We generate pseudopairs using a model trained on M30K English-German data with the c2c objective and the English COCO data set."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-133",
"text": "The results for both English and German are reported in Table 5 ; note that when we train with the pseudopairs we also train with the c2c loss on both data sets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-134",
"text": "Overall we find that pseudopairs improve performance, however, we do not achieve the best results for English and German in the same conditions."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-135",
"text": "The best results for German are to filter at 25% percentile and apply fine-tuning, while for English the best results are without filtering or fine-tuning."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-181",
"text": "This shows that even if the datasets are similar, transferring a caption that exactly matches the picture is difficult."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-136",
"text": "The best overall model is trained with all the pseudopairs with fine-tuning, according to the Sum of the Sum-of-recall scores across both English and German."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-137",
"text": "The performance across both data sets is increased from 723.5 to 728.2 using the pseudopair method."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-138",
"text": "Summary In both aligned plus disjoint and disjoint scenarios, the additional pseudopairs improve performance, and in both cases the overall best performance is achieved when applying the fine-tuning strategy and no filtering of the samples."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-140",
"text": "**TRAINING WITH TRANSLATIONS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-141",
"text": "We now focus on our second approach to creating an English-German aligned dataset using the translation method described in Section 2.1."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-142",
"text": "Disjoint We first report the results of disjoint bilingual model trained on the German M30K, the English COCO data, and the translated German COCO in Table 6 ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-143",
"text": "The results show that retrieval performance is improved when the model is trained on the translated German COCO data in addition to the English COCO data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-144",
"text": "We find the best performance when we jointly train on the M30K German, the Translated German COCO and the English COCO with the additional c2c objective over the COCO datasets (+c2c)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-145",
"text": "We note that this setup leads to a better model, as measured by the sum-ofrecall-scores, than training on the aligned M30K data (compare De+COCO+Translation+c2c in Ta-ble 6 to De+En+c2c in Table 2 )."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-147",
"text": "**ALIGNED PLUS DISJOINT**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-148",
"text": "In these experiments, we train models with the aligned M30K data, the disjoint English COCO data, and the translated German COCO data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-149",
"text": "Table 5 presents the results for the English and German evaluation."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-150",
"text": "We find that training on the German Translated COCO data and using the c2c loss over the COCO data results in improvements for both languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-151",
"text": "Summary In both the disjoint and aligned plus disjoint settings, we find that training with the translations of COCO improves performance over training with only the English COCO data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-153",
"text": "**DISCUSSION**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-155",
"text": "**SENTENCE-SIMILARITY QUALITY**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-156",
"text": "The core of the proposed pseudopairing method is based on measuring the similarity between sentences, but how well does our model encode similar sentences?"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-157",
"text": "Here we analyze the ability of our models to identify translation equivalent sentences using the English-German translation pairs in the Multi30K test 2016 data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-179",
"text": "The first example demonstrates the difference between the Multi30K and COCO datasets: there are no giraffes in the former, but there are dogs (\"Hund\")."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-158",
"text": "This experiment proceeds as follows: (i) we assume a pre-trained imagesentence ranking model, (ii) we encode the German and English sentences using the language encoder of the model, (iii) we calculate the model's performance on the task of ranking the correct translation for English sentences, given the German caption, and vice-versa."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-159",
"text": "To put our results into perspective we compare to the best approach to our knowledge as reported by Rotman et al. (2018) canonical correlation analysis method maximizing the canonical correlation between captions of the same image conditioned on image representations as a third view."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-160",
"text": "Table 7 reports the results of this experiment."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-161",
"text": "Our models consistently improve upon the state-of-the-art."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-162",
"text": "The baseline aligned model trained on the Multi30K data slightly outperforms the DPCCA for EN \u2192 DE retrieval, and more substantially outperforms DPCCA for DE \u2192 EN."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-163",
"text": "If we train the same model with the additional c2c objective, R@1 improves by 8.0 and 12.1 points, respectively."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-164",
"text": "We find that adding more monolingual English data from the external COCO data set slightly degrades retrieval performance, and that performing sentence retrieval using a model trained on the disjoint M30K German and English COCO data sets result in much lower retrieval performance."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-165",
"text": "We conclude that the model that we used to estimate sentence similarity is the bestperforming method known for this task on this data set, but there is room for improvement for models trained on disjoint data sets."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-166",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-167",
"text": "**CHARACTERISTICS OF THE PSEUDOPAIRS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-168",
"text": "We now investigate the properties of the pseudopairs generated by our method."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-169",
"text": "In particular, we focus on pseudpairs generated by an aligned plus disjoint model (En+De+COCO+c2c) and a disjoint model (De+COCO)."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-170",
"text": "The pseudopairs generated by the aligned plus disjoint model cover 40% of the German captions in the M30K data set, and overall, the pseudopairs form a heavy-tailed distribution."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-171",
"text": "We find a similar pattern for the pseudopairs generated by the disjoint model: the pseudopairs cover 37% of the M30K data set, and the top 150 captions cover 23% of the data."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-172",
"text": "This is far from using each caption equally in the pseudopair transfer, and may suggest a hubness problem (Dinu et al., 2014) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-173",
"text": "We assessed the stability of the sets of transferred captions using the Jaccard measure in two cases: (i) different random seeds, and (ii) disjoint or aligned plus disjoint."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-174",
"text": "For the aligned plus disjoint model, we observe an overlap of 0.53 between different random seeds compared to 0.51 for the disjoint model."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-175",
"text": "The overlap between the two types of models is much lower at 0.41."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-176",
"text": "Finally, we find that when a caption is transferred by both models, the overlap of the caption annotating the same COCO image is 0.33 for the disjoint model, and 0.34 for the aligned plus disjoint model, and the overlap between the models is 0.16."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-177",
"text": "This shows that the models do not transfer the same captions for the same images."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-178",
"text": "Figure 1 presents examples of the annotations transferred using the pseudopair method."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-182",
"text": "The final two examples show semantically accurate and similar sentences are transferred by both models."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-183",
"text": "In the fourth example, both models transfer exactly the same caption."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-185",
"text": "**RELATED WORK**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-186",
"text": "Image-sentence ranking is the task of retrieving the sentences that best describe an image, and vice-versa (Hodosh et al., 2013) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-187",
"text": "Most recent approaches are based on learning to project image representations and sentence representations into a shared space using deep neural networks (Frome et al., 2013; Socher et al., 2014; Vendrov et al., 2016; Faghri et al., 2017, inter-alia) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-188",
"text": "More recently, there has been a focus on solving this task using multilingual data (Gella et al., 2017; K\u00e1d\u00e1r et al., 2018) in the Multi30K dataset ; an extension of the popular Flickr30K dataset into German, French, and Czech."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-189",
"text": "These works take a multi-view learning perspective in which images and their descriptions in multiple languages are different views of the same concepts."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-190",
"text": "The assumption is that common representations of multiple languages and perceptual stimuli can potentially exploit complementary information between views to learn better representations."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-191",
"text": "For example, Rotman et al. (2018) improves bilingual sentence representations by in- Ein jet jagt steil in die luft, viel rauch kommt aus dem rumpf."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-193",
"text": "Ein jet jagt steil in die luft, viel rauch kommt aus dem rumpf."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-194",
"text": "[A jet goes steep up into the air, a lot of smoke is coming out of its hull.] Figure 1 : Visualisation of the sentences transferred from Multi30K to the COCO data set using the pseudopair method."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-195",
"text": "1 is transferred from a model trained on De+COCO, whereas 2 is transferred from En+De+COCO."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-196",
"text": "[English glosses of the sentences are included for ease of reading.] corporating image information as a third view by Deep Partial Canonical Correlation Analysis."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-197",
"text": "More similar to our work Gella et al. (2017) , propose a convolutional-recurrent architecture with both an image-caption and caption-caption loss to learn bilingual visually grounded representations."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-198",
"text": "Their results were improved by the approach presented in K\u00e1d\u00e1r et al. (2018) , who has also shown that the multilingual models outperform bilingual models, and that image-caption retrieval performance in languages with less resources can be improved with data from higher-resource languages."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-199",
"text": "We largely follow K\u00e1d\u00e1r et al. (2018) , however, our main interest lies in learning multimodal and bilingual representations in the scenario where the images do not come from the same data set i.e.: the data is presented is two sets of image-caption tuples rather than image-caption-caption triples."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-200",
"text": "Taking a broader perspective, images have been used as pivots in multilingual multimodal language processing."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-201",
"text": "On the word level this intuition is applied to visually grounded bilingual lexicon induction, which aims to learn cross-lingual word representations without aligned text using images as pivots (Bergsma and Van Durme, 2011; Kiela et al., 2015; Vuli\u0107 et al., 2016; Hartmann and S\u00f8gaard, 2017; Hewitt et al., 2018) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-202",
"text": "Images have been used as pivots to learn translation models only from image-caption data sets, without parallel text (Hitschler et al., 2016; Nakayama and Nishida, 2017; Lee et al., 2017; Chen et al., 2018) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-203",
"text": "----------------------------------"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-204",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-205",
"text": "Previous work has demonstrated improved imagesentence ranking performance when training models jointly on multiple languages (Gella et al., 2017; K\u00e1d\u00e1r et al., 2018) ."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-206",
"text": "Here we presented a study on learning multimodal and multilingual representations in the disjoint setting, where images between languages do not overlap."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-207",
"text": "We found that learning visually grounded sentence embeddings in this setting is more challenging."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-208",
"text": "To close the gap, we developed a pseudopairing technique that creates synthetic pairs by annotating the images from one of the data sets with the image descriptions of the other using the sentence similarities of the model trained on both."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-209",
"text": "We showed that training with pseudopairs improves performance, without the need to augment training from additional data sources or other pipeline components."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-210",
"text": "However, our technique is outperformed by creating synthetic pairs using an off-the-shelf automatic machine translation system."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-211",
"text": "As such our results suggest that it is better to use translation, when a good translation system is available, however, in its absence, pseudopairs offer consistent improvements."
},
{
"sent_id": "7895613ddd09696bbee4143c4359b0-C001-212",
"text": "We have found that our pseudopairing method only transfers annotations from a small number of images and in the future we plan to substitute our naive matching algorithms with approaches developed to mitigate this hubness issue (Radovanovi\u0107 et al., 2010) and to close the gap between translation and pseudopairs."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7895613ddd09696bbee4143c4359b0-C001-15",
"7895613ddd09696bbee4143c4359b0-C001-16",
"7895613ddd09696bbee4143c4359b0-C001-18"
],
[
"7895613ddd09696bbee4143c4359b0-C001-22",
"7895613ddd09696bbee4143c4359b0-C001-23"
],
[
"7895613ddd09696bbee4143c4359b0-C001-40",
"7895613ddd09696bbee4143c4359b0-C001-41",
"7895613ddd09696bbee4143c4359b0-C001-42",
"7895613ddd09696bbee4143c4359b0-C001-43",
"7895613ddd09696bbee4143c4359b0-C001-44"
],
[
"7895613ddd09696bbee4143c4359b0-C001-45"
],
[
"7895613ddd09696bbee4143c4359b0-C001-112",
"7895613ddd09696bbee4143c4359b0-C001-113",
"7895613ddd09696bbee4143c4359b0-C001-114",
"7895613ddd09696bbee4143c4359b0-C001-115"
],
[
"7895613ddd09696bbee4143c4359b0-C001-119"
],
[
"7895613ddd09696bbee4143c4359b0-C001-188",
"7895613ddd09696bbee4143c4359b0-C001-189",
"7895613ddd09696bbee4143c4359b0-C001-190"
],
[
"7895613ddd09696bbee4143c4359b0-C001-198"
],
[
"7895613ddd09696bbee4143c4359b0-C001-199",
"7895613ddd09696bbee4143c4359b0-C001-200",
"7895613ddd09696bbee4143c4359b0-C001-201",
"7895613ddd09696bbee4143c4359b0-C001-202"
],
[
"7895613ddd09696bbee4143c4359b0-C001-205"
]
],
"cite_sentences": [
"7895613ddd09696bbee4143c4359b0-C001-15",
"7895613ddd09696bbee4143c4359b0-C001-40",
"7895613ddd09696bbee4143c4359b0-C001-45",
"7895613ddd09696bbee4143c4359b0-C001-113",
"7895613ddd09696bbee4143c4359b0-C001-114",
"7895613ddd09696bbee4143c4359b0-C001-119",
"7895613ddd09696bbee4143c4359b0-C001-188",
"7895613ddd09696bbee4143c4359b0-C001-198",
"7895613ddd09696bbee4143c4359b0-C001-199",
"7895613ddd09696bbee4143c4359b0-C001-205"
]
},
"@MOT@": {
"gold_contexts": [
[
"7895613ddd09696bbee4143c4359b0-C001-15",
"7895613ddd09696bbee4143c4359b0-C001-16",
"7895613ddd09696bbee4143c4359b0-C001-18"
],
[
"7895613ddd09696bbee4143c4359b0-C001-22",
"7895613ddd09696bbee4143c4359b0-C001-23"
]
],
"cite_sentences": [
"7895613ddd09696bbee4143c4359b0-C001-15"
]
},
"@USE@": {
"gold_contexts": [
[
"7895613ddd09696bbee4143c4359b0-C001-40",
"7895613ddd09696bbee4143c4359b0-C001-41",
"7895613ddd09696bbee4143c4359b0-C001-42",
"7895613ddd09696bbee4143c4359b0-C001-43",
"7895613ddd09696bbee4143c4359b0-C001-44"
],
[
"7895613ddd09696bbee4143c4359b0-C001-47",
"7895613ddd09696bbee4143c4359b0-C001-48",
"7895613ddd09696bbee4143c4359b0-C001-49"
],
[
"7895613ddd09696bbee4143c4359b0-C001-71"
]
],
"cite_sentences": [
"7895613ddd09696bbee4143c4359b0-C001-40",
"7895613ddd09696bbee4143c4359b0-C001-47",
"7895613ddd09696bbee4143c4359b0-C001-71"
]
},
"@SIM@": {
"gold_contexts": [
[
"7895613ddd09696bbee4143c4359b0-C001-101"
]
],
"cite_sentences": [
"7895613ddd09696bbee4143c4359b0-C001-101"
]
},
"@DIF@": {
"gold_contexts": [
[
"7895613ddd09696bbee4143c4359b0-C001-112",
"7895613ddd09696bbee4143c4359b0-C001-113",
"7895613ddd09696bbee4143c4359b0-C001-114",
"7895613ddd09696bbee4143c4359b0-C001-115"
],
[
"7895613ddd09696bbee4143c4359b0-C001-119"
],
[
"7895613ddd09696bbee4143c4359b0-C001-199",
"7895613ddd09696bbee4143c4359b0-C001-200",
"7895613ddd09696bbee4143c4359b0-C001-201",
"7895613ddd09696bbee4143c4359b0-C001-202"
]
],
"cite_sentences": [
"7895613ddd09696bbee4143c4359b0-C001-113",
"7895613ddd09696bbee4143c4359b0-C001-114",
"7895613ddd09696bbee4143c4359b0-C001-119",
"7895613ddd09696bbee4143c4359b0-C001-199"
]
}
}
},
"ABC_d3672a2d7129beef6703598f1558c4_6": {
"x": [
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-2",
"text": "Breaking down the structure of long texts into semantically coherent segments makes the texts more readable and supports downstream applications like summarization and retrieval."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-3",
"text": "Starting from an apparent link between text coherence and segmentation, we introduce a novel supervised model for text segmentation with simple but explicit coherence modeling."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-4",
"text": "Our model -a neural architecture consisting of two hierarchically connected Transformer networks -is a multi-task learning model that couples the sentence-level segmentation objective with the coherence objective that differentiates correct sequences of sentences from corrupt ones."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-5",
"text": "The proposed model, dubbed Coherence-Aware Text Segmentation (CATS), yields state-of-the-art segmentation performance on a collection of benchmark datasets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-6",
"text": "Furthermore, by coupling CATS with cross-lingual word embeddings, we demonstrate its effectiveness in zero-shot language transfer: it can successfully segment texts in languages unseen in training."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-9",
"text": "Natural language texts are, more often than not, a result of a deliberate cognitive effort of an author and as such consist of semantically coherent segments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-10",
"text": "Text segmentation deals with automatically breaking down the structure of text into such topically contiguous segments, i.e., it aims to identify the points of topic shift (Hearst 1994; Choi 2000; Brants, Chen, and Tsochantaridis 2002; Riedl and Biemann 2012; Du, Buntine, and Johnson 2013; Glava\u0161, Nanni, and Ponzetto 2016; Koshorek et al. 2018) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-11",
"text": "Reliable segmentation results with texts that are more readable for humans, but also facilitates downstream tasks like automated text summarization (Angheluta, De Busser, and Moens 2002; Bokaei, Sameti, and Liu 2016) , passage retrieval (Huang et al. 2003; Shtekh et al. 2018) , topical classification (Zirn et al. 2016) , or dialog modeling (Manuvinakurike et al. 2016; Zhao and Kawahara 2017) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-12",
"text": "Text coherence is inherently tied to text segmentationintuitively, the text within a segment is expected to be more coherent than the text spanning different segments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-13",
"text": "Consider, e.g., the text in Figure 1 , with two topical segments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-14",
"text": "Snippets T 1 and T 2 are more coherent than T 3 and T 4 : all T 1 sentences relate to Amsterdam's history, and all T 2 sentences to Amsterdam's geography; in contrast, T 3 and T 4 contain sentences Amsterdam is younger than Dutch cities such as Nijmegen, Rotterdam, and Utrecht."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-15",
"text": "Amsterdam was granted city rights in either 1300 or 1306."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-16",
"text": "In the 14th century Amsterdam flourished because of trade with the Hanseatic League."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-17",
"text": "Amsterdam is located in the Western Netherlands."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-18",
"text": "The river Amstel ends in the city centre and connects to numerous canals."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-19",
"text": "Amsterdam is about 2 metres (6.6 feet) below sea level."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-20",
"text": "T1 T2 T3 T4 Figure 1 : Snippet illustrating the relation (i.e., dependency) between text coherence and segmentation."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-21",
"text": "from both topics."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-22",
"text": "T 1 and T 2 being more coherent than T 3 and T 4 signals that the fourth sentence starts a new segment."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-23",
"text": "Given this duality between text segmentation and coherence, it is surprising that the methods for text segmentation capture coherence only implicitly."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-24",
"text": "Unsupervised segmentation models rely either on probabilistic topic modeling (Brants, Chen, and Tsochantaridis 2002; Riedl and Biemann 2012; Du, Buntine, and Johnson 2013) or semantic similarity between sentences (Glava\u0161, Nanni, and Ponzetto 2016) , both of which only indirectly relate to text coherence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-25",
"text": "Similarly, a recently proposed state-of-the-art supervised neural segmentation model (Koshorek et al. 2018 ) directly learns to predict binary sentence-level segmentation decisions and has no explicit mechanism for modeling coherence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-26",
"text": "In this work, in contrast, we propose a supervised neural model for text segmentation that explicitly takes coherence into account: we augment the segmentation prediction objective with an auxiliary coherence modeling objective."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-27",
"text": "Our proposed model, dubbed Coherence-Aware Text Segmentation (CATS), encodes a sentence sequence using two hierarchically connected Transformer networks (Vaswani et al. 2017; Devlin et al. 2018 )."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-28",
"text": "Similar to (Koshorek et al. 2018) , CATS' main learning objective is a binary sentence-level segmentation prediction."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-29",
"text": "However, CATS augments the segmentation objective with an auxiliary coherence-based objec-tive which pushes the model to predict higher coherence for original text snippets than for corrupt (i.e., fake) sentence sequences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-30",
"text": "We empirically show (1) that even without the auxiliary coherence objective, the Two-Level Transformer model for Text Segmentation (TLT-TS) yields state-of-the-art performance across multiple benchmarks, (2) that the full CATS model, with the auxiliary coherence modeling, further significantly improves the segmentation, and (3) that both TLT-TS and CATS are robust in domain transfer."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-31",
"text": "Furthermore, we demonstrate models' effectiveness in zero-shot language transfer."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-32",
"text": "Coupled with a cross-lingual word embedding space, 1 our models trained on English Wikipedia successfully segment texts from unseen languages, outperforming the best-performing unsupervised segmentation model (Glava\u0161, Nanni, and Ponzetto 2016) by a wide margin."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-33",
"text": "CATS: Coherence-Aware Two-Level Transformer for Text Segmentation Figure 2 illustrates the high-level architecture of the CATS model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-34",
"text": "A snippet of text -a sequence of sentences of fixed length -is an input to the model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-35",
"text": "Token encodings are a concatenation of a pretrained word embedding and a positional embedding."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-36",
"text": "Sentences are first encoded from their tokens with a token-level Transformer (Vaswani et al. 2017 )."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-37",
"text": "Next, we feed the sequence of obtained sentence representations to the second, sentence-level Transformer."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-38",
"text": "Transformed (i.e., contextualized) sentence representations are next fed to the feed-forward segmentation classifier, which makes a binary segmentation prediction for each sentence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-39",
"text": "We additionally feed the encoding of the whole snippet (i.e., the sentence sequence) to the coherence regressor (a feed-forward net), which predicts a coherence score."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-40",
"text": "In what follows, we describe each component in more detail."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-42",
"text": "**TRANSFORMER-BASED SEGMENTATION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-43",
"text": "The segmentation decision for a sentence clearly does not depend only on its content but also on its context, i.e., information from neighboring sentences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-44",
"text": "In this work, we employ the encoding stack of the attention-based Transformer architecture (Vaswani et al. 2017 ) to contextualize both token representations in a sentence and, more importantly, sentence representations within the snippet."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-45",
"text": "We choose Transfomer encoders because (1) they have recently been reported to outperform recurrent encoders on a range of NLP tasks (Devlin et al. 2018; Radford et al. 2018; Shaw, Uszkoreit, and Vaswani 2018) and (2) they are faster to train than recurrent nets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-46",
"text": "Sentence Encoding."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-47",
"text": "Let S = {S 1 , S 2 , . . . , S K } denote a single training instance -a snippet consisting of K sentences and let each sentence S i = {t i 1 , t i 2 , . . . , t i T } be a fixed-size sequence of T tokens."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-48",
"text": "2 Following (Devlin et al. 2018) , we prepend each sentence S i with a special sentence start token 1 See (Ruder, S\u00f8gaard, and Vuli\u0107 2018; Glava\u0161 et al. 2019 ) for a comprehensive overview of methods for inducing cross-lingual word embeddings."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-49",
"text": "2 We trim/pad sentences longer/shorter than T tokens."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-50",
"text": "Figure 2 : High-level depiction of the Coherence-Aware Text Segmentation (CATS) model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-51",
"text": ", aiming to use the transformed representation of that token as the sentence encoding."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-52",
"text": "3 We encode each token t i j (i \u2208 {1, . . . , K}, j \u2208 {0, 1, . . . , T }) with a vector t i j which is the concatenation of a d e -dimensional word embedding and a d p -dimensional embedding of the position j. We use pretrained word embeddings and fix them in training; we learn positional embeddings as model's parameters."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-53",
"text": "Let Transform T denote the encoder stack of the Transformer model (Vaswani et al. 2017) , consisting of N T T layers, each coupling a multi-head attention net with a feed-forward net."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-54",
"text": "4 We then apply Transform T to the token sequence of each snippet sentence:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-55",
"text": "The sentence encoding is then the transformed vector of the sentence start token [ss]:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-56",
"text": "produced with Transform T only capture the content of the sentence itself, but not its context."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-57",
"text": "We thus employ a second, sentence-level Transformer Transform S (with N T S layers) to produce context-informed sentence representations."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-58",
"text": "We prepend each sequence of non-contextualized sentence embeddings {s i } K i=1 with a fixed embedding s 0 , denoting the snippet start token , in order to capture the encoding of the whole snippet (i.e., sequence of K sentences) as the transformed embedding of the token:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-59",
"text": "with the transformed vector ss 0 being the encoding of the whole snippet S."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-60",
"text": "Segmentation Classification."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-61",
"text": "Finally, contextualized sentence vectors ss i go into the segmentation classifier, a singlelayer feed-forward net coupled with softmax function:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-62",
"text": "with W seg \u2208 R (de+dp)\u00d72 and b seg \u2208 R 2 as classifier's parameters."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-63",
"text": "Let y i \u2208 {[0, 1], [1, 0]} be the true segmentation label of the i-th sentence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-64",
"text": "The segmentation loss J seg is then the simple negative log-likelihood over all sentences of all N snippets in the training batch:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-66",
"text": "**AUXILIARY COHERENCE MODELING**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-67",
"text": "Given the obvious dependency between segmentation and coherence, we pair the segmentation task with an auxiliary task of predicting snippet coherence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-68",
"text": "To this effect, we couple each true snippet S from the original text with a corrupt (i.e., incoherent) snippet S, created by (1) randomly shuffling the order of sentences in S and (2) randomly replacing sentences from S, with other document sentences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-69",
"text": "Let (S, S) be a pair of a true snippet and its corrupt counterpart, and (ss 0 , ss 0 ) their respective encodings, obtained with the Two-Level Transformer."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-70",
"text": "The encodings of the correct snippet (ss 0 ) and the scrambled snippet (ss 0 ) are then presented to the coherence regressor, which independently generates a coherence score for each of them."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-71",
"text": "The scalar output of the coherence regressor is:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-72",
"text": "with w c \u2208 R de+dp and b c \u2208 R as regressor's parameters."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-73",
"text": "We then jointly softmax-normalize the scores for S and S:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-74",
"text": "We want to force the model to produce higher coherence score for the correct snippet S than for its corrupt counterpart S. We thus define the following contrastive margin-based coherence objective:"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-75",
"text": "where \u03b4 coh is the margin by which we would like coh(S) to be larger than coh(S)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-77",
"text": "**CREATING TRAINING INSTANCES**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-78",
"text": "Our presumed training corpus contains documents that are generally longer than the snippet size K and annotated for segmentation at the sentence level."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-79",
"text": "We create training instances by sliding a sentence window of size K over documents' sentences with a stride of K/2."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-80",
"text": "For the sake of auxiliary coherence modeling, for each original snippet S, we create its corrupt counterpart S with the following corruption procedure: (1) we first randomly shuffle the order of sentences in S; (2) for p 1 percent of snippets (random selection) we additionally replace sentences of the shuffled snippet (with the probability p 2 ) with randomly chosen sentences from other, non-overlapping document snippets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-82",
"text": "**INFERENCE**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-83",
"text": "At inference time, given a long document, we need to make a binary segmentation decision for each sentence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-84",
"text": "Our model, however, does not take individual sentences as input, but rather sequences of K sentences (i.e., snippets) and makes in-context segmentation prediction for each sentence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-85",
"text": "Since we can create multiple different sequences of K consecutive sentences that contain some sentence S, 5 our model can obtain multiple segmentation predictions for the same sentence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-86",
"text": "As we do not know apriori which of the snippets containing the sentence S is the most reliable with respect to the segmentation prediction for S, we consider all possible snippets containing S. In other words, at inference time, unlike in training, we create snippets by sliding the window of K sentences over the document with the stride of 1. Let S = {S 1 , S 2 , . . . , S K } be the set of (at most) K different snippets containing a sentence S. We then average the segmentation probabilities predicted for the sentence S over all snippets in S: 6"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-87",
"text": "Finally, we predict that S starts a new segment if P seg (S) > \u03c4 , where \u03c4 is the confidence threshold, tuned as a hyperparameter of the model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-89",
"text": "**CROSS-LINGUAL ZERO-SHOT TRANSFER**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-90",
"text": "Models that do not require any language-specific features other than pretrained word embeddings as input can (at least conceptually) be easily transferred to another language by means of a cross-lingual word embedding space (Ruder, S\u00f8gaard, and Vuli\u0107 2018; Glava\u0161 et al. 2019) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-91",
"text": "Let X L1 be the monolingual embedding space of the source language (most often English), which we use in training and let X L2 be the independently trained embedding space of the target language to which we want to transfer the segmentation model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-92",
"text": "To transfer the model, we need to project target-language vectors from X L2 to the sourcelanguage space X L1 ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-93",
"text": "There is a plethora of recently proposed methods for inducing projection-based cross-lingual embeddings (Faruqui and Dyer 2014; Smith et al. 2017; Artetxe, Labaka, and Agirre 2018; Vuli\u0107 et al. 2019, inter alia) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-94",
"text": "We opt for the supervised alignment model based on solving the Procrustes problem (Smith et al. 2017), due to its simplicity and competitive performance in zero-shot language transfer of NLP models ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-95",
"text": "Given a limited-size word translation training dictionary D, we obtain the linear projection matrix W L2\u2192L1 between X L2 and X L1 as follows: W L2\u2192L1 = UV ; U\u03a3V = SVD(X S X T ); (9) with X S \u2282 X L1 and X T \u2282 X L2 as subsets of monolingual spaces that align vectors from training translations pairs from D. Once we obtain W L2\u2192L1 , the language transfer of the segmentation model is straightforward: we input the embeddings of L2 words from the projected space"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-97",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-98",
"text": "We first describe datasets used for training and evaluation and then provide the details on the comparative evaluation setup and model optimization."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-100",
"text": "**DATA**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-101",
"text": "WIKI-727K Corpus."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-102",
"text": "Koshorek et al. (2018) leveraged the manual structuring of Wikipedia pages into sections to automatically create a large segmentation-annotated corpus."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-103",
"text": "WIKI-727K consists of 727,746 documents created from English (EN) Wikipedia pages, divided into training (80%), development (10%), and test portions (10%)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-104",
"text": "We train, optimize, and evaluate our models on respective portions of the WIKI-727K dataset."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-105",
"text": "Standard Test Corpora."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-106",
"text": "Koshorek et al. (2018) additionally created a small evaluation set WIKI-50 to allow for comparative evaluation against unsupervised segmentation models, e.g., the GRAPHSEG model of Glava\u0161, Nanni, and Ponzetto (2016) , for which evaluation on large datasets is prohibitively slow."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-107",
"text": "For years, the synthetic dataset of Choi (2000) was used as a standard becnhmark for text segmentation models."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-108",
"text": "CHOI dataset contains 920 documents, each of which is a concatenation of 10 paragraphs randomly sampled from the Brown corpus."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-109",
"text": "CHOI dataset is divided into subsets containing only documents with specific variability of segment lengths (e.g., segments with 3-5 or with 9-11 sentences)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-110",
"text": "7 Finally, we evaluate the performance of our models on two small datasets, CITIES and ELEMENTS, created by Chen et al. (2009) from Wikipedia pages dedicated to the cities of the world and chemical elements, respectively."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-111",
"text": "Other Languages."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-112",
"text": "In order to test the performance of our Transformer-based models in zero-shot language transfer setup, we prepared small evaluation datasets in other languages."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-113",
"text": "Analogous to the WIKI-50 dataset created by Koshorek et al. (2018) from English (EN) Wikipedia, we created WIKI-50-CS, WIKI-50-FI, and WIKI-50-TR datasets consisting of 50 randomly selected pages from Czech (CS), Finnish (FI), and Turkish (TR) Wikipedia, respectively."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-114",
"text": "8"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-116",
"text": "**COMPARATIVE EVALUATION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-117",
"text": "Evaluation Metric."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-118",
"text": "Following previous work (Riedl and Biemann 2012; Glava\u0161, Nanni, and Ponzetto 2016; Koshorek et al. 2018) , we also adopt the standard text segmentation measure P k (Beeferman, Berger, and Lafferty 1999) as our evaluation metric."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-119",
"text": "P k score is the probability that a model makes a wrong prediction as to whether the first and last sentence of a randomly sampled snippet of k sentences belong to the same segment (i.e., the probability of the model predicting the same segment for the sentences from different segment or different segments for the sentences from the same segment)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-120",
"text": "Following (Glava\u0161, Nanni, and Ponzetto 2016; Koshorek et al. 2018) , we set k to the half of the average ground truth segment size of the dataset."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-121",
"text": "Baseline Models."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-122",
"text": "We compare CATS against the state-ofthe-art neural segmentation model of Koshorek et al. (2018) and against GRAPHSEG (Glava\u0161, Nanni, and Ponzetto 2016) , the state-of-the-art unsupervised text segmentation model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-123",
"text": "Additionally, as a sanity check, we evaluate the RANDOM baseline -it assigns a positive segmentation label to a sentence with the probability that corresponds to the ratio of the total number of segments (according to the gold segmentation) and total number of sentences in the dataset."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-125",
"text": "**MODEL CONFIGURATION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-126",
"text": "Model Variants."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-127",
"text": "We evaluate two variants of our two-level transformer text segmentation model: with and without the auxiliary coherence modeling."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-128",
"text": "The first model, TLT-TS, minimizes only the segmentation objective J seg ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-129",
"text": "CATS, our second model, is a multi-task learning model that alternately minimizes the segmentation objective J seg and the coherence objective J coh ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-130",
"text": "We adopt a balanced alternate training regime for CATS in which a single parameter update based on the minimization of J seg is followed by a single parameter update based on the optimization of J coh ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-131",
"text": "Word Embeddings."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-132",
"text": "In all our experiments we use 300dimensional monolingual FASTTEXT word embeddings pretrained on the Common Crawl corpora of respective languages: EN, CS, FI, and TR."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-133",
"text": "9 We induce a cross-lingual word embedding space, needed for the zero-shot language transfer experiments, by projecting CS, FI, and TR monolingual embedding spaces to the EN embedding space."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-134",
"text": "Following (Smith et al. 2017; Glava\u0161 et al. 2019) , we create training dictionaries D for learning projection matrices by machine translating 5,000 most frequent EN words to CS, FI, and TR."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-135",
"text": "Model Optimization."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-136",
"text": "We optimize all hyperparameters, including the data preparation parameters like the snippet size K, via cross-validation on the development portion of the Wiki-727K dataset."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-137",
"text": "We found the following configuration to lead to robust 10 performance for both TLT-TS and CATS: (1) training instance preparation: snippet size of K = 16 sentences with T = 50 tokens; scrambling probabilities p 1 = p 2 = 0.5; (2) configuration of Transformers: N T T = N T S = 6 layers and with 4 attention heads per layer in both transformers; 11 (3) other model hyperparameters: positional embedding size of d p = 10; coherence objective contrastive margin of \u03b4 coh = 1."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-138",
"text": "We found different optimal inference thresholds: \u03c4 = 0.5 for the segmentation-only TLT-TS model and \u03c4 = 0.3 for the coherence-aware CATS model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-139",
"text": "We trained both TLT-TS and CATS in batches of N = 32 snippets (each with K = 16 sentences), using the Adam optimization algorithm (Kingma and Ba 2014) with the initial learning rate set to 10 \u22124 ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-141",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-142",
"text": "We first present and discuss the results that our models, TLT-TS and CATS, yield on the previously introduced EN evaluation datasets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-143",
"text": "We then report and analyze models' performance in the cross-lingual zero-shot transfer experiments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-144",
"text": "9 https://tinyurl.com/y6j4gh9a 10 Given the large hyperparameter space and large training set, we only searched over a limited-size grid of hyperparameter configurations."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-145",
"text": "It is thus likely that a better-performing configuration than the one reported can be found with a more extensive grid search."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-146",
"text": "11 We do not tune other transformer hyperparameters, but rather adopt the recommended values from (Vaswani et al. 2017) : filter size of 1024 and dropout probabilities of 0.1 for both attention layers and feed-forward ReLu layers."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-147",
"text": "Table 1 shows models' performance on five EN evaluation datasets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-148",
"text": "Both our Transformer-based models -TLT-TS and CATS -outperform the competing supervised model of Koshorek et al. (2018) , a hierarchical encoder based on recurrent components, across the board."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-149",
"text": "The improved performance that TLT-TS has with respect to the model of Koshorek et al. (2018) is consistent with improvements that Transformer-based architectures yield in comparison with models based on recurrent components in other NLP tasks (Vaswani et al. 2017; Devlin et al. 2018) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-150",
"text": "The gap in performance is particularly wide (>20 P k points) for the EL-EMENTS dataset."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-151",
"text": "Evaluation on the ELEMENTS test set is, arguably, closest to a true domain-transfer setting: 12 while the train portion of the WIKI-727K set contains pages similar in type to those found in WIKI-50 and CITIES test sets, it does not contain any Wikipedia pages about chemical elements (all such pages are in the ELEMENTS test set)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-152",
"text": "This would suggest that TLT-TS and CATS offer more robust domain transfer than the recurrent model of Koshorek et al. (2018) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-154",
"text": "**BASE EVALUATION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-155",
"text": "CATS significantly 13 and consistently outperforms TLT-TS."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-156",
"text": "This empirically confirms the usefulness of explicit coherence modeling for text segmentation."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-157",
"text": "Moreover, Koshorek et al. (2018) report human performance on the WIKI-50 dataset of 14.97, which is a mere one P k point better than the performance of our coherence-aware CATS model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-158",
"text": "The unsupervised GRAPHSEG model of Glava\u0161, Nanni, and Ponzetto (2016) seems to outperform all supervised models on the synthetic CHOI dataset."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-159",
"text": "We believe that this is primarily because (1) by being synthetic, the CHOI dataset can be accurately segmented based on simple lexical overlaps and word embedding similarities (and GRAPHSEG relies on similarities between averaged word embeddings) and because (2) by being trained on a much more challenging real-world WIKI-727K dataset -on which lexical overlap is insufficient for accurate segmentation -supervised models learn to segment based on deeper natural language understanding (and learn not to encode lexical overlap as reliable segmentation signal)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-160",
"text": "Additionally, GRAPHSEG is evaluated separately on each subset of the CHOI dataset, for each of which it is provided the (gold) minimal segment size, which further facilitates and improves its predicted segmentations."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-162",
"text": "**ZERO-SHOT CROSS-LINGUAL TRANSFER**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-163",
"text": "In Table 2 we show the results of our zero-shot cross-lingual transfer experiments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-164",
"text": "In this setting, we use our Transformerbased models, trained on the English WIKI-727K dataset, to segment texts from the WIKI-50-X (X \u2208 {CS, FI, TR}) datasets in other languages."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-165",
"text": "As a baseline, we additionally evaluate GRAPHSEG (Glava\u0161, Nanni, and Ponzetto 2016) , as a language-agnostic model requiring only pretrained word embeddings of the test language as input."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-166",
"text": "Both our Transformer-based models, TLT-TS and CATS, outperform the unsupervised GRAPHSEG model (which seems to be only marginally better than the random baseline) by a wide margin."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-167",
"text": "The coherence-aware CATS model is again significantly better (p < 0.01 for FI and p < 0.05 for CS and TR) than the TLT-TS model which was trained to optimize only the segmentation objective."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-168",
"text": "While the results on the WIKI-50-{CS, FI, TR} datasets are not directly comparable to the results reported on the EN WIKI-50 (see Table 1 ) because the datasets in different languages do not contain mutually comparable Wikipedia pages, results in Table 2 still suggest that the drop in performance due to the cross-lingual transfer is not big."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-169",
"text": "This is quite encouraging as it suggests that it is possible to, via the zero-shot language transfer, rather reliably segment texts from under-resourced languages lacking sufficiently large gold-segmented data needed to directly train language-specific segmentation models (that is, robust neural segmentation models in particular)."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-171",
"text": "**RELATED WORK**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-172",
"text": "In this work we address the task of text segmentation -we thus provide a detailed account of existing segmentation models."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-173",
"text": "Because our CATS model has an auxiliary coherencebased objective, we additionally provide a brief overview of research on modeling text coherence."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-175",
"text": "**TEXT SEGMENTATION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-176",
"text": "Text segmentation tasks come in two main flavors: (1) linear (i.e., sequential) text segmentation and (2) hierarchical segmentation in which top-level segments are further broken down into sub-segments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-177",
"text": "While the hierarchical segmentation received a non-negligible research attention (Yaari 1997; Eisenstein 2009; Du, Buntine, and John-son 2013) , the vast majority of the proposed models (including this work) focus on linear segmentation (Hearst 1994; Beeferman, Berger, and Lafferty 1999; Choi 2000; Brants, Chen, and Tsochantaridis 2002; Misra et al. 2009; Riedl and Biemann 2012; Glava\u0161, Nanni, and Ponzetto 2016; Koshorek et al. 2018, inter alia) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-178",
"text": "In one of the pioneering segmentation efforts, Hearst (1994) proposed an unsupervised TextTiling algorithm based on the lexical overlap between adjacent sentences and paragraphs."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-179",
"text": "Choi (2000) computes the similarities between sentences in a similar fashion, but renormalizes them within the local context; the segments are then obtained through divisive clustering."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-180",
"text": "Utiyama and Isahara (2001) and Fragkou, Petridis, and Kehagias (2004) minimize the segmentation cost via exhaustive search with dynamic programming."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-181",
"text": "Following the assumption that topical cohesion guides the segmentation of the text, a number of segmentation approaches based on topic models have been proposed."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-182",
"text": "Brants, Chen, and Tsochantaridis (2002) induce latent representations of text snippets using probabilistic latent semantic analysis (Hofmann 1999) and segment based on similarities between latent representations of adjacent snippets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-183",
"text": "Misra et al. (2009) and Riedl and Biemann (2012) leverage topic vectors of snippets obtained with the Latent Dirichlet Allocation model (Blei, Ng, and Jordan 2003) ."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-184",
"text": "While Misra et al. (2009) finds a globally optimal segmentation based on the similarities of snippets' topic vectors using dynamic programming, Riedl and Biemann (2012) adjust the TextTiling model of (Hearst 1994) to use topic vectors instead of sparse lexicalized representations of snippets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-185",
"text": "Malioutov and Barzilay (2006) proposed a first graphbased model for text segmentation."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-186",
"text": "They segment lecture transcripts by first inducing a fully connected sentence graph with edge weights corresponding to cosine similarities between sparse bag-of-word sentence vectors and then running a minimum normalized multiway cut algorithm to obtain the segments."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-187",
"text": "Glava\u0161, Nanni, and Ponzetto (2016) propose GRAPHSEG, a graph-based segmentation algorithm similar in nature to (Malioutov and Barzilay 2006) , which uses dense sentence vectors, obtained by aggregating word embeddings, to compute intra-sentence similarities and performs segmentation based on the cliques of the similarity graph."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-188",
"text": "Finally, Koshorek et al. (2018) identify Wikipedia as a free large-scale source of manually segmented texts that can be used to train a supervised segmentation model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-189",
"text": "They train a neural model that hierarchically combines two bidirectional LSTM networks and report massive improvements over unsupervised segmentation on a range of evaluation datasets."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-190",
"text": "The model we presented in this work has a similar hierarchical architecture, but uses Transfomer networks instead of recurrent encoders."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-191",
"text": "Crucially, CATS additionally defines an auxiliary coherence objective, which is coupled with the (primary) segmentation objective in a multi-task learning model."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-193",
"text": "**TEXT COHERENCE**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-194",
"text": "Measuring text coherence amounts to predicting a score that indicates how meaningful the order of the information in the text is."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-195",
"text": "The majority of the proposed text coherence models are grounded in formal theories of text coherence, among which the entity grid model (Barzilay and Lapata 2008) , based on the centering theory of Grosz, Weinstein, and Joshi (1995) , is arguably the most popular."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-196",
"text": "The entity grid model represent texts as matrices encoding the grammatical roles that the same entities have in different sentences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-197",
"text": "The entity grid model, as well as its extensions (Elsner and Charniak 2011; Feng and Hirst 2012; Feng, Lin, and Hirst 2014; Nguyen and Joty 2017 ) require text to be preprocessedentities extracted and grammatical roles assigned to themwhich prohibits an end-to-end model training."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-198",
"text": "In contrast, Li and Hovy (2014) train a neural model that couples recurrent and recursive sentence encoders with a convolutional encoder of sentence sequences in an end-to-end fashion on limited-size datasets with gold coherence scores."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-199",
"text": "Our models' architecture is conceptually similar, but we use Transformer networks to both encode sentences and sentence sequences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-200",
"text": "With the goal of supporting text segmentation and not aiming to predict exact coherence scores, our model does not require gold coherence labels; instead we devise a coherence objective that contrasts original text snippets against corrupted sentence sequences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-202",
"text": "**CONCLUSION**"
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-203",
"text": "Though the segmentation of text depends on its (local) coherence, existing segmentation models capture coherence only implicitly via lexical or semantic overlap of (adjacent) sentences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-204",
"text": "In this work, we presented CATS, a novel supervised model for text segmentation that couples segmentation prediction with explicit auxiliary coherence modeling."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-205",
"text": "CATS is a neural architecture consisting of two hierarchically connected Transformer networks: the lower-level sentence encoder generates input for the higher-level encoder of sentence sequences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-206",
"text": "We train the model in a multi-task learning setup by learning to predict (1) sentence segmentation labels and (2) that original text snippets are more coherent than corrupt sentence sequences."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-207",
"text": "We show that CATS yields state-of-theart performance on several text segmentation benchmarks and that it can -in a zero-shot language transfer setting, coupled with a cross-lingual word embedding space -successfully segment texts from target languages unseen in training."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-208",
"text": "Although effective for text segmentation, our coherence modeling is still rather simple: we use only fully randomly shuffled sequences as examples of (highly) incoherent text."
},
{
"sent_id": "d3672a2d7129beef6703598f1558c4-C001-209",
"text": "In subsequent work, we will investigate negative instances of different degree of incoherence as well as more elaborate objectives for (auxiliary) modeling of text coherence."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-10"
],
[
"d3672a2d7129beef6703598f1558c4-C001-25",
"d3672a2d7129beef6703598f1558c4-C001-26",
"d3672a2d7129beef6703598f1558c4-C001-27"
],
[
"d3672a2d7129beef6703598f1558c4-C001-102",
"d3672a2d7129beef6703598f1558c4-C001-103",
"d3672a2d7129beef6703598f1558c4-C001-104"
],
[
"d3672a2d7129beef6703598f1558c4-C001-106"
],
[
"d3672a2d7129beef6703598f1558c4-C001-122"
],
[
"d3672a2d7129beef6703598f1558c4-C001-149"
],
[
"d3672a2d7129beef6703598f1558c4-C001-157"
],
[
"d3672a2d7129beef6703598f1558c4-C001-177"
],
[
"d3672a2d7129beef6703598f1558c4-C001-188",
"d3672a2d7129beef6703598f1558c4-C001-189",
"d3672a2d7129beef6703598f1558c4-C001-190",
"d3672a2d7129beef6703598f1558c4-C001-191"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-10",
"d3672a2d7129beef6703598f1558c4-C001-25",
"d3672a2d7129beef6703598f1558c4-C001-122",
"d3672a2d7129beef6703598f1558c4-C001-149",
"d3672a2d7129beef6703598f1558c4-C001-157",
"d3672a2d7129beef6703598f1558c4-C001-188"
]
},
"@DIF@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-25",
"d3672a2d7129beef6703598f1558c4-C001-26",
"d3672a2d7129beef6703598f1558c4-C001-27"
],
[
"d3672a2d7129beef6703598f1558c4-C001-28",
"d3672a2d7129beef6703598f1558c4-C001-29",
"d3672a2d7129beef6703598f1558c4-C001-30"
],
[
"d3672a2d7129beef6703598f1558c4-C001-188",
"d3672a2d7129beef6703598f1558c4-C001-189",
"d3672a2d7129beef6703598f1558c4-C001-190",
"d3672a2d7129beef6703598f1558c4-C001-191"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-25",
"d3672a2d7129beef6703598f1558c4-C001-28",
"d3672a2d7129beef6703598f1558c4-C001-188"
]
},
"@SIM@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-28",
"d3672a2d7129beef6703598f1558c4-C001-29",
"d3672a2d7129beef6703598f1558c4-C001-30"
],
[
"d3672a2d7129beef6703598f1558c4-C001-113"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-28",
"d3672a2d7129beef6703598f1558c4-C001-113"
]
},
"@USE@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-102",
"d3672a2d7129beef6703598f1558c4-C001-103",
"d3672a2d7129beef6703598f1558c4-C001-104"
],
[
"d3672a2d7129beef6703598f1558c4-C001-118",
"d3672a2d7129beef6703598f1558c4-C001-119"
],
[
"d3672a2d7129beef6703598f1558c4-C001-120"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-118",
"d3672a2d7129beef6703598f1558c4-C001-120"
]
},
"@EXT@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-113"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-113"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-148"
],
[
"d3672a2d7129beef6703598f1558c4-C001-152"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-148",
"d3672a2d7129beef6703598f1558c4-C001-152"
]
},
"@MOT@": {
"gold_contexts": [
[
"d3672a2d7129beef6703598f1558c4-C001-188",
"d3672a2d7129beef6703598f1558c4-C001-189",
"d3672a2d7129beef6703598f1558c4-C001-190",
"d3672a2d7129beef6703598f1558c4-C001-191"
]
],
"cite_sentences": [
"d3672a2d7129beef6703598f1558c4-C001-188"
]
}
}
},
"ABC_5b17eb75600820a80b3573bf74c427_6": {
"x": [
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-2",
"text": "We describe a process for automatically detecting decision-making sub-dialogues in multi-party, human-human meetings in real-time."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-3",
"text": "Our basic approach to decision detection involves distinguishing between different utterance types based on the roles that they play in the formulation of a decision."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-4",
"text": "In this paper, we describe how this approach can be implemented in real-time, and show that the resulting system's performance compares well with other detectors, including an off-line version."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-7",
"text": "In collaborative and organized work environments, people share information and make decisions through multi-party conversations, commonly referred to as meetings."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-8",
"text": "The demand for automatic methods that process, understand and summarize information contained in audio and video recordings of meetings is growing rapidly, as evidenced by on-going projects which are focused on this goal, (Waibel et al., 2003; Janin et al., 2004) ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-9",
"text": "Our research is part of a general effort to develop a system that can automatically extract and summarize information such as conversational topics, action items, and decisions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-10",
"text": "This paper concerns the development of a realtime decision detector -a system which can detect and summarize decisions as they are made during a meeting."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-11",
"text": "Such a system could provide a summary of all of the decisions which have been made up until the current point in the meeting, and this is something which we expect will help users to enjoy more productive meetings."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-12",
"text": "Certainly, good decision-making relies on access to relevant information, and decisions made earlier in a meeting often have a bearing on the current topic of discussion, and so form part of this relevant information."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-13",
"text": "However, in a long and winding meeting, participants might not have these earlier decisions at the forefront of their minds, and so an accurate and succinct reminder, as provided by a real-time decision detector, could potentially be very useful."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-14",
"text": "A record of earlier decisions could also help users to identify outstanding issues for discussion, and to therefore make better use of the remainder of the meeting."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-15",
"text": "Our approach to decision detection uses an annotation scheme which distinguishes between different types of utterance based on the roles which they play in the decision-making process."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-16",
"text": "Such a scheme facilitates the detection of decision discussions (Fern\u00e1ndez et al., 2008) , and by indicating which utterances contain particular types of information, it also aids their summarization."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-17",
"text": "To automatically detect decision discussions, we use what we refer to as hierarchical classification."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-18",
"text": "Here, independent binary sub-classifiers detect the different decision dialogue acts, and then based on the sub-classifier hypotheses, a super-classifier determines which dialogue regions are decision discussions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-19",
"text": "In this paper then, we address the challenges for applying this approach in real-time, and produce a system which is able to detect decisions soon after they are made, (for example within a minute)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-20",
"text": "We conduct tests and compare this system's performance with other detectors, including an off-line equivalent."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-21",
"text": "The remainder of the paper proceeds as follows."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-22",
"text": "Section 2 describes related work, and Section 3 describes our annotation scheme for decision discussions, and our experimental data."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-23",
"text": "Next, Section 4 explains the hierarchical classification approach in more detail, and Section 5 considers how it can be applied in real-time."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-24",
"text": "Section 6 describes the experiments in which we test the real-time detector, and finally, Section 7 presents conclusions and ideas for future work."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-26",
"text": "**RELATED WORK**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-27",
"text": "Decisions are one of the most important meeting outputs."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-28",
"text": "User studies (Lisowska et al., 2004; Banerjee et al., 2005) have confirmed that meeting participants consider this to be the case, and Whittaker et al. (2006) found that the development of an automatic decision detection component is critical to the re-use of meeting archives."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-29",
"text": "As a result, with the new availability of substantial meeting corpora such as the ISL (Burger et al., 2002) , ICSI (Janin et al., 2004) and AMI Meeting Corpora, recent years have seen an increasing amount of research on decision-making dialogue."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-30",
"text": "This recent research has tackled issues such as the automatic detection of agreement and disagreement (Hillard et al., 2003; Galley et al., 2004) , and of the level of involvement of conversational participants (Wrede and Shriberg, 2003; Gatica-Perez et al., 2005) ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-31",
"text": "In addition, Verbree et al. (2006) created an argumentation scheme intended to support automatic production of argument structure diagrams from decision-oriented meeting transcripts."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-32",
"text": "Only very recent research has specifically investigated the automatic detection of decisions, namely (Hsueh and Moore, 2007) and (Fern\u00e1ndez et al., 2008) ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-33",
"text": "Hsueh and Moore (2007) used the AMI Meeting Corpus, and attempted to automatically identify dialogue acts (DAs) in meeting transcripts which are \"decision-related\"."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-34",
"text": "Within any meeting, the authors decided which DAs were decision-related based on two different kinds of manually created summary: the first was an extractive summary of the whole meeting, and the second, an abstractive summary of the decisions which were made."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-35",
"text": "Those DAs in the extractive summary which support any of the decisions in the abstractive summary were manually tagged as decision-related."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-36",
"text": "Hsueh and Moore (2007) then trained a Maximum Entropy classifier to recognize this single DA class, using a variety of lexical, prosodic, dialogue act and conversational topic features."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-37",
"text": "They achieved an F-score of 0.35, which gives an indication of the difficulty of this task."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-38",
"text": "Unlike Hsueh and Moore (2007), Fern\u00e1ndez et al. (2008) made an attempt at modelling the structure of decision-making dialogue."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-39",
"text": "They designed an annotation scheme that takes account of the different roles which different utterances play in the decision-making process -for example, their scheme distinguishes between decision DAs (DDAs) which initiate a discussion by raising a topic/issue, those which propose a resolution, and those which express agreement for a proposed resolution and cause it to be accepted as a decision."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-40",
"text": "The authors applied the annotation scheme to a portion of the AMI corpus, and then took what they refer to as a hierarchical classification approach in order to automatically identify decision discussions and their component DAs."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-41",
"text": "Here, one binary Support Vector Machine (SVM) per DDA class hypothesized occurrences of that DDA class, and then based on the hypotheses of these socalled sub-classifiers, a super-classifier, (a further SVM), determined which regions of dialogue represented decision discussions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-42",
"text": "This approach produced better results than the kind of \"flat classification\" approach pursued by Hsueh and Moore (2007) where a single classifier looks for examples of a single decision-related DA class."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-43",
"text": "Using manual transcripts, and a variety of lexical, utterance, speaker, DA and prosodic features for the sub-classifiers, the super-classifier's F1-score was 0.58 according to a lenient match metric."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-44",
"text": "Note that (Purver et al., 2007) had previously pursued the same basic approach as Fern\u00e1ndez et al. (2008) in order to detect action items."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-45",
"text": "While both Hsueh and Moore (2007), and Fern\u00e1ndez et al. (2008) attempted off-line decision detection, in this paper, we attempt real-time decision detection."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-46",
"text": "We take the same basic approach as Fern\u00e1ndez et al. (2008) , and make changes to its implementation so that it can work effectively in real-time."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-48",
"text": "**DATA**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-49",
"text": "The AMI corpus , is a freely available corpus of multi-party meetings containing both audio and video recordings, as well as a wide range of annotated information including dialogue acts and topic segmentation."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-50",
"text": "Conversations are all in English, but participants can include non-native English speakers."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-51",
"text": "All of the meetings in our sub-corpus last around 30 minutes, and are scenario-driven, wherein four participants play different roles in a company's design team: project manager, marketing expert, interface designer and industrial designer."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-52",
"text": "The discussions concern how to design a remote control."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-53",
"text": "We used the off-line version of the Decipher speech recognition engine (Stolcke et al., 2008) in order to obtain off-line ASR transcripts for these 17 meetings, and the real-time version, to obtain real-time ASR transcripts."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-54",
"text": "Decipher generates the transcripts by first producing Word Confusion Networks (WCNs) and then extracting their best paths."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-55",
"text": "The real-time recognizer generates \"live\" transcripts with 5 to 15 seconds of latency for immediate display."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-56",
"text": "In processing completed meetings, the off-line system makes seven recognition passes, including acoustic adaptation and language model rescoring, in about 4.2 times realtime (on a 4-score 2.6 GHz Opteron server)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-57",
"text": "In general usage with multi-party dialogue, the word error rate (WER) for the off-line version of Decipher is approximately 23%, and for the realtime version, approximately 35% 1 ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-58",
"text": "Stolcke et al. (2008) report a WER of 26.9% for the off-line version on AMI meetings."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-59",
"text": "The real-time ASR transcripts for the 17 meetings contain a total of 8440 utterances/dialogue acts, (around 496 per meeting), and the off-line ASR transcripts, 7495 utterances/dialogue acts, (around 441 per meeting)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-61",
"text": "**MODELLING DECISION DISCUSSIONS**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-62",
"text": "We use the same annotation scheme as (Fern\u00e1ndez et al., 2008 ) in order to model decision-making dialogue."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-63",
"text": "As stated in Section 2, this scheme distinguishes between a small number of dialogue act types based on the role which they perform in the formulation of a decision."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-64",
"text": "Recall that using this scheme in conjunction with hierarchical classification produced better decision detection than a \"flat classification\" approach with a single \"decision-related\" DA class."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-65",
"text": "Since it indicates which utterances contain particular types of information, such a scheme also aids the summarization of decision discussions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-66",
"text": "The annotation scheme (see Table 1 for a summary) is based on the observation that a decision discussion contains the following main structural components: (a) a topic or issue requiring resolution is raised, (b) one or more possible resolutions are considered, (c) a particular resolution is agreed upon, that is, it becomes the decision."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-67",
"text": "Hence the scheme distinguishes between three corresponding decision dialogue act (DDA) classes: Issue (I), Resolution (R), and Agreement (A)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-68",
"text": "Class R is further subdivided into Resolution Proposal (RP) and Resolution Restatement (RR)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-69",
"text": "Note that an utterance can be assigned to more than one of these DDA classes, and that within a decision discussion, more than one utterance may correspond to a particular DDA class."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-70",
"text": "Here we use the sample decision discussion below in 1 in order to provide examples of the different DDA types."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-71",
"text": "I utterances introduce the topic of the decision discussion, examples being \"Are we going to have a backup?\" and \"But would a backup really be necessary?\" On the other hand, R utterances specify the resolution which is ultimately adopted as the decision."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-72",
"text": "RP utterances propose this resolution (e.g. \"I think maybe we could just go for the kinetic energy. . . \"), while RR utterances close the discussion by confirming/summarizing the decision (e.g. \"Okay, fully kinetic energy\")."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-73",
"text": "Finally, A utterances agree with the proposed resolution, so causing it to be adopted as the decision, (e.g. \"Yeah\", \"Good\" and \"Okay\"."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-74",
"text": "( 1)"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-76",
"text": "**EXPERIMENTAL DATA FOR REAL-TIME DECISION DETECTION**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-77",
"text": "Originally, two individuals used the annotation scheme described above in order to annotate the manual transcripts of 9 and 10 meetings respectively."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-78",
"text": "The annotators overlapped on two meetings, and their kappa inter-annotator agreement ranged from 0.63 to 0.73 for the four DDA classes."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-79",
"text": "(2007) are part of the AMI corpus, and are for the manual transcriptions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-80",
"text": "The reader can find a comparison between these annotations and our own manual transcript annotations in (Fern\u00e1ndez et al., 2008) ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-81",
"text": "After obtaining the new off-line and real-time ASR transcripts, we transferred the DDA annotations from the manual transcripts."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-82",
"text": "In both sets of ASR transcripts, each meeting contains on average around 26 DAs tagged with one or more of the DDA sub-classes in Table 1 ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-83",
"text": "DDAs are thus very sparse, corresponding to only 5.3% of utterances in the real-time transcripts, and 6.0% in the offline."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-84",
"text": "In the real-time transcripts, Issue utterances make up less than 1.2% of the total number of utterances in a meeting, while Resolution utterances are around 1.6%: 1.2% are RP and less than 0.4% are RR on average."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-85",
"text": "Almost half of DDA utterances (slightly over 2.6% of all utterances on average) are tagged as belonging to class Agreement."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-86",
"text": "In the off-line transcripts, the percentages are fairly similar: 1.6% of utterances are Issue DDAs, 2.0% are RP, 0.5% are RR, and 2.4% are A."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-87",
"text": "We now move on to describe the hierarchical classification approach which we use to try to automatically detect decision sub-dialogues and their component DDAs."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-89",
"text": "**HIERARCHICAL CLASSIFICATION**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-90",
"text": "Hierarchical classification is designed to exploit the fact that within decision discussions, our DDAs can be expected to co-occur in particular types of patterns."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-91",
"text": "It involves two different types of classifier:"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-92",
"text": "1. Sub-classifier: One independent binary subclassifier per DDA class classifies each utterance."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-93",
"text": "2. Super-classifier: A sliding window shifts through the meeting one utterance at a time, and following each shift, a binary superclassifier determines whether the region of dialogue within the window is part of a decision discussion."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-94",
"text": "In our decision detectors, the sub-classifiers run in parallel in order to reduce processing time."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-95",
"text": "For each utterance, the sub-classifiers use features which are derived from the properties of that utterance in context."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-96",
"text": "On the other hand, the super-classifier's features are the hypothesized class labels and confidence scores for the utterances within the window."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-97",
"text": "In various experiments, we have found that a suitable size for the window, is the average length of a decision discussion in our data in utterances."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-98",
"text": "The super-classifier also \"corrects\" the sub-classifiers."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-99",
"text": "This means that if a DA is classified as positive by a sub-classifier, but does not fall within a region classified as part of a decision discussion by the super-classifier, then the sub-classifier's hypothesis is changed to negative."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-100",
"text": "We now move on to consider how this basic approach to decision detection can be implemented in a real-time system."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-102",
"text": "**DESIGN CONSIDERATIONS FOR OUR REAL-TIME SYSTEM**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-103",
"text": "A real-time decision detector should detect decisions as soon after they are made as possible."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-104",
"text": "It is for this reason that we have set our real-time detector to automatically run at frequent and regular intervals during a meeting."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-105",
"text": "An alternative would be to give the user (a meeting participant) responsibility for instructing the detector when to run."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-106",
"text": "However, a user may sometimes leave substantial gaps between giving run commands."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-107",
"text": "When this happens, the detector will have to process a large number of utterances in a single run, and so the user may wait some time before being presented with any results."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-108",
"text": "In addition, giving the user responsibility for instructing the detector when to run means burdening the user with an extra task to perform during the meeting, and this goes against the general philosophy behind the system's development."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-109",
"text": "The system is intended to be as unobtrusive as possible during the meeting, and to relieve users of tasks which distract their attention away from the current discussion (e.g. note-taking), not to create new tasks, however small."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-110",
"text": "Obviously, on the first occasion that the detector runs during a meeting, it can only process \"new\" (previously unprocessed) utterances, but on subsequent runs, it has the option to reprocess some number of \"old\" utterances (utterances which it has already processed in a previous run)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-111",
"text": "Certainly, the detector should reprocess some of the most recent old utterances because it is possible that a decision discussion straddles these utterances and new utterances."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-112",
"text": "However, the number of old utterances that are reprocessed should be limited."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-113",
"text": "If the meeting has lasted a while already, then the processing of a large portion of the earlier old utterances is likely to be redundant -it will simply produce the same results for these utterances as the previous run."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-114",
"text": "The fact that the real-time detector processes recent old utterances means that consecutive runs can produce hypotheses for decision discussion regions which overlap, or which are duplicates."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-115",
"text": "Figure 1 gives an example of the former."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-116",
"text": "We deal with overlapping hypotheses by merging them into one, so forming a larger single decision discussion region."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-117",
"text": "Figure 2 gives an example of duplicate hypotheses."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-118",
"text": "Here, on run n, the detector hypothesizes decision discussion D 1 to D 2 , and then on run n+1, since the bounds of this original hypothesis are now wholly contained within the region of old reprocessed utterances, the detector hypothesizes a duplicate."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-119",
"text": "We deal with such cases by discarding the duplicate."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-121",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-122",
"text": "We conducted various experiments related to realtime decision detection, our goal being to produce a system which:"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-123",
"text": "\u2022 relative to alternative versions, detects decision discussions accurately,"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-124",
"text": "\u2022 generates results for any portion of dialogue very soon after that portion of dialogue has ended."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-125",
"text": "The current version of our real-time detector is set to process the same number of old and new utterances on each run."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-126",
"text": "Here, we refer to this value as i, and hence on each run the system processes a total of 2i utterances (i old and i new)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-127",
"text": "Another of the system's characteristics is that runs take place every i utterances, meaning that as we decrease i, the system provides new results more frequently and is hence \"more real-time\"."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-128",
"text": "One of the things we investigate here then, is what to set i to in order to best satisfy the two design goals given above."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-129",
"text": "Having found this value, we compare the hierarchical real-time detector's performance with alternative detectors, these being:"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-130",
"text": "\u2022 an off-line detector applied to off-line ASR transcripts,"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-131",
"text": "\u2022 a flat real-time detector,"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-132",
"text": "\u2022 an off-line detector applied to the real-time ASR transcripts."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-133",
"text": "Note that the off-line detectors use hierarchical classification, and that the flat real-time detector uses a single binary classifier which treats all DDAs as members of a single merged DDA class."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-135",
"text": "**CLASSIFIERS AND FEATURES**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-136",
"text": "All classifiers (sub and super-classifiers) in all detectors are linear-kernel Support Vector Machines (SVMs), produced using SVMlight (Joachims, 1999) ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-137",
"text": "For the sub-classifiers, we are obviously restricted to using features which can be computed in a very short period of time, and in the experiments here, we use lexical, utterance and speaker features."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-138",
"text": "These are summarized in Table 2 ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-139",
"text": "An utterance's lexical features are the words in its transcription, its utterance features are its duration, number of words, and word rate (number of words divided by duration), and its speaker features are the speaker's role (see Section 3) and ID."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-140",
"text": "We also use lexical features for the previous and where available, next utterances: the I, RP and RR sub-classifiers use the lexical features for the previous/next utterance and the A sub-classifier, those from the previous/next 5 utterances."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-141",
"text": "These settings produced the best results in preliminary experiments."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-142",
"text": "We do not use DA features because we lack an automatic DA tagger, nor do we use prosodic features because (Fern\u00e1ndez et al., 2008) was unable to derive any value from them with SVMs."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-144",
"text": "**EVALUATION**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-145",
"text": "We evaluate each of our decision detectors in 17-fold cross validations, where in each fold, the detector trains on 16 meetings and then tests on the remaining one."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-146",
"text": "Evaluation can be made at three levels:"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-147",
"text": "1. The sub-classifiers' detection of each of the DDA classes."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-148",
"text": "2. The sub-classifiers' detection of each of the DDA classes after correction by the superclassifier."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-149",
"text": "3. The super-classifier's detection of decision discussion regions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-150",
"text": "For 1 and 2, we use the same lenient-match metric as (Fern\u00e1ndez et al., 2008; Hsueh and Moore, 2007) , which allows a margin of 20 seconds preceding and following a hypothesized DDA."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-151",
"text": "Note that here we only give credit for hypotheses based on a 1-1 mapping with the gold-standard labels."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-152",
"text": "For 3, we follow (Fern\u00e1ndez et al., 2008; Purver et al., 2007) and use a windowed metric that divides the dialogue into 30-second windows and evaluates on a per window basis."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-154",
"text": "**RESULTS AND ANALYSIS**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-155",
"text": "Here, Section 6.3.1 will present results for different values of i, the number of old/new utterances processed in a single run."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-156",
"text": "Section 6.3.2 then compares the performance of the real-time and off-line systems, (and also real-time systems which use hierarchical vs. flat classification), and Section 6.3.3 presents some feature analysis."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-157",
"text": "Figure 3 shows the relationship between i, the setting for the number of old/new utterances processed in a single run, and the super-classifier's F1-score."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-158",
"text": "Here, the sub-classifiers are using only lexical features."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-159",
"text": "We can see from the graph that as i increases to 15, the super-classifier's F1-score also increases, but thereafter, it plateaus."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-160",
"text": "Hence 15 is apparently the value which best satisfies the two design goals given at the start of Section 6."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-161",
"text": "It should also be noted that 15 is the mean length of a decision discussion in our data, and so per- .55 Table 4 : Results for the hierarchical off-line decision detector on off-line ASR transcripts, using lexical, utterance and speaker features."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-163",
"text": "**VARYING THE NUMBER OF OLD/NEW UTTERANCES PROCESSED IN A RUN**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-164",
"text": "haps this is a transferable finding."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-165",
"text": "The mean duration of a run when i = 15 is approximately 4 seconds, while the mean duration of 15 utterances in our data-set is approximately 60 seconds, meaning that for the average case, the detector returns the results for the current run, long before it is due to make the next."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-166",
"text": "Significant lee-way is perhaps necessary here, because the final version of the real-time detector will include a summarization component which extracts key phrases from Issue/Resolution utterances, and its processing can last some time, even for a single decision."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-167",
"text": "We should say then, that the system is not strictly real-time because in general, it detects decisions soon after they are made (for example within a minute), rather than immediately after."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-168",
"text": "In the future we intend to modify the system so that it can run more frequently than once every i utterances."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-169",
"text": "However it is important that runs do not occur too frequently -for example, if i = 15 and the system runs after every utterance, then the extra processing will cause it to gradually fall further and further behind the meeting."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-170",
"text": ".55 Table 6 : Results for the hierarchical real-time decision detector, using lexical features only."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-172",
"text": "**REAL-TIME VS. OFF-LINE RESULTS**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-173",
"text": "off-line detector, which are shown in Table 4 ."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-174",
"text": "The F1-scores for the real-time and off-line decision super-classifiers are .54 and .55 respectively, and the difference is not statistically significant."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-175",
"text": "This may indicate that the hierarchical classification approach is fairly robust to increasing ASR Word Error Rates (WERs)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-176",
"text": "Combining the output from each of the independent sub-classifiers might compensate somewhat for any decreases in their individual accuracy, as there was here for the I and RP sub-classifiers."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-177",
"text": "The hierarchical real-time detector's F1-score is also 10 points higher than a flat classifier (.54 vs. .44)."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-178",
"text": "Hence, while Fern\u00e1ndez et al. (2008) demonstrated that the hierarchical classification approach could improve off-line decision detection, we have demonstrated here that it can also improve realtime decision detection."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-179",
"text": "Table 5 shows the results when an off-line detector is applied to real-time ASR transcripts."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-180",
"text": "Here, the super-classifier obtains an F1-score of .55, one point higher than the real-time detector, but again, the difference is not statistically significant."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-182",
"text": "**FEATURE ANALYSIS**"
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-183",
"text": "We also investigated the contribution of the utterance and speaker features."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-184",
"text": "Table 6 shows the results for the hierarchical real-time decision detector when its sub-classifiers use only lexical features."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-185",
"text": "The sub-classifier F1-scores are all slightly lower than when utterance and speaker features are used (see Table 3 ), and the super-classifier score is only 1 point different."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-186",
"text": "None of these differences are statistically significant."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-187",
"text": "Since lexical features are important, we used information gain in order to investigate which words are predictive of each DDA type."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-188",
"text": "Due to differences in the transcripts, the predictive words for the off-line and real-time systems are not the same, but we can find commonalities, and these commonalities make sense given the DDA definitions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-189",
"text": "Firstly in Resolution and particularly Issue DAs, some of the most predictive words could be used to define discussion topics, and so we might expect to find them in the meeting agenda."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-190",
"text": "Examples are \"energy\", and \"color\"."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-191",
"text": "Predictive words for Resolutions also include semantically-related words which are key in defining the decision (\"kinetic\",\"green\")."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-192",
"text": "Additional predictive words for RPs are the personal pronouns \"I\" and \"we\", and the verbs, \"think\" and \"like\", and for RRs, words which we would associate with summing up (\"consensus\", \"definitely\", and \"okay\")."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-193",
"text": "Unsurprisingly, for Agreements, \"yeah\" and \"okay\" both score very highly."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-194",
"text": "7 Conclusion (Fern\u00e1ndez et al., 2008) described an approach to decision detection in multi-party meetings and demonstrated how it could work relatively well in an off-line system."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-195",
"text": "The approach has two defining characteristics."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-196",
"text": "The first is its use of an annotation scheme which distinguishes between different utterance types based on the roles which they play in the decision-making process."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-197",
"text": "The second is its use of hierarchical classification, whereby binary sub-classifiers detect instances of each of the decision DAs (DDAs), and then based on the sub-classifier hypotheses, a super-classifier determines which regions of dialogue are decision discussions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-198",
"text": "In this paper then, we have taken the same basic approach to decision detection as Fern\u00e1ndez et al. (2008) , but changed the way in which it is implemented so that it can work effectively in realtime."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-199",
"text": "Our implementation changes include running the detector at regular and frequent intervals during the meeting, and reprocessing recent utterances in case a decision discussion straddles these and brand new utterances."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-200",
"text": "The fact that the detector reprocesses utterances means that on consecutive runs, overlapping and duplicate hypothesized decision discussions are possible."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-201",
"text": "We have therefore added facilities to merge overlapping hypotheses and to remove duplicates."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-202",
"text": "In general, the resulting system is able to detect decisions soon after they are made (for example within a minute), rather than immediately after."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-203",
"text": "It has performed well in testing, achieving an F1-score of .54, which is only one point lower than an equivalent off-line system, and in any case, the difference was not statistically significant."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-204",
"text": "A flat real-time detector achieved .44."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-205",
"text": "In future work, we plan to extend the decision discussion annotation scheme and try to extract supporting arguments for decisions."
},
{
"sent_id": "5b17eb75600820a80b3573bf74c427-C001-206",
"text": "We will also experiment with using sequential models in order to try to exploit any sequential ordering patterns in the occurrence of the DDAs."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5b17eb75600820a80b3573bf74c427-C001-15",
"5b17eb75600820a80b3573bf74c427-C001-16"
],
[
"5b17eb75600820a80b3573bf74c427-C001-32"
],
[
"5b17eb75600820a80b3573bf74c427-C001-38",
"5b17eb75600820a80b3573bf74c427-C001-39",
"5b17eb75600820a80b3573bf74c427-C001-40",
"5b17eb75600820a80b3573bf74c427-C001-41",
"5b17eb75600820a80b3573bf74c427-C001-42",
"5b17eb75600820a80b3573bf74c427-C001-43"
],
[
"5b17eb75600820a80b3573bf74c427-C001-44"
],
[
"5b17eb75600820a80b3573bf74c427-C001-79",
"5b17eb75600820a80b3573bf74c427-C001-80"
],
[
"5b17eb75600820a80b3573bf74c427-C001-194",
"5b17eb75600820a80b3573bf74c427-C001-195",
"5b17eb75600820a80b3573bf74c427-C001-196",
"5b17eb75600820a80b3573bf74c427-C001-197"
],
[
"5b17eb75600820a80b3573bf74c427-C001-198",
"5b17eb75600820a80b3573bf74c427-C001-199",
"5b17eb75600820a80b3573bf74c427-C001-200",
"5b17eb75600820a80b3573bf74c427-C001-201"
]
],
"cite_sentences": [
"5b17eb75600820a80b3573bf74c427-C001-16",
"5b17eb75600820a80b3573bf74c427-C001-32",
"5b17eb75600820a80b3573bf74c427-C001-38",
"5b17eb75600820a80b3573bf74c427-C001-44",
"5b17eb75600820a80b3573bf74c427-C001-80",
"5b17eb75600820a80b3573bf74c427-C001-194",
"5b17eb75600820a80b3573bf74c427-C001-198"
]
},
"@DIF@": {
"gold_contexts": [
[
"5b17eb75600820a80b3573bf74c427-C001-45"
],
[
"5b17eb75600820a80b3573bf74c427-C001-46"
],
[
"5b17eb75600820a80b3573bf74c427-C001-177",
"5b17eb75600820a80b3573bf74c427-C001-178"
],
[
"5b17eb75600820a80b3573bf74c427-C001-198",
"5b17eb75600820a80b3573bf74c427-C001-199",
"5b17eb75600820a80b3573bf74c427-C001-200",
"5b17eb75600820a80b3573bf74c427-C001-201"
]
],
"cite_sentences": [
"5b17eb75600820a80b3573bf74c427-C001-45",
"5b17eb75600820a80b3573bf74c427-C001-46",
"5b17eb75600820a80b3573bf74c427-C001-178",
"5b17eb75600820a80b3573bf74c427-C001-198"
]
},
"@USE@": {
"gold_contexts": [
[
"5b17eb75600820a80b3573bf74c427-C001-62"
],
[
"5b17eb75600820a80b3573bf74c427-C001-150",
"5b17eb75600820a80b3573bf74c427-C001-151"
],
[
"5b17eb75600820a80b3573bf74c427-C001-152"
]
],
"cite_sentences": [
"5b17eb75600820a80b3573bf74c427-C001-62",
"5b17eb75600820a80b3573bf74c427-C001-150",
"5b17eb75600820a80b3573bf74c427-C001-152"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"5b17eb75600820a80b3573bf74c427-C001-142"
]
],
"cite_sentences": [
"5b17eb75600820a80b3573bf74c427-C001-142"
]
}
}
},
"ABC_dbb0178b572c2a451853737910ac86_7": {
"x": [
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-2",
"text": "Automatically validating a research artefact is one of the frontiers in Artificial Intelligence (AI) that directly brings it close to competing with human intellect and intuition."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-3",
"text": "Although criticized sometimes, the existing peer review system still stands as the benchmark of research validation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-4",
"text": "The present-day peer review process is not straightforward and demands profound domain knowledge, expertise, and intelligence of human reviewer(s), which is somewhat elusive with the current state of AI."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-5",
"text": "However, the peer review texts, which contains rich sentiment information of the reviewer, reflecting his/her overall attitude towards the research in the paper, could be a valuable entity to predict the acceptance or rejection of the manuscript under consideration."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-6",
"text": "Here in this work, we investigate the role of reviewers sentiments embedded within peer review texts to predict the peer review outcome."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-7",
"text": "Our proposed deep neural architecture takes into account three channels of information: the paper, the corresponding reviews, and the review polarity to predict the overall recommendation score as well as the final decision."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-8",
"text": "We achieve significant performance improvement over the baselines (\u223c 29% error reduction) proposed in a recently released dataset of peer reviews."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-9",
"text": "An AI of this kind could assist the editors/program chairs as an additional layer of confidence in the final decision making, especially when non-responding/missing reviewers are frequent in present day peer review."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-12",
"text": "The rapid increase in research article submissions across different venues is posing a significant management challenge for the journal editors and conference program chairs 1 ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-13",
"text": "Among the load of works like assigning reviewers, ensuring timely receipt of reviews, slot-filling against the non-responding reviewer, taking informed decisions, communicating to the authors, etc., editors/program chairs are usually overwhelmed with many such demanding yet crucial tasks."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-14",
"text": "However, the major hurdle lies in to decide the acceptance and rejection of the manuscripts based on the reviews received from the reviewers."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-15",
"text": "The quality, randomness, bias, inconsistencies in peer reviews is well-debated across the academic community (Bornmann and Daniel, 2010) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-16",
"text": "Due to the rise in article submissions and nonavailability of expert reviewers, editors/program chairs are sometimes left with no other options than to assign papers to the novice, out of domain reviewers which sometimes results in more inconsistencies and poor quality reviews."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-17",
"text": "To study the arbitrariness inherent in the existing peer review system, organisers of the NIPS 2014 conference assigned 10% submissions to two different sets of reviewers and observed that the two committees disagreed for more than quarter of the papers (Langford and Guzdial, 2015) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-18",
"text": "Again it is quite common that a paper rejected in one venue gets the cut in another with little or almost no improvement in quality."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-19",
"text": "Many are of the opinion that the existing peer review system is fragile as it only depends on the view of a selected few (Smith, 2006) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-20",
"text": "Moreover, even a preliminary study into the inners of the peer review system is itself very difficult because of data confidentiality and copyright issues of the publishers."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-21",
"text": "However, the silver lining is that the peer review system is evolving with the likes of OpenReviews 2 , author response periods/rebuttals, increased effective communications between authors and reviewers, open access initiatives, peer review workshops, review forms with objective questionnaires, etc."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-22",
"text": "gaining momentum."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-23",
"text": "The PeerRead dataset (Kang et al., 2018) is an excellent resource towards research and study on this very impactful and crucial problem."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-24",
"text": "With our ongoing effort towards the development of an Artificial Intelligence (AI)-assisted peer review system, we are intrigued with: What if there is an additional AI reviewer which predicts decisions by learning the high-level interplay between the review texts and the papers? How would the sentiment embedded within the review texts empower such decision-making?"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-25",
"text": "Although editors/program chairs usually go by the majority of the reviewer recommendations, they still need to go through all the review texts corresponding to all the submissions."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-26",
"text": "A good use case of this research would be: slot-filling the missing reviewer, providing an additional perspective to the editor in cases of contrasting/borderline reviews."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-27",
"text": "This work in no way attempts to replace the human reviewers; instead, we are intrigued to see how an AI can act as an additional reviewer with inputs from her human counterparts and aid the decision-making in the peer review process."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-28",
"text": "We develop a deep neural architecture incorporating full paper information and review text along with the associated sentiment to predict the acceptability and recommendation score of a given research article."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-29",
"text": "We perform two tasks, a classification (predicting accept/reject decision) and a regression (predicting recommendation score) one."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-30",
"text": "The evaluation shows that our proposed model successfully outperforms the earlier reported results in PeerRead."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-31",
"text": "We also show that the addition of review sentiment component significantly enhances the predictive capability of such a system."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-33",
"text": "**RELATED WORK**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-34",
"text": "Artificial Intelligence in academic peer review is an important yet less explored territory."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-35",
"text": "However, with the recent progress in AI research, the topic is gradually gaining attention from the community."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-36",
"text": "Price and Flach (2017) did a thorough study of the various means of computational support to the peer review system."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-37",
"text": "Mrowinski et al. (2017) explored an evolutionary algorithm to improve editorial strategies in peer review."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-38",
"text": "The famous Toronto Paper Matching system (Charlin and Zemel, 2013) was developed to match paper with reviewers."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-39",
"text": "Recently we (Ghosal et al., 2018b,a) investigated the impact of various features in the editorial pre-screening process."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-86",
"text": "For a paper P, the final output of this convolution filter is then given as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-40",
"text": "Wang and Wan (2018) explored a multi-instance learning framework for sentiment analysis from the peer review texts."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-41",
"text": "We carry our current investigations on a portion of the recently released PeerRead dataset (Kang et al., 2018) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-42",
"text": "Study towards automated support for peer review was otherwise not possible due to the lack of rejected paper instances and corresponding reviews."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-43",
"text": "Our approach achieves significant performance improvement over the two tasks defined in Kang et al. (2018) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-44",
"text": "We attribute this to the use of deep neural networks and augmentation of review sentiment information in our architecture."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-46",
"text": "**DATA DESCRIPTION AND ANALYSIS**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-47",
"text": "The PeerRead dataset consists of papers, a set of associated peer reviews, and corresponding accept/reject decisions with aspect specific scores of papers collected from several top-tier Artificial Intelligence (AI), Natural Language Processing (NLP) and Machine Learning (ML) conferences."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-48",
"text": "Table 1 shows the data we consider in our experiments."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-49",
"text": "We could not consider NIPS and arXiv portions of PeerRead due to the lack of aspect scores and reviews, respectively."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-50",
"text": "For more details on the dataset creation and the task, we request the readers to refer to Kang et al. (2018) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-51",
"text": "We further use the submissions of ICLR 2018, corresponding reviews and aspect scores to boost our training set for the decision prediction task."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-52",
"text": "One motivation of our work stems from the finding that aspect scores for certain factors like Impact, Originality, Soundness/Correctness which are seemingly central to the merit of the paper, often have very low correlation with the final recommendation made by the reviewers as is made evident in Kang et al. (2018) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-53",
"text": "However, from the heatmap in Figure 1 we can see that the reviewer's sentiments (compound/positive) embedded within the review texts have visible correlations with the aspects like Recommendation, Appropriateness and Overall Decision."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-54",
"text": "This also seconds our recent finding that determining the scope or appropriateness of an article to a venue is the first essential step in peer review (Ghosal et al., 2018a) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-55",
"text": "Since our study aims at deciding the fate of the paper, we take predicting recommendation score and overall decision as the objectives of our investigation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-56",
"text": "Thus our proposal to augment sentiment of reviews to the deep neural architecture seems intuitive."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-58",
"text": "**METHODOLOGY**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-60",
"text": "**PRE-PROCESSING**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-87",
"text": "F is the total number of filters used."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-61",
"text": "At the very beginning, we convert the papers in PDF to .json encoded files using the Science Parse 3 library."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-62",
"text": "Figure 2 illustrates the overall architecture we employ in our investigation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-63",
"text": "The left segment is for the decision prediction while the right segment predicts the overall recommendation score."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-65",
"text": "**DEEPSENTIPEER ARCHITECTURE**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-67",
"text": "**DOCUMENT ENCODING**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-68",
"text": "We extract full-text sentences from each research article and represent each sentence s i \u2208 R d using the Transformer variant of the Universal Sentence 3 https://github.com/allenai/science-parse Encoder (USE) (Cer et al., 2018) , d is the dimension of the sentence semantic vector which is 512."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-69",
"text": "A paper is then represented as,"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-70",
"text": "\u2295 being the concatenation operator, n 1 is the maximum number of sentences in a paper text in the entire dataset (padding is done wherever necessary)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-71",
"text": "Similarly, we do this for each of the reviews and create a review representation as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-72",
"text": "n 2 being the maximum number of sentences in the reviews."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-74",
"text": "**SENTIMENT ENCODING**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-75",
"text": "The sentiment encoding of the review is done using VADER Sentiment Analyzer."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-76",
"text": "For a sentence s i , VADER gives a vector S i , S i \u2208 R 4 ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-77",
"text": "The review is then encoded (padded where necessary) for sentiment as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-79",
"text": "**FEATURE EXTRACTION WITH CONVOLUTIONAL NEURAL NETWORK**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-80",
"text": "We make use of a Convolutional Neural Network (CNN) to extract features from both the paper and review representations."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-81",
"text": "CNN has shown great success in solving the NLP problems in recent years."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-82",
"text": "The convolution operation works by sliding a filter W f k \u2208 R l\u00d7d to a window of length l, the output of such h th window is given as,"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-83",
"text": "X h\u2212l+1:h means the l sentences within the h th window in Paper P. b k is the bias for the k th filter, g() is the non-linear function."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-84",
"text": "The feature map f k for the k th filter is then obtained by applying this filter to each possible window of sentences in the P as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-85",
"text": "We then apply a max-pooling operation to this filter map to get the most significant feature,f k a\u015d"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-88",
"text": "In the same way, we can get r as the output of the convolution operator for the Review R."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-89",
"text": "We call the outputs p and r as the high-level representation feature vector of the paper and the review, respectively."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-90",
"text": "We then concatenate these feature vectors (Feature-Level Fusion)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-91",
"text": "The reason we extract features from both is to simulate the editorial workflow, wherein ideally, the editor/chair would look at both into the paper and the corresponding reviews to arrive at a judgement."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-93",
"text": "**MULTI-LAYER PERCEPTRON**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-94",
"text": "We employ a Multi-Layer Perceptron (MLP Predict) to take the joint paper+review representations x pr as input to get the final (Kang et al., 2018) , RMSE\u2192Root Mean Squared Error."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-95",
"text": "CNN variant as in (Kang et al., 2018 ) is used as the comparing system."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-96",
"text": "representation as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-97",
"text": "where \u03b8 predict represents the parameters of the MLP Predict."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-98",
"text": "We also extract features from the review sentiment representation x rs via another MLP (MLP Senti)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-99",
"text": "\u03b8 senti being the parameters of MLP Senti."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-100",
"text": "Finally, we fuse the extracted review sentiment feature and joint paper+review representation together to generate the overall recommendation score (DecisionLevel Fusion) using the affine transformation as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-101",
"text": "We minimize the Mean Square Error (MSE) between the actual and predicted recommendation score."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-102",
"text": "The motivation here is to augment the human judgement (review+embedded sentiment) regarding the quality of a paper in decision making."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-103",
"text": "The long-term objective is to have the AI learn the notion of good and bad papers from the human perception reflected in peer reviews in correspondence with paper full-text."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-105",
"text": "**ACCEPT/REJECT DECISIONS**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-106",
"text": "Instead of training the deep network on overall recommendation scores, we train the network with the final decisions of the papers in a classification setting."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-107",
"text": "The entire setup is same but we concatenate all the reviews of a particular paper together to get the review representation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-108",
"text": "And rather than doing decision-level fusion, we perform featurelevel fusion where the decision is given as"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-109",
"text": "where c is the output classification distribution across accept or reject classes."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-110",
"text": "r is the high-level representation of review text after concatenating all reviews corresponding to a paper and x rs is the output of MLP Senti on the concatenated review text."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-111",
"text": "We minimize Cross-Entropy Loss between predicted c and actual decisions."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-113",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-114",
"text": "As we mention earlier, we undertake two tasks: Task 1: Predicting the overall recommendation score (Regression) and Task 2: Predicting the Accept/Reject Decision (Classification)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-115",
"text": "To compare with Kang et al. (2018) , we keep the experimental setup (train vs test ratio) identical and re-implement their codes to generate the comparing figures."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-116",
"text": "However, Kang et al. (2018) performed Task 2 on ICLR 2017 dataset with handcrafted features, and Task 1 in a deep learning setting."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-117",
"text": "Since our approach is a deep neural network based, we crawl additional paper+reviews from ICLR 2018 to boost the training set."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-118",
"text": "For Task 1, n 1 is 666 and n 2 is 98 while for Task 2, n 1 is 1494 and n 2 is 525."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-119",
"text": "We employ a grid search for hyperparameter optimization."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-120",
"text": "For Task 1, F is 256, l is 5. ReLU is the non-linear function g(), learning rate is 0.007."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-121",
"text": "We train the model with SGD optimizer, set momentum as 0.9 (Kang et al., 2018 ) is feature-based and considers only paper, and not the reviews."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-122",
"text": "and batch size as 32."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-123",
"text": "We keep dropout at 0.5."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-124",
"text": "We use the same number of filters with the same kernel size for both paper and review."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-125",
"text": "In Task 2, for Paper CNN F is 128, l is 7 and for Review CNN F is 64 and l is 5."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-126",
"text": "Again we train the model with Adam Optimizer, keep the batch size as 64 and use 0.7 as the dropout rate to prevent overfitting."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-127",
"text": "We intentionally keep our CNN/MLP shallow due to less training data."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-128",
"text": "We make our codes 4 available for further explorations."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-129",
"text": "Table 2 and Table 3 show our results for both the tasks."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-130",
"text": "We propose a simple but effective architecture in this work since our primary intent is to establish that a sentiment-aware deep architecture would better suit these two problems."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-131",
"text": "For Task 1, we can see that our review sentiment augmented approach outperforms the baselines and the comparing systems by a wide margin (\u223c 29% reduction in error) on the ICLR 2017 dataset."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-132",
"text": "With only using review+sentiment information, we are still able to outperform Kang et al. (2018) by a margin of 11% in terms of RMSE."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-133",
"text": "A further relative error reduction of 19% with the addition of paper features strongly suggests that only review is not sufficient for the final recommendation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-134",
"text": "A joint model of the paper content and review text (the human touch) augmented with the underlying sentiment would efficiently guide the prediction."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-136",
"text": "**RESULTS AND ANALYSIS**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-137",
"text": "For Task 2, we observe that the handcrafted feature-based system by Kang et al. (2018) performs inferior compared to the baselines."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-138",
"text": "This is because the features were very naive and did not 4 https://github.com/aritzzz/DeepSentiPeer address the complexity involved in such a task."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-139",
"text": "We perform better with a relative improvement of 28% in terms of accuracy, and also our system is end-toend trained."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-140",
"text": "Presumably, to some extent, our deep neural network learned to distinguish between the probable accept versus probable reject by extracting useful information from the paper and review data."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-142",
"text": "**CROSS-DOMAIN EXPERIMENTS**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-143",
"text": "With the additional (but less) data of ACL 2017 and CoNLL 2016 in PeerRead, we perform the cross-domain experiments."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-144",
"text": "We do training with the ICLR data (core Machine Learning papers) and take the test set from the NLP conferences (ACL/CoNLL)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-145",
"text": "NLP nowadays is mostly machine learning (ML) centric, where we find several applications and extensive usage of ML algorithms to address different NLP problems."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-146",
"text": "Here we observe a relative error reduction of 4.8% and 14.5% over the comparing system for ACL 2017 and CoNLL 2016, respectively (Table 2 )."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-147",
"text": "For the decision prediction task, the comparing system performs even worse, and we outperform them by a considerable margin of 28% (ACL 2017) and 26% (CoNLL 2017), respectively ( Table 3 )."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-148",
"text": "The reason is that the work reported in Kang et al. (2018) relies on elementary handcrafted features extracted only from the paper; does not consider the review features whereas we include the review features along with the sentiment information in our deep neural architecture."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-149",
"text": "However, we also find that our approach with only Review+Sentiment performs inferior to the Paper+Review method in Kang et al. (2018) for ACL 2017."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-150",
"text": "This again seconds that inclusion of paper is vital in recommendation decisions."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-151",
"text": "Only paper is enough for a human reviewer, but with the current state of AI, an AI reviewer would need the supervision of her human counterparts to arrive at a recommendation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-152",
"text": "So our system is suited to cases where the editor needs an additional judgment regarding a submission (such as dealing with missing/non-responding reviewers, an added layer of confidence with an AI which is aware of the past acceptances/rejections of a specific venue)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-154",
"text": "**ANALYSIS: EFFECT OF SENTIMENT ON**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-155",
"text": "Reviewer's Recommendation Figure 3 shows the output activations 5 from the final layer of MLP Senti against the predicted recommendation scores."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-156",
"text": "We can see that the papers are discriminated into visible clusters according to their recommendation scores."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-157",
"text": "This proves that DeepSentiPeer can extract useful features in close correspondence to human judgments."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-158",
"text": "From Figure 3 and Table 4 , we see that the sentiment activations are strongly correlated (negatively) with 5 We call them as Sentiment Activations the actual and predicted recommendation scores."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-159",
"text": "Therefore, we hypothesize that our model draws considerable strength if the review text has proper sentiment embedded in it."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-160",
"text": "To further investigate this, we sample the papers/reviews from the ICLR 2017 test set."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-161",
"text": "We consider actual review text and the sentiment embedded therein to examine the performance of the system (See Table 5 )."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-162",
"text": "We truncate the lengthy review texts and provide the OpenReview links for reference."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-163",
"text": "Appendix A shows the heatmaps of Vader sentiment scores generated for individual sentences corresponding to each paper review in Table 5 ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-164",
"text": "We hereby acknowledge that since the scholarly review texts are mostly objective and not straightforward, the score for neutral polarity is strong as opposed to positive, and negative."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-165",
"text": "But still, we can see visible polarities for review sentences which are positive or negative in sentiment."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-166",
"text": "For instance, the second last sentence(s9): \"The paper is not well written either\" from R1 has visible negative weight in the heatmap ( Figure 5 in Appendix A)."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-167",
"text": "Same can be observed for the other review sentences as well."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-168",
"text": "Besides the objective evaluation of the paper in the peer reviews, the reviewer's opinion in the peer review text holds strong correspondence with the overall recommendation score."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-169",
"text": "We can qualitatively see that the reviews R1, R2, and R3 are polarized towards the negative sentiment (Table 5) ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-170",
"text": "Our model can efficiently predict a reasonable recommendation score with respect to human judgment."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-171",
"text": "Same we can say for R7 where the review mostly signifies a positive sentiment polarity."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-172",
"text": "R6 provides an interesting observation."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-173",
"text": "We see that the review R6 is not very expressive for such a"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-175",
"text": "**ACC**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-177",
"text": "**REJ**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-178",
"text": "Multi-label learning with the RNNs for Fashion Search -The technical contribution of this paper is not clear."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-179",
"text": "Most of the approaches used are standard state-of-art methods and there are not much novelties."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-180",
"text": "For a multi-label recognition task, there are other available methods, e.g. using binary models, changing cross-entropy loss function, etc."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-181",
"text": "There is not any comparison between the RNN method and other simple baselines."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-182",
"text": "The order of the sequential RNN prediction is not clear either."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-183",
"text": "It seems that the attributes form a tree hierarchy, and that is used as the order of sequence."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-184",
"text": "The paper is not well written either.-https://openreview.net/forum?id=HyWDCXjgx¬eId=B1Mp8grVl 4 3 0.01 R2 Transformation based Models of Video Sequences -While I agree with the authors on these points, I also find that the paper suffer from important flaws."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-185",
"text": "Specifically: -the choice of not comparing with previous approaches in term of pixel prediction error seems very \"convenient\", to say the least."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-186",
"text": "While it is clear that the evaluation metric is imperfect, it is not a reason to completely dismiss all quantitative comparisons with previous work."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-187",
"text": "The frames output by the network on, e.g. the moving digits datasets (Figure 4) , looks ok and can definitely be compared with other papers."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-188",
"text": "Yet, the authors chose not to, which is suspicious.-https://openreview.net/forum?id=HkxAAvcxx¬eId= SJE7-lkVx Furthermore, the corruption mechanism is nothing other than traditional dropout on the input layer."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-189",
"text": "Coupled with the word2vec-style loss and training methods, this paper offers little on the novelty front."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-190",
"text": "On the other hand, it is very efficient at generation time, requiring only an average of the word embeddings rather than a complicated inference step as in Doc2Vec."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-191",
"text": "Moreover, by construction, the embedding captures salient global information about the document -it captures specifically that information that aids in local-context prediction."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-192",
"text": "For such a simple model, the performance on sentiment analysis and document classification is quite encouraging."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-193",
"text": "Overall, despite the lack of novelty, the simplicity, efficiency, and performance of this model make it worthy of wider readership and study, and I recommend acceptance.-https://openreview.net/ forum?id=B1Igu2ogg¬eId=rJBM9YbVg 6 7 -1.04"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-194",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-195",
"text": "**R5 R5**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-196",
"text": "Towards a Neural Statistician -Hierarchical modeling is an important and high impact problem, and I think that it's underexplored in the Deep Learning literature."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-197",
"text": "Pros:-The few-shot learning results look good, but I' mm not an expert in this area.-The idea of using a \"double\" variational bound in a hierarchical generative model is well presented and seems widely applicable."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-198",
"text": "Questions:-When training the statistic network, are minibatches (i.e. subsets of the examples) used?-If not, does using minibatches actually give you an unbiased estimator of the full gradient (if you had used all examples)?"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-199",
"text": "For example, what if the statistic network wants to pull out if *any* example from the dataset has a certain feature and treat that as the characterization."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-200",
"text": "This seems to fit the graphical model on the right side of figure 1."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-201",
"text": "If your statistic network is trained on minibatches, it won't be able to learn this characterization, because a given minibatch will be missing some of the examples from the dataset."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-202",
"text": "Using minibatches (as opposed to using all examples in the dataset) to train the statistic network seems like it would limit the expressive power of the modelhttps://openreview.net/forum?id=HJDBUF5le¬eId=HyWm1orEx The authors of the paper set out to answer the question whether chaotic behaviour is a necessary ingredient for RNNs to perform well on some tasks."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-203",
"text": "For that question's sake,they propose an architecture which is designed to not have chaos."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-204",
"text": "The subsequent experiments validate the claim that chaos is not necessary."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-205",
"text": "This paper is refreshing."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-206",
"text": "Instead of proposing another incremental improvement, the authors start out with a clear hypothesis and test it."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-207",
"text": "This might set the base for future design principles of RNNs."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-208",
"text": "The only downside is that the experiments are only conducted on tasks which are known to be not that demanding from a dynamical systems perspective; it would have been nice if the authors had traversed the set of data sets more to find data where chaos is actually necessary."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-209",
"text": "https://openreview.net/forum?id=S1dIzvclg¬eId= H1LYxY84l The author propose to use a off-policy actor-critic algorithm in a batch-setting to improve chatbots."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-210",
"text": "The approach is well motivated and the paper is well written, except for some intuitions for why the batch version outperforms the on-line version (see comments on \"clarification regarding batch vs. online setting\").The artificial experiments are instructive, and the real-world experiments were performed very thoroughly although the results show only modest improvement."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-211",
"text": "https://openreview.net/forum?id=rJfMusFll¬eId=H1bSmrx4x 7 7 -1.77 Table 5 : A qualitative study of the effect of sentiment in the overall recommendation score prediction."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-212",
"text": "Prediction \u2192 is the overall recommendation score predicted by our system, Actual \u2192 is the recommendation score given by reviewers."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-213",
"text": "Senti Act are the output activations from the final layer of MLP Senti which are augmented to the decision layer for final recommendation score prediction."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-214",
"text": "The correspondence between the sentiment embedded within the review texts and Sentiment Activations are fairly visible in Figure 3 ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-215",
"text": "Kindly refer to Appendix A for polarity strengths in individual review sentences."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-216",
"text": "The OpenReview links in the table above give the full review texts."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-217",
"text": "high recommendation score 8."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-218",
"text": "It starts with introducing the authors work and listing the strengths and limitations of the work without much (and necessary) details."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-219",
"text": "Our model hence predicts 5 as the recommendation score."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-220",
"text": "Whereas R4 can be seen as the case of a usual well-written review, expressing the positive and negative aspects of the paper coherently."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-221",
"text": "Our model predicts 6 for an actual recommendation score of 7."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-222",
"text": "These validate the role of the reviewer's opinion and sentiment to predict the recommendation score, and our model is competent enough to take into account the overall polarity of the review-text to drive the prediction."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-223",
"text": "Figure 4 presents the confusion matrix of our proposed model on ICLR 2017 test data for Task 2."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-224",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-225",
"text": "**CONCLUSION**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-226",
"text": "Here in this work, we show that the reviewer sentiment information embedded within peer review texts could be leveraged to predict the peer review outcomes."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-227",
"text": "Our deep neural architecture makes use of three information channels: the paper full-text, corresponding peer review texts and the sentiment within the reviews to address the complex task of decision making in peer review."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-228",
"text": "With further exploration, we aim to mould the ongoing research to an efficient AI-enabled system that would assist the journal editors or conference chairs in making informed decisions."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-229",
"text": "However, considering the sensitivity of the topic, we would like to further dive deep into exploring the subtle nuances that leads into the grading of peer review aspects."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-230",
"text": "We found that review reliability prediction should prelude these tasks since not all reviews are of equal quality or are significant to the final decision making."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-231",
"text": "We aim to include review reliability prediction in the pipeline of our future work."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-232",
"text": "However, we are in consensus that scholarly language processing is not straightforward."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-233",
"text": "We need stronger, pervasive models to capture the high-level interplay of the paper and peer reviews to decide the fate of a manuscript."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-234",
"text": "We intend to work upon those and also explore more sophisticated techniques for sentiment polarity encoding."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-235",
"text": "----------------------------------"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-236",
"text": "**A HEATMAPS DEPICTING SENTIMENT POLARITY IN REVIEW TEXTS**"
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-237",
"text": "Figure 5: Heatmaps of the sentence-wise VADER sentiment polarity of reviews considered in Table 4 ."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-238",
"text": "Reviews generally reflect the polarity of the reviewer towards the respective work."
},
{
"sent_id": "dbb0178b572c2a451853737910ac86-C001-239",
"text": "s0...sn \u2192 are the sentences in the peer review texts."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"dbb0178b572c2a451853737910ac86-C001-23"
],
[
"dbb0178b572c2a451853737910ac86-C001-50"
]
],
"cite_sentences": [
"dbb0178b572c2a451853737910ac86-C001-23",
"dbb0178b572c2a451853737910ac86-C001-50"
]
},
"@USE@": {
"gold_contexts": [
[
"dbb0178b572c2a451853737910ac86-C001-41"
],
[
"dbb0178b572c2a451853737910ac86-C001-94"
],
[
"dbb0178b572c2a451853737910ac86-C001-95"
],
[
"dbb0178b572c2a451853737910ac86-C001-115"
],
[
"dbb0178b572c2a451853737910ac86-C001-121"
]
],
"cite_sentences": [
"dbb0178b572c2a451853737910ac86-C001-41",
"dbb0178b572c2a451853737910ac86-C001-94",
"dbb0178b572c2a451853737910ac86-C001-95",
"dbb0178b572c2a451853737910ac86-C001-115",
"dbb0178b572c2a451853737910ac86-C001-121"
]
},
"@DIF@": {
"gold_contexts": [
[
"dbb0178b572c2a451853737910ac86-C001-43"
],
[
"dbb0178b572c2a451853737910ac86-C001-116",
"dbb0178b572c2a451853737910ac86-C001-117"
],
[
"dbb0178b572c2a451853737910ac86-C001-132"
],
[
"dbb0178b572c2a451853737910ac86-C001-137"
],
[
"dbb0178b572c2a451853737910ac86-C001-148"
],
[
"dbb0178b572c2a451853737910ac86-C001-149"
]
],
"cite_sentences": [
"dbb0178b572c2a451853737910ac86-C001-43",
"dbb0178b572c2a451853737910ac86-C001-116",
"dbb0178b572c2a451853737910ac86-C001-132",
"dbb0178b572c2a451853737910ac86-C001-137",
"dbb0178b572c2a451853737910ac86-C001-148",
"dbb0178b572c2a451853737910ac86-C001-149"
]
},
"@MOT@": {
"gold_contexts": [
[
"dbb0178b572c2a451853737910ac86-C001-52"
]
],
"cite_sentences": [
"dbb0178b572c2a451853737910ac86-C001-52"
]
},
"@EXT@": {
"gold_contexts": [
[
"dbb0178b572c2a451853737910ac86-C001-116",
"dbb0178b572c2a451853737910ac86-C001-117"
]
],
"cite_sentences": [
"dbb0178b572c2a451853737910ac86-C001-116"
]
}
}
},
"ABC_5095f2af3f0c51283c8fbee08a17ac_7": {
"x": [
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-2",
"text": "Multi-hop reading comprehension (RC) across documents poses new challenge over singledocument RC because it requires reasoning over multiple documents to reach the final answer."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-3",
"text": "In this paper, we propose a new model to tackle the multi-hop RC problem."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-4",
"text": "We introduce a heterogeneous graph with different types of nodes and edges, which is named as Heterogeneous Document-Entity (HDE) graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-5",
"text": "The advantage of HDE graph is that it contains different granularity levels of information including candidates, documents and entities in specific document contexts."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-6",
"text": "Our proposed model can do reasoning over the HDE graph with nodes representation initialized with co-attention and self-attention based context encoders."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-7",
"text": "We employ Graph Neural Networks (GNN) based message passing algorithms to accumulate evidences on the proposed HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-8",
"text": "Evaluated on the blind test set of the Qangaroo WIKIHOP data set, our HDE graph based model (single model) achieves state-of-the-art result."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-11",
"text": "Being able to comprehend a document and output correct answer given a query/question about content in the document, often referred as machine reading comprehension (RC) or question answering (QA), is an important and challenging task in natural language processing (NLP)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-12",
"text": "Plenty of data sets have been constructed to facilitate research on this topic, such as SQuAD (Rajpurkar et al., 2016 (Rajpurkar et al., , 2018 , NarrativeQA (Ko\u010disk\u1ef3 et al., 2018) and CoQA (Reddy et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-13",
"text": "Many neural models have been proposed to tackle the machine RC/QA problem (Seo et al., 2016; Xiong et al., 2016; Tay et al., 2018) , and great success has been achieved, especially after the release of the BERT (Devlin et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-14",
"text": "Query: record label get ready Support doc 1: Mason Durell Betha (born August 27, 1977) , better known by stage name Mase (formerly often stylized Ma$e or MA$E), is an American hip hop recording artist and minister."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-15",
"text": "He is best known for being signed to Sean \"Diddy\" Combs's label Bad Boy Records. . . ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-16",
"text": "Support doc 2: \"Get Ready\" was the only single released from Mase's second album, Double Up."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-17",
"text": "It was released on May 25, 1999, produced by Sean \"Puffy\" Combs, Teddy Riley and Andreao \"Fanatic\" Heard and featured R&B group, Blackstreet, it contains a sample of \"A Night to Remember\", performed by Shalamar. . . ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-18",
"text": "However, current research mainly focuses on machine RC/QA on a single document or paragraph, and still lacks the ability to do reasoning across multiple documents when a single document is not enough to find the correct answer."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-19",
"text": "To promote the study for multi-hop RC over multiple documents, two data sets are recently proposed: WIKIHOP (Welbl et al., 2018) and HotpotQA ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-20",
"text": "These two data sets require multi-hop reasoning over multiple supporting documents to find the answer."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-21",
"text": "In Figure 1 , we show an excerpt from one sample in WIKIHOP development set to illustrate the need for multi-hop reasoning."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-22",
"text": "Two types of approaches have been proposed on the multi-hop multi-document RC problem."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-23",
"text": "The first is based on previous neural RC models."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-24",
"text": "The earliest attempt in (Dhingra et al., 2018) concatenated all supporting documents and designed a recurrent layer to explicitly exploit the skip connections between entities given automatically gener-ated coreference annotations."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-25",
"text": "Adding this layer to the neural RC models improved performance on multi-hop tasks."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-26",
"text": "Recently, an attention based system (Zhong et al., 2019) utilizing both documentlevel and entity-level information achieved stateof-the-art results on WIKIHOP data set, proving that techniques like co-attention and self-attention widely employed in single-document RC tasks are also useful in multi-document RC tasks."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-27",
"text": "The second type of research work is based on graph neural networks (GNN) for multi-hop reasoning."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-28",
"text": "The study in Song et al. (2018) In this paper, we propose a new method to solve the multi-hop RC problem across multiple documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-29",
"text": "Inspired by the success of GNN based methods (Song et al., 2018; De Cao et al., 2018) for multi-hop RC, we introduce a new type of graph, called Heterogeneous Document-Entity (HDE) graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-30",
"text": "Our proposed HDE graph has the following advantages:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-31",
"text": "\u2022 Instead of graphs with single type of nodes (Song et al., 2018; De Cao et al., 2018) , the HDE graph contains different types of queryaware nodes representing different granularity levels of information."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-32",
"text": "Specifically, instead of only entity nodes as in (Song et al., 2018; De Cao et al., 2018) , we include nodes corresponding to candidates, documents and entities."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-33",
"text": "In addition, following the success of Coarse-grain Fine-grain Coattention (CFC) network (Zhong et al., 2019) , we apply both co-attention and self-attention to learn queryaware node representations of candidates, documents and entities;"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-34",
"text": "\u2022 The HDE graph enables rich information interaction among different types of nodes thus facilitate accurate reasoning."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-35",
"text": "Different types of nodes are connected with different types of edges to highlight the various structural information presented among query, document and candidates."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-36",
"text": "Through ablation studies, we show the effectiveness of our proposed HDE graph for multi-hop multi-document RC task."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-37",
"text": "Evaluated on the blind test set of WIKIHOP, our proposed end-to-end trained single neural model beats the current stateof-the-art results in (Zhong et al., 2019) 1 , without using pretrained contextual ELMo embedding (Peters et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-39",
"text": "**RELATED WORK**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-40",
"text": "The study presented in this paper is directly related to existing research on multi-hop reading comprehension across multiple documents (Dhingra et al., 2018; Song et al., 2018; De Cao et al., 2018; Zhong et al., 2019; Kundu et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-41",
"text": "The method presented in this paper is similar to previous studies using GNN for multi-hop reasoning (Song et al., 2018; De Cao et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-42",
"text": "Our novelty is that we propose to use a heterogeneous graph instead of a graph with single type of nodes to incorporate different granularity levels of information."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-43",
"text": "The co-attention and self-attention based encoding of multi-level information presented in each input is also inspired by the CFC model (Zhong et al., 2019) because they show the effectiveness of attention mechanisms."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-44",
"text": "Our model is very different from the other two studies (Dhingra et al., 2018; Kundu et al., 2018) : these two studies both explicitly score the possible reasoning paths with extra NER or coreference resolution systems while our method does not require these modules and we do multi-hop reasoning over graphs."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-45",
"text": "Besides these studies, our work is also related to the following research directions."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-46",
"text": "Multi-hop RC: There exist several different data sets that require reasoning in multiple steps in literature, for example bAbI (Weston et al., 2015) , MultiRC (Khashabi et al., 2018) and OpenBookQA (Mihaylov et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-47",
"text": "A lot of systems have been proposed to solve the multi-hop RC problem with these data sets (Sun et al., 2018; Wu et al., 2019) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-48",
"text": "However, these data sets require multi-hop reasoning over multiple sentences or multiple common knowledge while the problem we want to solve in this paper requires collecting evidences across multiple documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-49",
"text": "GNN for NLP: Recently, there is considerable amount of interest in applying GNN to NLP tasks and great success has been achieved."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-50",
"text": "For exam-ple, in neural machine translation, GNN has been employed to integrate syntactic and semantic information into encoders (Bastings et al., 2017; Marcheggiani et al., 2018) ; applied GNN to relation extraction over pruned dependency trees; the study by Yao et al. (2018) employed GNN over a heterogeneous graph to do text classification, which inspires our idea of the HDE graph; Liu et al. (2018) proposed a new contextualized neural network for sequence learning by leveraging various types of non-local contextual information in the form of information passing over GNN."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-51",
"text": "These studies are related to our work in the sense that we both use GNN to improve the information interaction over long context or across documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-53",
"text": "**METHODOLOGY**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-54",
"text": "In this section, we describe different modules of the proposed Heterogeneous Document-Entity (HDE) graph-based multi-hop RC model."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-55",
"text": "The overall system diagram is shown in Figure 2 ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-56",
"text": "Our model can be roughly categorized into three parts: initializing HDE graph nodes with co-attention and self-attention based context encoding, reasoning over HDE graph with GNN based message passing algorithms and score accumulation from updated HDE graph nodes representations."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-58",
"text": "**CONTEXT ENCODING**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-59",
"text": "Given a query q with the form of (s, r, ?) which represents subject, relation and unknown object respectively, a set of support documents S q and a set of candidates C q , the task is to predict the correct answer a * to the query."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-60",
"text": "To encode information including in the text of query, candidates and support documents, we use a pretrained embedding matrix (Pennington et al., 2014) to convert word sequences to sequences of vectors."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-61",
"text": "Let X q \u2208 R lq\u00d7d , X i s \u2208 R l i s \u00d7d and X j c \u2208 R l j c \u00d7d represent the embedding matrices of query, i-th supporting document and j-th candidate of a sample, where l q , l i s and l j c are the numbers of words in query, i-th supporting document and j-th candidate respectively."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-62",
"text": "d is the dimension of the word embedding."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-63",
"text": "We use bidirectional recurrent neural networks (RNN) with gated recurrent unit (GRU) (Cho et al., 2014) to encode the contextual information present in the query, supporting documents and candidates separately."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-64",
"text": "The output of query, document and candidate encoders are H q \u2208 R lq\u00d7h , H i s \u2208 R l i s \u00d7h and Figure 2: System diagram."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-65",
"text": "S and C are the number of support documents and candidates respectively."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-66",
"text": "We use yellow nodes to represent query-aware candidate representation, blue nodes to represent extracted queryaware entity representation and green nodes to represent query-aware document representation."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-67",
"text": "Entity extraction: entities play an import role in bridging multiple documents and connecting a query and the corresponding answer as shown in figure 1."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-68",
"text": "For example, the entity \"get ready\" in query and two entities \"Mase\" and \"Sean Combs\" co-occur in the 2nd support document, and both \"Mase\" and \"Sean Combs\" can lead to the correct answer \"bad boy records\"."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-69",
"text": "Based on this observation, we propose to extract mentions of both query subject s and candidates C q from documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-70",
"text": "We will show later that by including mentions of query subject the performance can be improved."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-71",
"text": "We use simple exact match strategy (De Cao et al., 2018; Zhong et al., 2019) to find the locations of mentions of query subject and candidates, i.e. we need the start and end positions of each mention."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-72",
"text": "Each mention is treated as an entity."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-73",
"text": "Then, representations of entities can be taken out from the i-th document encoding H i s ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-74",
"text": "We denote an entity's representation as M \u2208 R lm\u00d7h where l m is the length of the entity."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-75",
"text": "Co-attention: Co-attention has achieved great success for single document reading comprehension tasks (Seo et al., 2016; Xiong et al., 2016) , and recently was applied to multiple-hop reading comprehension (Zhong et al., 2019) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-76",
"text": "Coattention enables the model to combine learned query contextual information attended by document and document contextual information attended by query, with inputs of one query and one document."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-77",
"text": "We follow the implementation of coattention in (Zhong et al., 2019) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-78",
"text": "We use the co-attention between a query and a supporting document for illustration."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-79",
"text": "Same operations can be applied to other documents, or between the query and extracted entities."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-80",
"text": "Given RNN-encoded sequences of the query H q \u2208 R lq\u00d7h and a document H i s \u2208 R l i s \u00d7h , the affinity matrix between the query and document can be calculated as"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-102",
"text": "Graph building: let a HDE graph be denoted as G = {V, E}, where V stands for node representations and E represents edges between nodes."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-103",
"text": "In our proposed HDE graph based model, we treat each document, candidate and entity extracted from documents as nodes in the HDE graph, i.e., each document (candidate/entity) corresponds to one node in the HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-104",
"text": "These nodes represent different granularity levels of query-aware information: document nodes encode documentlevel global information regarding to the query; candidate nodes encode query-aware information in candidates; entity nodes encode query-aware information in specific document context or the query subject."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-105",
"text": "The HDE graph is built to enable graph-based reasoning."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-106",
"text": "It exploits useful structural information among query, support documents and candidates."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-107",
"text": "We expect our HDE graph could perform multi-hop reasoning to locate the answer nodes or entity nodes of answers given a query."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-108",
"text": "Self-attentive pooling generates vector representations for each candidate, document and entity, which can be directly employed to initialize the node representations V. For edge connections E, we define the following types of edges between pairs of nodes to encode various structural information in the HDE graph:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-109",
"text": "1. an edge between a document node and a candidate node if the candidate appear in the document at least one time."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-110",
"text": "2. an edge between a document node and an entity node if the entity is extracted from the document."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-111",
"text": "3. an edge between a candidate node and an entity node if the entity is a mention of the candidate."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-112",
"text": "4. an edge between two entity nodes if they are extracted from the same document."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-113",
"text": "5. an edge between two entity nodes if they are mentions of the same candidate or query subject and they are extracted from different documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-114",
"text": "6. all candidate nodes connect with each other."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-115",
"text": "7. entity nodes that do not meet previous conditions are connected."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-116",
"text": "Type 4, 5, 7 edges are also employed in (De Cao et al., 2018) where the authors show the effectiveness of those different types of edges."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-117",
"text": "Similarly, we treat these different edges differently to make information propagate differently over these seven different types of edges."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-118",
"text": "More details will be introduced in next paragraph about message passing figure."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-119",
"text": "over the HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-120",
"text": "In Figure 3 , we illustrate a toy example of the proposed HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-121",
"text": "Message passing: we define how information propagates over the graph in order to do reasoning over the HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-122",
"text": "Different variants of GNN have different implementations of message passing strategies."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-123",
"text": "In this study, we follow the message passing design in GCN (Kipf and Welling, 2016; De Cao et al., 2018) as it gives good performance on validation set compared to other strategies (Veli\u010dkovi\u0107 et al., 2017; Xu et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-124",
"text": "Generally, the message passing over graphs can be achieved in two steps: aggregation and combination (Hamilton et al., 2017) , and this process can be conducted multiple times (usually referred as layers or hops in GNN literature)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-125",
"text": "Here, we give the aggregation and combination formulation of the message passing over the proposed HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-126",
"text": "The first step aggregates information from neighbors of each node, which can be formulated as"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-127",
"text": "where R is the set of all edge types, N r i is the neighbors of node i with edge type r and h k j is the node representation of node j in layer k (h 0 j initialized with self-attention outputs)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-128",
"text": "|\u00b7| indicates the size of the neighboring set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-129",
"text": "f r defines a transformation on the neighboring node representations, and can be implemented with a MLP."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-130",
"text": "z k i represents the aggregated information in layer k for node i, and can be combined with the transformed node i representation:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-131",
"text": "where f s can also be implemented with a MLP."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-132",
"text": "It has been shown that GNN suffers from the smoothing problem if the number of layers is large (Kipf and Welling, 2016) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-133",
"text": "The smoothing problem can result in similar nodes representation and lose the discriminative ability when doing classification on nodes."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-134",
"text": "To tackle this problem, we add a gating mechanism (Gilmer et al., 2017) on the combined information u k i ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-135",
"text": "(11) sigmoid(\u00b7) denotes the sigmoid function on transformed concatenation of u k i and h k i ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-136",
"text": "g k i is then applied to the combined information to control the amount information from computed update or from the original node representation."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-137",
"text": "tanh(\u00b7) functions as a non-linear activation function."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-138",
"text": "denotes element-wise multiplication."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-139",
"text": "In this study, f r , f s and f g are all implemented with single-layer MLPs, the output dimension of which is 2h."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-140",
"text": "After K times message passing, all candidate, document and entity nodes will have their final updated node representation."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-142",
"text": "**SCORE ACCUMULATION**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-143",
"text": "The final node representations of candidate and entity nodes corresponding to mentions of candidates are used to calculate classification scores."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-144",
"text": "This procedure can be formulated as"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-145",
"text": "where H C \u2208 R C\u00d72h is the node representation of all candidate nodes and C is the number of candidates."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-146",
"text": "H E \u2208 R M \u00d72h is the node representation of all entity nodes that correspond to candidates, and M is the number of those nodes."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-147",
"text": "ACC max is an operation that takes the maximum over scores of entities that belong to the same candidate."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-148",
"text": "f C and f E are implemented with two-layer MLPs with tanh activation function."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-149",
"text": "The hidden layer size is half of the input dimension, and the output dimension is 1."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-150",
"text": "We directly sum the scores from candidate nodes and entity nodes as the final scores over multiple candidates."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-151",
"text": "Thus, the output score vector a \u2208 R C\u00d71 gives a distribution over all candidates."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-152",
"text": "Since the task is multi-class classification, we use cross-entropy loss as training objective which takes a and the labels as input."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-154",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-156",
"text": "**DATASET**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-157",
"text": "We use WIKIHOP (Welbl et al., 2018) to validate the effectiveness of our proposed model."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-158",
"text": "The query of WIKIHOP is constructed with entities and relations from WIKIDATA, while supporting documents are from WIKIREADING (Hewlett et al., 2016) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-159",
"text": "A bipartite graph connecting entities and documents is first built and the answer for each query is located by traversal on this graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-160",
"text": "Candidates that are type-consistent with the answer and share the same relation in query with the answer are included, resulting in a set of candidates."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-161",
"text": "Thus, WIKIHOP is a multi-choice style reading comprehension data set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-162",
"text": "There are totally about 43K samples in training set, 5K samples in development set and 2.5K samples in test set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-163",
"text": "The test set is not provided and can only be evaluated on blindly."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-164",
"text": "The task is to predict the correct answer given a query and multiple supporting documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-165",
"text": "In the experiment, we train our proposed model on all training samples in WIKIHOP, and tune model hyperparameters on all samples in development set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-166",
"text": "We only evaluate our proposed model on the unmasked version of WIKIHOP."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-168",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-169",
"text": "Queries, support documents and candidates are tokenized into word sequences with NLTK (Loper and Bird, 2002) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-170",
"text": "We empirically split the query into relation and subject entity."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-171",
"text": "Exact matching strategy is employed to locate mentions of both subject entity and candidates in supporting documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-172",
"text": "300-dimensional GLoVe embeddings (with 840B tokens and 2.2M vocabulary size) (Pennington et al., 2014) and 100-dimensional character n-gram embeddings (Hashimoto et al., 2017) (Dhingra et al., 2018) 56.0 59.3 MHQA-GRN (Song et al., 2018) 62.8 65.4 Entity-GCN (De Cao et al., 2018) 64.8 67.6 CFC (Zhong et al., 2019) 66.4 70.6 Kundu et al. (2018) 67.1 -Proposed 68.1 70.9 PyTorch (Paszke et al., 2017) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-173",
"text": "More details about experimental and hyperparameter settings can be found in supplementary materials."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-174",
"text": "The performance on development set is measured after each training epoch, and the model with the highest accuracy is saved and submitted to be evaluated on the blind test set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-175",
"text": "We will make our code publicly available after the review process."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-177",
"text": "**RESULTS**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-178",
"text": "In Table 1 , we show the results of the our proposed HDE graph based model on both development and test set and compare it with previously published results."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-179",
"text": "We show that our proposed HDE graph based model improves the state-of-the-art accuracy on development set from 67.1% (Kundu et al., 2018) to 68.1%, on the blind test set from 70.6% (Zhong et al., 2019) to 70.9%."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-180",
"text": "Compared to two previous studies using GNN for multi-hop reading comprehension (Song et al., 2018; De Cao et al., 2018) , our model surpasses them by a large margin even though we do not use better pre-trained contextual embedding ELMo (Peters et al., 2018) ."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-182",
"text": "**ABLATION STUDIES**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-183",
"text": "In order to better understand the contribution of different modules to the performance, we conduct several ablation studies on the development set of WIKIHOP."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-184",
"text": "If we remove the proposed HDE graph and directly use the representations of candidates and entities corresponding to mentions of candidates (equation 7) for score accumulation, the accuracy on WIKIHOP development set drops 2.6% absolutely."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-185",
"text": "This proves the efficacy of the proposed HDE graph on multi-hop reasoning across multiple documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-186",
"text": "If we treat all edge types equally without using different GNN parameters for different edge types (equation 9), the accuracy drops 1.4%, which indicates that different information encoded by different types of edges is also important to retain good performance; If only scores of entity nodes (right part of equation 12) are considered in score accumulation, the accuracy on dev set degrades by 1.0%; if only scores of candidates nodes (left part of equation 12) are considered, the accuracy degrades by 1.5%."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-187",
"text": "This means that the scores on entity nodes contribute more to the classification, which is reasonable because entities carry context information in the document while candidates do not."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-188",
"text": "We also investigate the effect of removing different types of nodes."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-189",
"text": "Note that removing nodes is not the same as removing scores from candidate/entity nodes: the latter means we do not use the scores on these nodes during score accumulation, but the nodes still exist during message passing on the HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-190",
"text": "In contrast, removing a type of node means that those nodes and their corresponding edges no longer exist in the HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-191",
"text": "The ablation shows that removing entity nodes results in the largest degradation of performance, while removing document nodes results in the least degradation."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-192",
"text": "This finding is consistent with the study by De Cao et al. (2018), which emphasizes the importance of entities in multi-hop reasoning."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-193",
"text": "The small contribution of document nodes is probably caused by too much information loss during self-attentive pooling over long sequences."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-194",
"text": "Better ways are needed to encode document information into the graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-195",
"text": "More ablation studies are included in the supplementary materials due to space constraints."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-197",
"text": "**RESULT ANALYSIS**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-198",
"text": "To investigate how the HDE graph helps multi-hop reasoning, we conduct experiments on WIKIHOP development set where we discard the HDE graph and only use the candidate and entity representations output by self-attention."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-199",
"text": "In Table 3, \"Single-follow\" (2069 samples in the dev set) means a single document is enough to answer the query, while \"Multi-follow\" (2601 samples) means multiple documents are needed."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-200",
"text": "This information is provided in Welbl et al. (2018)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-201",
"text": "We observe in Table 2 that the performance is consistently better for \"with HDE graph\" in both cases."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-202",
"text": "In the \"Single-follow\" case the absolute accuracy improvement is 1.1%, while a significant 4.0% improvement is achieved in the \"Multi-follow\" case, which contains even more samples than the \"Single-follow\" case."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-203",
"text": "This proves that the proposed HDE graph is good at reasoning over multiple documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-204",
"text": "We also investigate how our model performs w.r.t."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-205",
"text": "the number of support documents and the number of candidates in an input sample."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-206",
"text": "In Figure 4 , the blue line with square markers shows the number of support documents in one sample (x-axis) and the corresponding frequencies in the development set (y-axis)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-207",
"text": "The orange line with diamond markers shows how accuracy changes as the number of support documents increases."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-208",
"text": "We only consider support-document counts that appear more than 50 times in the development set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-209",
"text": "For example, there are about 300 samples with 5 support documents and the accuracy of our model on these 300 samples is about 80%."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-210",
"text": "Overall, we find that accuracy decreases as the number of support documents increases."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-211",
"text": "This is reasonable because more documents likely mean more entities and a bigger graph, which is more challenging for reasoning."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-212",
"text": "Figure 5 indicates a similar trend (when the number of candidates is less than 20) as the number of candidates increases, which we believe is partly caused by the larger HDE graph."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-213",
"text": "Also, more candidates cause more confusion in the selection."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-214",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-215",
"text": "**CONCLUSION**"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-216",
"text": "We propose a new GNN-based method for multi-hop RC across multiple documents."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-217",
"text": "We introduce the HDE graph, a heterogeneous graph for multi-hop reasoning over nodes representing different granularity levels of information."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-218",
"text": "We use co-attention and self-attention to encode candidates, documents, entities corresponding to mentions of candidates, and query subjects into query-aware representations, which are then employed to initialize the graph node representations."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-219",
"text": "Evaluated on WIKIHOP, our end-to-end trained single neural model achieves state-of-the-art performance on the blind test set."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-220",
"text": "In the future, we would like to investigate explainable GNNs for this task, such as the explicit reasoning paths in Kundu et al. (2018), and to work on other datasets such as HotpotQA."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-81",
"text": "where \u22a4 denotes the matrix transpose."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-82",
"text": "Each entry of the matrix A^i_qs indicates how related two words are, one from the query and one from the document."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-83",
"text": "For simplicity, in the following we omit the superscript i, which indicates the operation on the i-th document."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-84",
"text": "Next we derive the attention context of the query and document as follows:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-85",
"text": "softmax(\u00b7) denotes column-wise normalization."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-86",
"text": "We further encode the co-attended document context using a bidirectional RNN f with GRU units:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-87",
"text": "The final co-attention context is the column-wise concatenation of C_s and D_s:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-88",
"text": "We expect S_ca to carry query-aware contextual information of the supporting documents, as shown by Zhong et al. (2019)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-89",
"text": "The same co-attention module can also be applied to the query and candidates, and to the query and entities (as shown in Figure 2), to get C_ca and E_ca. Note that we do not compute co-attention between the query and the entities corresponding to the query subject, because the query subject is already part of the query."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-90",
"text": "To keep the dimensionality consistent, we apply a single-layer MLP with tanh activation to increase the dimension of the query subject entities to 2h."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-91",
"text": "Self-attentive pooling: while co-attention yields a query-aware contextual representation of documents, self-attentive pooling is designed to convert the sequential contextual representation into a fixed-dimensional, non-sequential feature vector by selecting important query-aware information (Zhong et al., 2019)."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-92",
"text": "Self-attentive pooling summarizes the information present in the co-attention output by calculating a score for each word in the sequence."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-93",
"text": "The scores are normalized, and weighted-sum pooling is applied to the sequence to obtain a single feature vector summarizing the input sequence."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-94",
"text": "Formally, the self-attention module can be formulated as the following operations, given S_ca as input:"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-95",
"text": "where MLP(\u00b7) is a two-layer MLP with tanh activation."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-96",
"text": "Similarly, after self-attentive pooling, we can get c_sa and e_sa for each candidate and entity."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-97",
"text": "Our context encoding module differs from the one used in Zhong et al. (2019) in the following aspects: 1) we compute the co-attention between the query and candidates, which is not present in the CFC model."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-98",
"text": "2) For entity word sequences, we first calculate co-attention with the query and then use self-attention to summarize each entity word sequence, while Zhong et al. (2019) first apply self-attention to entity word sequences to get a sequence of entity vectors in each document."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-99",
"text": "Then, they apply co-attention with the query."
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "5095f2af3f0c51283c8fbee08a17ac-C001-101",
"text": "**REASONING OVER HDE GRAPH**"
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5095f2af3f0c51283c8fbee08a17ac-C001-26"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-75"
]
],
"cite_sentences": [
"5095f2af3f0c51283c8fbee08a17ac-C001-26",
"5095f2af3f0c51283c8fbee08a17ac-C001-75"
]
},
"@USE@": {
"gold_contexts": [
[
"5095f2af3f0c51283c8fbee08a17ac-C001-33"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-40",
"5095f2af3f0c51283c8fbee08a17ac-C001-41"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-43"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-71"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-77"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-88"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-91"
]
],
"cite_sentences": [
"5095f2af3f0c51283c8fbee08a17ac-C001-33",
"5095f2af3f0c51283c8fbee08a17ac-C001-40",
"5095f2af3f0c51283c8fbee08a17ac-C001-43",
"5095f2af3f0c51283c8fbee08a17ac-C001-71",
"5095f2af3f0c51283c8fbee08a17ac-C001-77",
"5095f2af3f0c51283c8fbee08a17ac-C001-88",
"5095f2af3f0c51283c8fbee08a17ac-C001-91"
]
},
"@DIF@": {
"gold_contexts": [
[
"5095f2af3f0c51283c8fbee08a17ac-C001-37"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-88"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-97"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-98"
],
[
"5095f2af3f0c51283c8fbee08a17ac-C001-179"
]
],
"cite_sentences": [
"5095f2af3f0c51283c8fbee08a17ac-C001-37",
"5095f2af3f0c51283c8fbee08a17ac-C001-88",
"5095f2af3f0c51283c8fbee08a17ac-C001-97",
"5095f2af3f0c51283c8fbee08a17ac-C001-98",
"5095f2af3f0c51283c8fbee08a17ac-C001-179"
]
}
}
},
"ABC_7a2f56cb4bbcd09ba35934ca76c9a9_7": {
"x": [
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-87",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-2",
"text": "Answering questions while reasoning over multiple supporting facts has long been a goal of artificial intelligence."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-3",
"text": "Recently, remarkable advances have been made, focusing on reasoning over natural language-based stories."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-4",
"text": "In particular, end-to-end memory networks (N2Ns) have achieved state-of-the-art results on such tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-5",
"text": "However, N2Ns are limited by the necessity to choose between two weight tying schemes, neither of which performs consistently well over all tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-6",
"text": "We propose a unified model generalising weight tying and, in doing so, make the model more expressive."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-7",
"text": "The proposed model achieves uniformly high performance, improving on the best results for memory network-based models on the bAbI dataset, and competitive results on Dialog bAbI."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-10",
"text": "Deep neural network models have demonstrated strong performance on a number of challenging tasks, such as image classification (He et al., 2016), speech recognition (Graves et al., 2013), and various natural language processing tasks (Kim, 2014; Xiong et al., 2016)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-11",
"text": "Recently, the augmentation of neural networks with external memory components has been shown to be a powerful means of capturing context of different types (Graves et al., 2014; Rae et al., 2016)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-12",
"text": "Of particular interest to this work is the work by Sukhbaatar et al. (2015) on end-to-end memory networks (N2Ns), which exhibit remarkable reasoning capabilities, e.g. for reasoning and goal-oriented dialogue tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-13",
"text": "Typically, such tasks consist of three key components: a sequence of supporting facts (the story), a question, and its answer."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-14",
"text": "An example task is given in Figure 1 ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-15",
"text": "Given the first two as input, it is the model's job to reason over the supporting facts and predict the answer to the question."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-16",
"text": "One drawback of N2Ns is the problem of choosing between two types of weight tying (adjacent and layer-wise; see Section 2 for a technical description)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-17",
"text": "While N2Ns generally work well with either weight tying approach, as reported in Sukhbaatar et al. (2015) , the performance is uneven on some difficult tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-18",
"text": "That is, for some tasks, one weight tying approach attains near-perfect accuracy and the other performs poorly, but for other tasks, this trend is reversed."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-135",
"text": "Following prior work, we replace the final prediction step in Equation (4) with:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-19",
"text": "In this paper, focusing on improving N2N, we propose a unified model, UN2N, capable of dynamically determining the appropriate type of weight tying for a given task."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-20",
"text": "This is realised through the use of a gating vector, inspired by Liu and Perez (2017) ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-21",
"text": "Our method achieves the best performance for a memory network-based model on the bAbI dataset, superior to both adjacent and layer-wise weight tying, and competitive results on Dialog bAbI."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-22",
"text": "The paper is organised as follows: after we review N2N and related reasoning models in Section 2, we describe our motivation and detail the elements of our proposed model in Section 3."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-23",
"text": "Sections 4 and 5 present the experimental results on the bAbI and Dialog bAbI datasets, with analyses in Section 6."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-24",
"text": "Lastly, Section 7 concludes the paper."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-26",
"text": "**RELATED WORK**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-27",
"text": "End-to-End Memory Networks: Building on top of memory networks, Sukhbaatar et al. (2015) proposed N2Ns, removing the memory position supervision and making the model trainable in an end-to-end fashion through the advent of supporting memories and a memory access controller."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-28",
"text": "Representations of the context sentences x_1, ..., x_n in the story are encoded using two sets of embedding matrices A and C (both of size d \u00d7 |V|, where d is the embedding size and |V| the vocabulary size), and stored in the input and output memory cells m_1, ..., m_n and c_1, ..., c_n, each of which is obtained via"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-29",
"text": "where \u03a6(\u00b7) is a function that maps the input into a bag-of-words vector of dimension |V|."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-30",
"text": "The input question q is encoded with another embedding matrix B \u2208 R^{d\u00d7|V|} such that u = B\u03a6(q)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-31",
"text": "N2N utilises the question embedding u and the input memory representations m i to measure the relevance between the question and each supporting context sentence, resulting in a vector of attention weights:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-32",
"text": "where softmax(a_i) = e^{a_i} / \u03a3_j e^{a_j}."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-33",
"text": "Once the attention weights have been computed, the memory access controller receives the response o in the form of a weighted sum over the output memory representations:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-34",
"text": "To enhance the model's ability to cope with more challenging tasks requiring multiple supporting facts from the memory, Sukhbaatar et al. (2015) further extended the model by stacking multiple memory layers (also known as \"hops\"), in which case the output of the k-th hop is taken as input to the (k+1)-th hop:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-35",
"text": "Lastly, N2N predicts the answer to question q using a softmax function, where \u0177 is the predicted answer distribution, W \u2208 R^{|V|\u00d7d} is a parameter matrix for the model to learn (note that in the context of bAbI tasks, answers are single words), and K is the total number of hops."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-37",
"text": "**CURRENT ISSUES AND MOTIVATION:**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-38",
"text": "In Sukhbaatar et al. (2015) , two types of weight tying were explored for N2N, namely adjacent (\"ADJ\") and layer-wise (\"LW\")."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-39",
"text": "With LW, the input and output embedding matrices are shared across different hops (i.e., A_1 = A_2 = ... = A_K and C_1 = C_2 = ... = C_K), resembling RNNs."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-160",
"text": "Performance on task 6."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-40",
"text": "With ADJ, on the other hand, not only is the output embedding for a given layer shared with the corresponding input embedding (i.e., A_{k+1} = C_k), but the answer prediction matrix W and question embedding matrix B are also constrained such that W^\u22a4 = C_K and B = A_1."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-41",
"text": "While both ADJ and LW work well, achieving comparable overall performance in terms of mean error over the 20 bAbI tasks, their performance on a subset of the tasks (i.e., tasks 3, 16, 17 and 19, as shown in Table 1 ) is inconsistent, with one performing very well, and the other performing poorly."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-42",
"text": "Based on this observation, we propose a unified weight tying mechanism exploiting the benefits of both ADJ and LW, and capable of dynamically determining the best weight tying approach for a given task."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-43",
"text": "Table 1: Accuracy (%) reported in Sukhbaatar et al. (2015) on a selected subset of the 20 bAbI 10k tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-44",
"text": "Note that performance in the LW column is obtained with a larger embedding size d = 100 and ReLU non-linearity applied to the internal state after each hop."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-45",
"text": "Related reasoning models: Gated End-to-End Memory Networks (GN2Ns) (Liu and Perez, 2017) are a variant of N2N with a simple yet effective gating mechanism on the connections between hops, allowing the model to dynamically regulate the information flow between the controller and the memory."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-46",
"text": "Dynamic Memory Networks (DMNs) and its improved version (DMN+) employ RNNs to sequentially process contextual information stored in the memory."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-47",
"text": "All these models have been shown to have competent reasoning capabilities over the bAbI dataset ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-49",
"text": "**PROPOSED MODEL**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-51",
"text": "**1**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-52",
"text": "The key idea in this work is to let the model determine which type of weight tying mechanism it should rely on."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-53",
"text": "Recall that there are two types:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-54",
"text": "ADJ and LW."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-55",
"text": "With ADJ, the input embedding A_k of the k-th hop is constrained to share the same parameters with the output embedding C_{k-1} of the (k-1)-th hop (i.e., A_k = C_{k-1})."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-56",
"text": "In contrast, with LW, the same input/output embedding matrices are shared across different hops (i.e., A_1 = ... = A_K and C_1 = ... = C_K)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-57",
"text": "To this end, we design a dynamic mechanism, allowing the model to decide on the preferred type of weight tying based on the input."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-58",
"text": "Specifically, the key element in UN2N is that embedding matrices are constructed dynamically for each instance."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-59",
"text": "This is in contrast to N2N and GN2N where the same embedding matrices are used for every input."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-60",
"text": "UN2N, utilising a gating vector z (described in Equation (8)), constructs the embedding matrices (i.e., A_k, C_k, B and W) on the fly, influenced by the information carried by z regarding the input question u_0 as well as the context sentences in the story m_t:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-61",
"text": "A_{k+1} = A_k \u2218 z + C_k \u2218 (1 - z) (5) and C_{k+1} = C_k \u2218 z + C\u0303_{k+1} \u2218 (1 - z) (6), where \u2218 is the column element-wise multiplication operation, and C\u0303_{k+1} is the unconstrained embedding matrix."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-62",
"text": "We further define A_1 = \u00c3_1 and C_1 = C\u0303_1, where \u00c3_1 and C\u0303_1 are the unconstrained embedding matrices for hop 1."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-63",
"text": "As shown in Equation (5), the input embedding matrix A_{k+1} for the (k+1)-th hop is composed of a weighted sum of the input and output embedding matrices A_k and C_k for the k-th hop, resembling LW and ADJ, respectively."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-64",
"text": "The summation is weighted by the gating vector z; C_{k+1} is constructed in a similar fashion."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-65",
"text": "Ultimately, the larger the values of the elements of z, the more UN2N leans towards LW."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-66",
"text": "Conversely, smaller z values indicate an inclination for ADJ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-67",
"text": "Another key to this approach is the gating vector z, which is formulated to incorporate knowledge informative to the choice of the weight tying scheme."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-68",
"text": "Here, we take inspiration from gated end-to-end memory networks (\"GN2N\"; Liu and Perez, 2017), where a gating mechanism is learned in an end-to-end fashion to regulate the information flow between memory hops."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-69",
"text": "In GN2N, the gate value T_k(u_k) depends only on the input to the k-th hop, u_k, whereas in this work, we further condition the determination of the weight tying approach on the story (or memory)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-70",
"text": "Concretely, similarly to DMN and DMN+, we encode the story by first reading the memory one step at a time with a GRU:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-71",
"text": "where t is the current time step and m_t is the context sentence in the story at time t. Then, the last hidden state h_T of the GRU is taken to be the representation of the story."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-72",
"text": "Next, z is defined to be a vector of dimension d:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-73",
"text": "where W_z is a weight matrix, b_z a bias term, and \u03c3 the sigmoid function."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-74",
"text": "Here d is the dimension of u_0 and h_T."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-75",
"text": "Essentially, the gating vector z is now dependent not only on the question u_0, but also on the context sentences in the memory, encoded in h_T."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-76",
"text": "Note that the gating vector z could be replaced by a gating scalar, but we choose to use a vector for more fine-grained control, as in LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-77",
"text": "To simplify the model, we constrain B and W^\u22a4 to share the same parameters as A_1 and C_K."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-78",
"text": "Moreover, following Sukhbaatar et al. (2015), we add a linear mapping H \u2208 R^{d\u00d7d} to the update connection between memory hops, but in our case, down-weight it by 1 - z, resulting in:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-79",
"text": "Regularisation: In order to prevent the input and output embedding matrices A_k and C_k from being dominated by the unconstrained embedding matrices, it is necessary to restrain the magnitude of the values in \u00c3_1 and C\u0303_k."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-80",
"text": "Therefore, in addition to the cross entropy loss over N training instances:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-81",
"text": "where y and \u0177 are the true and predicted answers, we enforce a regularisation penalty and formulate the new objective function as:"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-82",
"text": "Model implementation: we implement UN2N with TensorFlow (Abadi et al., 2015); the code is available at https://github.com/liufly/umemn2n."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-84",
"text": "**QA BABI EXPERIMENTS**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-85",
"text": "In this section, the experimental setup is detailed, followed by the results."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-88",
"text": "Dataset: We evaluate the proposed model over the bAbI dataset (v1.2), featuring 20 different natural-language-based reasoning tasks in the form of: (1) a list of supporting statements (x_1, ..., x_n); (2) a question (q); and (3) the answer (y, typically a single word or short phrase)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-89",
"text": "Each task in bAbI is synthetically generated with a distinct emphasis on a specific type of reasoning."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-90",
"text": "In order to predict the desired answer, the model is required to locate (or focus on) the relevant context sentences among irrelevant distractors in the memory."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-91",
"text": "As with N2N, our model can be trained in a fully end-to-end fashion and requires only the answers themselves as the supervision signal."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-92",
"text": "The dataset comes in two sizes, with [Table 3: Accuracy (%) on the 20 bAbI 10k tasks for our proposed method (UN2N) and various benchmark methods.]"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-93",
"text": "Bold indicates the best result for a given task."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-94",
"text": "either 1k or 10k training instances per task."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-95",
"text": "In this work, we focus exclusively on the 10k version."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-97",
"text": "Training Details: Following Sukhbaatar et al. (2015), we hold out 10% of the bAbI training set to form a development set."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-98",
"text": "Position encoding and temporal encoding (with 10% random noise) are also incorporated into the model."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-99",
"text": "Training is performed over 100 epochs with a batch size of 32 using the Adam optimiser (Kingma and Ba, 2015) with a learning rate of 0.005."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-100",
"text": "Following Sukhbaatar et al. (2015) , linear start is employed in all our experiments for the first 20 epochs."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-101",
"text": "All weight parameters are initialised from a Gaussian distribution with zero mean and \u03c3 = 0.1."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-102",
"text": "Gradients with an \u21132 norm greater than 40 are divided by a scalar to have norm 40."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-103",
"text": "Also following Sukhbaatar et al. (2015) , we use only the most recent 50 sentences as the memory and set the number of memory hops to 3, the embedding size to 20, and to"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-105",
"text": "**0.001.**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-106",
"text": "Consistent with other published results over bAbI (Sukhbaatar et al., 2015; Seo et al., 2017) , we repeat training 30 times for each task, and select the model which performs best on the development set."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-108",
"text": "**RESULTS**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-109",
"text": "The results on the 20 bAbI QA tasks are presented in Table 3 ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-110",
"text": "We benchmark against other memory network based models: (1) N2N with ADJ and LW (Sukhbaatar et al., 2015) ; (2) DMN (Kumar et al., 2016) and its improved version DMN+ (Xiong et al., 2016) ; and (3) GN2N (Liu and Perez, 2017) ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-111",
"text": "Major improvements on the difficult tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-112",
"text": "The most noticeable performance gains are over tasks 16, 17 and 19 where, compared with the vanilla N2N, UN2N achieves much better results than the worst of ADJ and LW, surpassing both ADJ and LW in the case of tasks 17 and 18."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-113",
"text": "This confirms the validity of the model."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-114",
"text": "UN2N maintains equally competitive performance on the other tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-115",
"text": "Our unified weight tying scheme does not degrade performance on the less challenging tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-116",
"text": "Best performing memory network on bAbI 10k."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-117",
"text": "UN2N achieves the best combined results over bAbI 10k, superior to the previous top model DMN+ where two GRUs are employed to process the memory at sentence and hop level; there is a particularly big improvement on task 16 (UN2N = 99.9 vs. DMN+ = 54.7)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-118",
"text": "It is worth noting that UN2N achieves this with a much smaller embedding size d = 20 compared to 80 in DMN+."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-119",
"text": "Comparison with the state-of-the-art."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-121",
"text": "**DIALOG BABI EXPERIMENTS**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-122",
"text": "In addition to the experiments on the natural language-based QA bAbI dataset in Section 4, we conduct further experiments on a goal-oriented, dialog-based dataset: Dialog bAbI (Bordes and Weston, 2016)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-124",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-125",
"text": "Dataset: In this work, we employ a collection of goal-oriented dialog tasks, Dialog bAbI, all in a restaurant reservation scenario, developed by , consisting of 6 categories each with a specific focus on tasking on aspect of an end-to-end dialog system: 1. issuing API calls, 2. updating API calls, 3. displaying options, 4. providing extra-information, 5. conducting full dialogs (the aggregation of the first 4 tasks), 6."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-126",
"text": "Dialog State Tracking Challenge 2 corpus (DSTC-2)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-127",
"text": "Task 1-5 are generated synthetically in the form of conversation between a user and a bot with entities drawn from a knowledge base with facts defining restaurants and their associated properties (e.g., location and price range, 7 properties in total)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-128",
"text": "Starting with a request from the user, a dialog proceeds with subsequent and alternating user-bot utterances."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-129",
"text": "The bot (or system) needs to figure out the user intention and answer (or react) accordingly."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-130",
"text": "A separate collection of test sets, with entities not occurring in the training set, have also been developed to evaluate the ability of the bot to deal with out-of-vocabulary (OOV) items."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-131",
"text": "Task 6 is based on and derived from the second Dialog State Tracking Challenge (Henderson et al., 2014) with real human-bot conversations."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-132",
"text": "Training Details: Following the works of (Bordes and and (Liu and Perez, 2017) , we frame the task in the same fashion: at the t-th time step, the preceding sequence of utterances, c u 1 , c r 1 , c u 2 , c r 2 , . . . , c u t 1 , c r t 1 (alternating between the user request, denoted c u i and the system response, denoted c r i ), is stored in the memory as m i and c i ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-133",
"text": "Taking the memory as contextual evidence, the goal of the model is to offers an answer c r t (the bot utterance at time t) to the question c u t (the user utterance at time t)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-134",
"text": "It is important to notice that the answers in this dataset may no longer be a single but can be comprised of multiple ones."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-136",
"text": "where W 0 2 R d\u21e5|V | is the weight parameter matrix for the model to learn, u = o K +u K (K is the total number of hops), y i is the i th response in the candidate set C such that y i 2 C, |C| the size of the candidate set, and (\u00b7) a function which maps the input text into a bag of dimension |V |."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-137",
"text": "Additionally, we also append several key features to , following and (Liu and Perez, 2017) ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-138",
"text": "First, we mark the identity of the speaker of a given utterance (either user or bot)."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-139",
"text": "Second, we extend by 7 additional features, one for each of the 7 properties associated with a restaurant."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-140",
"text": "Each of these 7 features indicates whether there are any exact matches between words in the candidate and those in the question or memory."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-141",
"text": "We refer to these 7 features as the match features."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-142",
"text": "In terms of the training procedure, experiments are carried out with the same configuration as described in Section 4.1."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-143",
"text": "As a large variance can be observed due to how sensitive memory-based models are to parameter initialisation, following (Sukhbaatar et al., 2015) and (Liu and Perez, 2017) , we repeat each training 10 times using the Table 4 : Per-response accuracy on the Dialog bAbI tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-144",
"text": "N2N: ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-145",
"text": "GN2N: (Liu and Perez, 2017) ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-146",
"text": "+match suggests the use of the match features in Section 5.1."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-147",
"text": "Bold indicates the best result in each group (with or without the match features) for a given task."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-148",
"text": "same hyper-parameters and choose the best system based on validation performance."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-150",
"text": "**RESULTS**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-151",
"text": "The results on the Dialog bAbI tasks are shown in Table 4 ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-152",
"text": "In terms of baselines, we benchmark against other memory network-based models: 4 (1) N2N (Sukhbaatar et al., 2015) ; and (2) GN2N (Sukhbaatar et al., 2015) ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-153",
"text": "While the results of GN2N is achieved with ADJ, the type of weight tying for N2N is not reported in ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-154",
"text": "Improvements on task 5."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-155",
"text": "It can be observed that UN2N offers consistent performance boost on task 5 across all experiments settings, especially in the non-OOV group."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-156",
"text": "Given that task 5 is the aggregation of the first 4 tasks, the performance increase suggests that the hybrid weight-tying mechanism in UN2N is better capable of coping with tasks of various nature."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-157",
"text": "Equally competitive performance on task 1-4."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-158",
"text": "UN2N achieves comparable, if not slightly better in some cases, performance on task 1-4."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-159",
"text": "4 Seo et al. (2017) report a higher accuracy on Dialog bAbI. However, their model, based on RNNs, is rather different from memory networks and therefore deemed not immediately comparable and unrelated to the goal of this work: improving memory network-based models."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-161",
"text": "Compared to N2N, UN2N improves the performance consistently with or without the match features."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-162",
"text": "In contrast to GN2N, however, this is not the case."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-163",
"text": "The cause for this performance gap requires further investigation and we leave this exercise for future work."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-165",
"text": "**ANALYSIS**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-166",
"text": "To gain a better understanding of what the model has learned, we visualise the gating vectors z trained on the difficult bAbI tasks (i.e., 3, 16, 17 and 19) , in the form of a 2-d PCA scatter plot in Figure 3 ."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-167",
"text": "Four distinct clusters, representing the 4 different tasks, are easily identifiable."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-168",
"text": "Moreover, it can be observed that tasks 3, 17 and 19 are rather close, reflecting the fact that LW performs better than ADJ over these three tasks."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-169",
"text": "That is, Figure 3 is further evidence that our model dynamically learns the best weight tying method for a given task."
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-171",
"text": "**CONCLUSION**"
},
{
"sent_id": "7a2f56cb4bbcd09ba35934ca76c9a9-C001-172",
"text": "In this paper, we have presented UN2N, a model based on N2N with a unified weight tying scheme and demonstrated the effectiveness of the proposed method on a set of natural-language-based reasoning and dialog tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-12"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-27"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-34"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-38"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-43"
]
],
"cite_sentences": [
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-12",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-27",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-34",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-38",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-43"
]
},
"@MOT@": {
"gold_contexts": [
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-17",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-18"
]
],
"cite_sentences": [
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-17"
]
},
"@USE@": {
"gold_contexts": [
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-78"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-97"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-100"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-103"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-106"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-110"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-143"
],
[
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-152"
]
],
"cite_sentences": [
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-78",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-97",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-100",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-103",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-106",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-110",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-143",
"7a2f56cb4bbcd09ba35934ca76c9a9-C001-152"
]
}
}
},
"ABC_a301586ed006905275ab42c5e40d88_7": {
"x": [
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-2",
"text": "A sparse representation is known to be an effective means to encode precise lexical cues in information retrieval tasks by associating each dimension with a unique ngram-based feature."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-3",
"text": "However, it has often relied on term frequency (such as tf-idf and BM25) or hand-engineered features that are coarse-grained (document-level) and often task-specific, hence not easily generalizable and not appropriate for finegrained (word or phrase-level) retrieval."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-4",
"text": "In this work, we propose an effective method for learning a highly contextualized, word-level sparse representation by utilizing rectified self-attention weights on the neighboring n-grams."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-5",
"text": "We kernelize the inner product space during training for memory efficiency without the explicit mapping of the large sparse vectors."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-6",
"text": "We particularly focus on the application of our model to phrase retrieval problem, which has recently shown to be a promising direction for open-domain question answering (QA) and requires lexically sensitive phrase encoding."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-7",
"text": "We demonstrate the effectiveness of the learned sparse representations by not only drastically improving the phrase retrieval accuracy (by more than 4%), but also outperforming all other (pipeline-based) open-domain QA methods with up to x97 faster inference in SQUAD OPEN and CURATEDTREC."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-8",
"text": "* Most work done during internship with Clova AI."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-11",
"text": "Retrieving text from a large collection of documents is an important problem in several natural language tasks such as question answering (QA) and information retrieval."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-12",
"text": "In the literature, sparse representations have been successfully used in encoding text at sentence or document level, capturing precise lexical information that can be sparsely activated by n-gram based features."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-13",
"text": "For instance, frequency-based sparse representations such as tf-idf map each text segment to a vocabulary space where the weight of each dimension is determined by the associated word's term and inverse document frequency."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-14",
"text": "However, these sparse representations are mainly coarse-grained, task-specific, and not suitable for word-level representations since they statically assign the identical weight to each n-gram and do not change dynamically depending on the context."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-15",
"text": "In this paper, we introduce an effective method for learning a word-level sparse representation that encodes precise lexical information."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-16",
"text": "Our model, COSPR, learns a contextualized sparse phrase representation leveraging rectified self-attention weights on the neighboring n-grams and dynamically encoding important lexical information of each phrase (such as named entities) given its context."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-17",
"text": "Consequently, in contrast to previous sparse regularization techniques on dense embedding of at most few thousand dimensions (Faruqui et al., 2015; Subramanian et al., 2018) , our method is able to produce more interpretable representations with billion-scale cardinality."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-18",
"text": "Such large sparse vector space is prohibitive for explicit mapping; we leverage the fact that our sparse representations only interact through inner product and kernelize the inner-product space for memory-efficient training, inspired by the kernel method in SVMs (Cortes & Vapnik, 1995) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-19",
"text": "This allows us to handle extremely large sparse vectors without worrying about computational bottlenecks."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-20",
"text": "The overview of our model is illustrated in Figure 1 ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-21",
"text": "Figure 1 : Overview of our model."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-22",
"text": "Using the billions of precomputed phrase representations, we perform a maximum inner product search between the phrase vectors and an input question vector."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-23",
"text": "We propose to learn contextualized sparse phrase representations which are also very interpretable."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-24",
"text": "We demonstrate the effectiveness of our model in open-domain question answering (QA), the task of retrieving answer phrases given a web-scale collection of documents."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-25",
"text": "Following Seo et al. (2019) , we concatenate both sparse and dense vectors to encode every phrase in Wikipedia and use maximum similarity search to find the closest candidate phrase to answer each question."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-26",
"text": "We only substitute (or augment) the baseline sparse encoding which is entirely based on frequency-based embedding (tf-idf) with our contextualized sparse representation (COSPR)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-27",
"text": "Our empirical results demonstrate its state-of-the-art performance in open-domain QA datasets, SQUAD OPEN and CURATEDTREC."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-28",
"text": "Notably, our method significantly outperforms DENSPI (Seo et al., 2019) , the previous end-toend QA model, by more than 4% with negligible drop in inference speed."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-29",
"text": "Moreover, our method achieves up to 2% better accuracy and x97 speedup in inference compared to pipeline (retrievalbased) approaches."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-30",
"text": "Our analysis particularly shows that fine-grained sparse representation is crucial for doing well in phrase retrieval task."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-31",
"text": "In summary, the contributions of our paper are:"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-32",
"text": "1. we show that learning sparse representations for embedding lexically important context words of a phrase can be achieved by contextualized sparse representations, 2. we introduce an efficient training strategy that leverages the kernelization of the sparse inner-product space, and 3. we achieve the state-of-the-art performance in two open-domain QA datasets with up to x97 faster inference time."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-34",
"text": "**RELATED WORK**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-36",
"text": "**OPEN-DOMAIN QUESTION ANSWERING**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-37",
"text": "Most open-domain QA models for unstructured texts use a retriever to find documents to read, and then apply a reading comprehension (RC) model to find answers (Chen et al., 2017; Wang et al., 2018a; Lin et al., 2018; Das et al., 2019; Yang et al., 2019; Wang et al., 2019) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-38",
"text": "To improve the performance of the open-domain question answering, various modifications have been studied to the pipelined models which include improving the retriever-reader interaction (Wang et al., 2018a; Das et al., 2019) , re-ranking paragraphs and/or answers (Wang et al., 2018b; Lin et al., 2018; Lee et al., 2018; Kratzwald et al., 2019) , learning end-to-end models with weak supervision , or simply making a better retriever and a reader model (Yang et al., 2019; Wang et al., 2019) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-39",
"text": "Due to the pipeline nature, however, these models inevitably suffer error propagation from the retrievers."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-40",
"text": "To mitigate this problem, Seo et al. (2019) propose to learn query-agnostic representations of phrases in Wikipedia and retrieve phrases that best answers a question."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-41",
"text": "While Seo et al. (2019) have shown that encoding both dense and sparse representations for each phrase could keep the lexically important words of a phrase to some extent, their sparse representations are based on static tf-idf vectors which have globally the same weight for each n-gram."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-42",
"text": "Phrase Representations In NLP, phrase representations can be either obtained in a similar manner as word representations (Mikolov et al., 2013) , or by learning a parametric function of word representations (Cho et al., 2014) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-43",
"text": "In extractive question answering, phrases are often referred to as spans, but most models do not consider explicitly learning phrase representations as these answer spans can be obtained by predicting only start and end positions in a paragraph (Wang & Jiang, 2017; ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-44",
"text": "Nevertheless, few studies have focused on directly learning and classifying phrase representations (Lee et al., 2017) which achieve strong performance when combined with attention mechanism."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-45",
"text": "In this work, we are interested in learning query-agnostic sparse phrase representations which enables the precomputation of re-usable phrase representations (Seo et al., 2018) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-46",
"text": "Sparse Representations One of the most basic sparse representations is the bag-of-words modeling, and recent works often emphasize the use of bag-of-words models as strong baselines for sentence classification and question answering (Joulin et al., 2017; Weissenborn et al., 2017) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-47",
"text": "tf-idf is another good example of sparse representations that is used for document retrieval, and is still widely adopted both in IR and QA community."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-48",
"text": "Our work could be seen as an attempt to build a trainable tf-idf model for phrase (n-gram) representations, which should be more fine-grained than paragraphs or documents."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-49",
"text": "There have been some attempts in IR that learn sparse representations of documents for duplicate detection (Hajishirzi et al., 2010) , inverted indexing (Zamani et al., 2018) , and more."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-50",
"text": "Unlike previous works, however, our method does not require hand-engineered features for sparse n-grams while keeping the original vocabulary space."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-51",
"text": "In NLP, there are some works on training sparse representations specifically designed for improving interpretability of word representations (Faruqui et al., 2015; Subramanian et al., 2018) , but they lose an important role of sparse represenations, which is keeping the exact lexical information, as they are trained by sparsifying dense representations of a higher (but much lower than V ) dimension."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-53",
"text": "**BACKGROUND: OPEN-DOMAIN QA THROUGH PHRASE RETRIEVAL**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-54",
"text": "Open-Domain QA We primarily focus on open-domain QA on unstructured data where the answer is a text span in the corpus."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-55",
"text": "Formally, given a set of K documents x 1 , ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-56",
"text": ". . , x K and a question q, the task is to design a model that obtains the answer\u00e2 b\u0177 a = arg max"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-57",
"text": "where F is the score model to learn and x k i:j is a phrase consisting of words from the i-th to the j-th word in the k-th document."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-58",
"text": "Typically, the number of documents (K) is in the order of millions for open-domain QA (e.g. 5 million for English Wikipedia), which makes the task computationally challenging."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-59",
"text": "Pipeline-based methods typically leverage a document retriever to reduce the number of documents to read, but they suffer from error propagation when wrong documents are retrieved and can be slow if the reader model is computationally cumbersome."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-60",
"text": "Open-domain QA with Phrase Encoding and Retrieval As an alternative, phrase-retrieval approaches (Seo et al., 2018; mitigate this problem by directly accessing all phrases in K documents by decomposing F into two functions, a = arg max"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-61",
"text": "where \u00b7 denotes inner product operation."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-62",
"text": "Unlike running a complex reading comprehension model in pipeline-based approaches, H x query-agnostically encode (all possible phrases of) each document just once, so that we just need to compute H q (which is very fast) and perform similarity search on the phrase encoding (which is also fast)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-63",
"text": "Seo et al. (2019) have shown that encoding each phrase with a concatenation of dense and sparse representations is effective, where the dense part is computed from BERT (Devlin et al., 2019) and the sparse part is obtained from the tf-idf vector of the document and the paragraph of the phrase."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-64",
"text": "We briefly describe how the dense part is obtained below."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-65",
"text": "Dense Representation Assuming that the document x has N words as x 1 , . . . , x N , Seo et al."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-66",
"text": "(2019) use BERT (Devlin et al., 2019) to compute contextualized representation of each word as h 1 , . . . , h N = BERT(x 1 , . . . , x N )."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-67",
"text": "Based on the contextualized embeddings, we obtain dense phrase representations as follows: We split each"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-68",
"text": "and d c are chosen to make d = 2d se + 2d c )."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-69",
"text": "Then each phrase x i:j is densely represented as follows:"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-70",
"text": "where \u00b7 denotes inner product operation."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-71",
"text": "h 1 i and h 2 j are start/end representations of a phrase, and the inner product of h 3 i and h 4 j is used for computing coherency of the phrase."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-72",
"text": "Refer to Seo et al. (2019) for details; we mostly reuse its architecture."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-74",
"text": "**SPARSE ENCODING OF PHRASES**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-75",
"text": "Sparse representations are often suitable for keeping the precise lexical information present in the text, complementing dense representations that are often good for encoding semantic and syntactic information."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-76",
"text": "While the sparsification of dense word embeddings has been explored previously (Faruqui et al., 2015) , its main limitations are that (1) it starts from the dense embedding which might have already lost rich lexical information, and (2) its cardinality is in the order of thousands at max, which we hypothesize is too small to encode sufficient information."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-77",
"text": "Our work hence focuses on creating the sparse representation of each phrase which is not bottlenecked by dense embedding and is capable of increasing the cardinality to billion-scale without much computational cost."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-79",
"text": "**WHY DO WE NEED TO LEARN SPARSE REPRESENTATIONS?**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-80",
"text": "Suppose we are given a question \"How many square kilometres of the Amazon forest was lost by 1991?\" and the target answer is in the following sentence, Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres (160,000 to 227,000 sq mi)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-81",
"text": "To answer the question, the model should know that the target answer (415,000) corresponds to the year 1991 while the (confusing) phrase 587,000 corresponds to the year 2000, which requires syntactic understanding of English parallelism."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-82",
"text": "The dense phrase encoding is likely to have a difficulty in precisely differentiating between 1991 and 2000 since it needs to also encode several different kinds of information."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-83",
"text": "Window-based tf-idf would not help either because the year 2000 is closer (in word distance) to 415,000."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-84",
"text": "This example illustrates the strong need to create an n-gram-based sparse encoding that is highly syntax-and context-aware."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-86",
"text": "**CONTEXTUALIZED SPARSE REPRESENTATIONS**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-87",
"text": "Our sparse model, unlike pre-computed sparse embeddings such as tf-idf, dynamically computes the weight of each n-gram that depends on the context."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-88",
"text": "Intuitively, for each phrase, we want to compute a positive weight for each n-gram near the phrase depending on how important the n-gram is to the phrase."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-89",
"text": "Sparse Representation Sparse representation of each phrase is also obtained as the concatenation of its start word's and end word's sparse embedding, i.e. s i:j = [s start i , s end j ]."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-90",
"text": "This way, similarly to how the dense phrase embedding is obtained, we can efficiently compute them without explicitly enumerating all possible phrases."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-91",
"text": "We obtain each (start and end) sparse embedding in the same way (with unshared parameters), so we just describe how we obtain the start sparse embedding here and omit the superscript 'start'."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-92",
"text": "Given the contextualized encoding of each document H = [h 1 , . . . , h N ] \u2208 R N \u00d7d , we obtain its (start or end) sparse encoding S = [s 1 , . . . , s N ] \u2208 R N \u00d7F by"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-93",
"text": "where Q, K \u2208 R N \u00d7d are query, key matrices obtained by applying a (different) linear transformation on H (i.e., using W Q , W K : R N \u00d7d \u2192 R N \u00d7d ), and F \u2208 R N \u00d7F is an one-hot n-gram feature representation of the input document x. That is, for instance, if we want to encode unigram (1-gram) features, F i will be simply a one-hot representation of the word x i , and F will be equivalent to the vocabulary size."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-94",
"text": "Note that F will be very large, so it should always exists as an efficient sparse matrix format (e.g. csc) and one should not explicitly create its dense form."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-95",
"text": "We also discuss how we can leverage it during training in an efficient manner in Section 4.3."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-96",
"text": "Note that Equation 2 is similar to the scaled dot-product self-attention (Vaswani et al., 2017) , with two key differences that (1) ReLU instead of softmax is used, and (2) a sparse matrix F instead of (dense) value matrix is used."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-97",
"text": "In fact, our sparse embedding is related to attention mechanism in that we want to compute how important each n-gram is for each phrase's start and end word."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-98",
"text": "Intuitively, s i contains a weighted bag-of-ngram representation where each n-gram is weighted by its relative importance on each start or end word of a phrase."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-99",
"text": "Unlike attention mechanism whose role is to summarize the target vectors via weighted summation (thus softmax is used, which sums up to 1), we do not perform the summation and directly output the unnormalized attention weights to a large sparse embedding space."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-100",
"text": "In fact, we experimentally observe that ReLU is more effective than softmax for this objective."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-101",
"text": "Since we want to handle several different sizes of n-grams, we create the sparse encoding S for each n-gram and concatenate the resulting sparse encodings."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-102",
"text": "That is, let S n be the sparse encoding for n-gram, then the final sparse encoding is the concatenation of different n-grams we consider."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-103",
"text": "In practice, we experimentally find that unigram and bigram are sufficient for most use cases; in this case, the sparse vector for the (start) word x i will be s i = [s 1 i , s 2 i ]."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-104",
"text": "We do not share linear transformation parameters for across different n-grams."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-105",
"text": "Note that this is also analogous to multiple attention heads in Vaswani et al. (2017) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-106",
"text": "Question Encoding For a question q = [CLS], . . . , q M where [CLS] denotes a special token for BERT inputs, contextualized question representations are computed in a similar way (h [CLS] , . . ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-107",
"text": "h M = BERT([CLS], . . . , q M ))."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-108",
"text": "We share the same BERT used for the phrase encoding."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-109",
"text": "We compute sparse encodings on the question side (s \u2208 R F ) in a similar way to the document side, with the only difference that we use the [CLS] token instead of start and end words to represent the entire question."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-110",
"text": "Linear transformation weights are not shared."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-112",
"text": "**TRAINING**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-114",
"text": "**KERNEL FUNCTION**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-134",
"text": "Baudi\u0161 &\u0160ediv\u1ef3 (2015) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-135",
"text": "The queries mostly come from search engine logs generated by real users."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-136",
"text": "We use 694 test set QA pairs for testing our model."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-137",
"text": "Note that we only train on SQuAD and test on both SQuAD and CuratedTREC, relying on the generalization ability of our model for zero-shot inference on Cu-ratedTREC."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-138",
"text": "This is in a clear contrast to previous work that utilize distant supervision (Chen et al., 2017) or weak supervision (Lee et al., 2018; Min et al., 2019) on CuratedTREC."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-140",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-142",
"text": "**IMPLEMENTATION DETAILS**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-143",
"text": "We use and finetune BERT LARGE for our encoders."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-144",
"text": "We use BERT vocabulary which has 30522 unique tokens based on byte pair encodings."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-145",
"text": "As a result, we have F = 30522 when using unigram feature for F, and F \u2248 1B when using both uni/bigram features."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-146",
"text": "We do not finetune the word embedding during training."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-147",
"text": "We pre-compute and store all encoded phrase representations of all documents in Wikipedia (more than 5 million documents)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-148",
"text": "It takes 600 GPU hours to index all phrases in Wikipedia."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-149",
"text": "Each phrase representation has 2d se + 2F + 1 dimensions."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-150",
"text": "We use the same storage reduction and search techniques by Seo et al. (2019) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-151",
"text": "For storage, the total size of the index is 1.3 TB including unigram and bigram sparse representations."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-152",
"text": "For search, we either perform dense search first and then rerank with sparse scores (DFS) or perform sparse search first and rerank with dense scores (SFS), and also consider a combination of both (Hybrid)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-154",
"text": "**CONSTRAINING SPARSE ENCODING**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-155",
"text": "We find that injecting some prior domain knowledge to avoid spurious matchings allows us to learn better sparse representations."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-156",
"text": "For example, in machine reading comprehension, answer phrases do not occur as an exact phrase in questions (e.g., the phrase \"August 4, 1961\" would not appear in the question \"When was Barack Obama born?\")."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-157",
"text": "Therefore, we can zero the elements in the diagonal axis of QK \u2208 R N \u00d7N in Equation 2 which correspond to the attention values of target phrase itself."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-158",
"text": "Additionally, we mask special tokens in BERT such as [SEP] or [CLS] to have zero weights as matching of these tokens means nothing."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-160",
"text": "**RESULTS**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-161",
"text": "We evaluate the effectiveness of COSPR by augmenting DENSPI (Seo et al., 2019) with contextualized sparse representations (DENSPI+COSPR)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-162",
"text": "We extensively compare the model with the original DENSPI and previous pipeline-based QA models."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-163",
"text": "Model SQUADOPEN CURATEDTREC EM F1 Exact Match s/Q DrQA (Chen et al., 2017) 29.8 ** -25.4 * 35 R 3 (Wang et al., 2018a) 29.1 37.5 28.4 * -Paragraph Ranker (Lee et al., 2018) 30.2 -35.4 * -Multi-Step-Reasoner (Das et al., 2019) 31.9 39.2 --BERTserini (Yang et al., 2019) 38.6 46.1 -115 ORQA 20.2 -30.1 -Multi-passage BERT \u2020 \u2020 (Wang et al., 2019) 53.0 60.9 --DENSPI (Seo et al., 2019) 36 (Yang et al., 2019) while being almost two orders of magnitude faster."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-164",
"text": "We expect much bigger speed gaps between ours and other pipeline methods as most of them put additional complex components to the original pipelined methods."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-165",
"text": "On CURATEDTREC, which is constructed from real user queries, our model also achieves the stateof-the-art performance."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-166",
"text": "Even though our model is only trained on SQUAD (i.e. zero-shot), it outperforms all other models which are either distant-or semi-supervised with at least 29x faster inference."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-167",
"text": "Note that our method is orthogonal to end-to-end training or weak supervision (Min et al., 2019) methods and future work can potentially benefit from these."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-169",
"text": "**ABLATION STUDY**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-170",
"text": "One interesting observation is that adding trigram features in our sparse representations is worse than using uni-/bigram representations only."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-171",
"text": "We suspect that the model becomes too dependent on trigram features, which means we might need a stronger regularization for high-order n-gram features."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-172",
"text": "In Table 2 (right), we show how we consistently improve over DENSPI when COSPR is added in different search strategies."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-173",
"text": "Note that on SQUAD OPEN , SFS is better than DFS as questions in SQUAD are created by turkers given particular paragraphs, whereas on CURATEDTREC where the questions more resemble real user queries, DFS outperforms SFS showing the effectiveness of dense search when not knowing which documents to read."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-174",
"text": "Similar observation was pointed by who showed different performance tendencies between two datasets, but as we are using both dense and sparse representations, our model can achieve state-of-the-art performances on both datasets."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-175",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-176",
"text": "**ANALYSIS**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-177",
"text": "Interpretability of Sparse Representations Sparse representations often have better interpretability than dense representations as each dimension of a sparse vector corresponds to a specific word."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-178",
"text": "We compare tf-idf vectors and COSPR (uni/bigram) by showing top weighted n-grams in each representation."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-179",
"text": "Note that the scale of weights in tf-idf vectors is normalized in open-domain setups to match the scale between tf-idf vectors and dense vectors."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-180",
"text": "We observe that tf-idf vectors usually assign high weights on infrequent (often meaningless) n-grams, while COSPR focuses on contextually important entities such as 1991 for 415,000 or california state, state university for 12."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-181",
"text": "Our sparse question representation also learns meaningful n-gram weights compared to tf-idf vectors."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-182",
"text": "Table 4 shows the outputs of three OpenQA models: DrQA (Chen et al., 2017) , DENSPI (Seo et al., 2019) , and our DENSPI+COSPR."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-183",
"text": "DENSPI+COSPR is able to retrieve various correct answers from different documents, and it often correctly answers questions with specific dates or numbers compared to DENSPI showing the effectiveness of learned sparse representations."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-185",
"text": "**PREDICTION SAMPLES**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-186",
"text": "----------------------------------"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-187",
"text": "**CONCLUSION**"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-188",
"text": "In this paper, we demonstrate the effectiveness of contextualized sparse vectors, COSPR, for encoding phrase with rich lexical information in open-domain question answering."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-189",
"text": "We efficiently train our sparse representations by kernelizing the sparse inner product space."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-190",
"text": "Experimental results show that our fast open-domain QA model that augments the previous model (DENSPI) with learned sparse representation (COSPR) outperforms previous open-domain QA models, including recent BERTbased pipeline models, with two orders of magnitude faster inference time."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-191",
"text": "Future work includes extending the contextualized sparse representations to other challenging QA settings such as multihop reasoning and building a full inverted index of the learned sparse representations (Zamani et al., 2018) for more powerful Sparse-First Search (SFS)."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-115",
"text": "As training phrase encoders on the whole Wikipedia is computationally prohibitive, we use training examples from an extractive question answering dataset (SQuAD) to train our encoders."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-116",
"text": "Given a pair of question q and a golden document x (a paragraph in the case of SQuAD), we first compute the dense logit of each phrase x i:j by l i,j = h i:j \u00b7 h ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-117",
"text": "Unlike Seo et al. (2019) , each phrase's sparse embedding is also trained, so it needs to be considered in the loss function."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-118",
"text": "We define the sparse logit for phrase x i:j as l sparse"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-119",
"text": ". For brevity, we describe how we compute the first term s start i \u00b7s start"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-120",
"text": "[CLS] corresponding to the start word (and dropping the superscript 'start'); the second term can be computed in the same way."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-121",
"text": "denote the question side query, key, and n-gram feature matrices, respectively."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-122",
"text": "We can efficiently compute it if we precompute FF \u2208 R N \u00d7M ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-123",
"text": "Note that FF can be considered as applying a kernel function, i.e. K(F, F ) = FF where its (i, j)-th entry is 1 if and only if n-gram at i-th position of the context is equivalent to j-th n-gram of the question, which can be efficiently computed as well."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-124",
"text": "One can also think of this as kernel trick (in the literature of SVM (Cortes & Vapnik, 1995) ) that allows us to compute the loss function without explicit mapping."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-125",
"text": "The loss to minimize is computed from the negative log likelihood over the sum of the dense and sparse logits:"
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-126",
"text": "where i * , j * denote the true start and end positions of the answer phrase."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-127",
"text": "While the loss above is an unbiased estimator, in practice, we adopt early loss summation as suggested by Seo et al. (2019) for larger gradient signals in early training."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-128",
"text": "Additionally, we also add dense-only loss that omits the sparse logits (i.e. original loss in Seo et al. (2019) ) to the final loss, in which case we find that we obtain higher-quality dense phrase representations."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-129",
"text": "Negative Sampling We train our model on SQuAD v1.1 which always has a positive paragraph that contains the answer."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-130",
"text": "To learn robust phrase representations, we concatenate negative paragraphs to the original SQuAD paragraphs."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-131",
"text": "To each paragraph x, we concatenate the paragraph x neg which was paired with the question whose dense representation h neg is most similar to the original dense question representation h , following Seo et al. (2019) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-132",
"text": "Note the difference, however, that we concatenate the negative example instead of considering it as an independent example with noanswer option Levy et al. (2017) ."
},
{
"sent_id": "a301586ed006905275ab42c5e40d88-C001-133",
"text": "During training, we find that adding tf-idf matching scores on the word-level logits of the negative paragraphs further improves the quality of sparse representations as our sparse models have to give stronger attentions to positively related words in this biased setting."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"a301586ed006905275ab42c5e40d88-C001-25"
],
[
"a301586ed006905275ab42c5e40d88-C001-127"
],
[
"a301586ed006905275ab42c5e40d88-C001-128"
],
[
"a301586ed006905275ab42c5e40d88-C001-131"
],
[
"a301586ed006905275ab42c5e40d88-C001-150"
],
[
"a301586ed006905275ab42c5e40d88-C001-182"
]
],
"cite_sentences": [
"a301586ed006905275ab42c5e40d88-C001-25",
"a301586ed006905275ab42c5e40d88-C001-127",
"a301586ed006905275ab42c5e40d88-C001-128",
"a301586ed006905275ab42c5e40d88-C001-131",
"a301586ed006905275ab42c5e40d88-C001-150",
"a301586ed006905275ab42c5e40d88-C001-182"
]
},
"@DIF@": {
"gold_contexts": [
[
"a301586ed006905275ab42c5e40d88-C001-28"
],
[
"a301586ed006905275ab42c5e40d88-C001-117"
]
],
"cite_sentences": [
"a301586ed006905275ab42c5e40d88-C001-28",
"a301586ed006905275ab42c5e40d88-C001-117"
]
},
"@BACK@": {
"gold_contexts": [
[
"a301586ed006905275ab42c5e40d88-C001-40"
],
[
"a301586ed006905275ab42c5e40d88-C001-72"
]
],
"cite_sentences": [
"a301586ed006905275ab42c5e40d88-C001-40",
"a301586ed006905275ab42c5e40d88-C001-72"
]
},
"@MOT@": {
"gold_contexts": [
[
"a301586ed006905275ab42c5e40d88-C001-41"
]
],
"cite_sentences": [
"a301586ed006905275ab42c5e40d88-C001-41"
]
},
"@EXT@": {
"gold_contexts": [
[
"a301586ed006905275ab42c5e40d88-C001-161"
]
],
"cite_sentences": [
"a301586ed006905275ab42c5e40d88-C001-161"
]
}
}
},
"ABC_5177188d88391f08325262dbdefabf_7": {
"x": [
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-2",
"text": "Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-3",
"text": "Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequenceto-sequence task, rather than a word-tosequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-4",
"text": "We implement this approach in a Transformerbased sequence-to-sequence model."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-5",
"text": "Our proposal allows to train contextualization and definition generation in an end-to-end fashion, which is a conceptual improvement over earlier works."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-6",
"text": "We achieve stateof-the-art results both in contextual and non-contextual definition modeling."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-9",
"text": "The task of definition modeling, introduced by Noraset et al. (2017) , consists in generating the dictionary definition of a specific word: for instance, given the word \"monotreme\" as input, the system would need to produce a definition such as \"any of an order (Monotremata) of egg-laying mammals comprising the platypuses and echidnas\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-10",
"text": "1 Following the tradition set by lexicographers, we call the word being defined a definiendum (pl. definienda), whereas a word occurring in its definition is called a definiens (pl. definientia)."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-11",
"text": "Definition modeling can prove useful in a variety of applications."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-12",
"text": "Systems trained for the task may generate dictionaries for low resource languages, or extend the coverage of existing lexicographic resources where needed, e.g. of domainspecific vocabulary."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-13",
"text": "Such systems may also be 1 Definition from Merriam-Webster."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-14",
"text": "able to provide reading help by giving definitions for words in the text."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-15",
"text": "A major intended application of definition modeling is the explication and evaluation of distributed lexical representations, also known as word embeddings (Noraset et al., 2017) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-16",
"text": "This evaluation procedure is based on the postulate that the meaning of a word, as is captured by its embedding, should be convertible into a human-readable dictionary definition."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-17",
"text": "How well the meaning is captured must impact the ability of the model to reproduce the definition, and therefore embedding architectures can be compared according to their downstream performance on definition modeling."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-18",
"text": "This intended usage motivates the requirement that definition modeling architectures take as input the embedding of the definiendum and not retrain it."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-19",
"text": "From a theoretical point of view, usage of word embeddings as representations of meaning (cf."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-20",
"text": "Lenci, 2018; Boleda, 2019 , for an overview) is motivated by the distributional hypothesis (Harris, 1954) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-21",
"text": "This framework holds that meaning can be inferred from the linguistic context of the word, usually seen as co-occurrence data."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-22",
"text": "The context of usage is even more crucial for characterizing meanings of ambiguous or polysemous words: a definition that does not take disambiguating context into account will be of limited use (Gadetsky et al., 2018) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-23",
"text": "We argue that definition modeling should preserve the link between the definiendum and its context of occurrence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-24",
"text": "The most natural approach to this task is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it (cf."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-25",
"text": "sections 3 & 4) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-26",
"text": "We implement this approach in a Transformer-based sequence-to-sequence model that achieves state-of-the-art performances (sections 5 & 6)."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-27",
"text": "arXiv:1911.05715v1 [cs.CL] 13 Nov 2019 2 Related Work"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-28",
"text": "In their seminal work on definition modeling, Noraset et al. (2017) likened systems generating definitions to language models, which can naturally be used to generate arbitrary text."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-29",
"text": "They built a sequential LSTM seeded with the embedding of the definiendum; its output at each time-step was mixed through a gating mechanism with a feature vector derived from the definiendum."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-30",
"text": "Gadetsky et al. (2018) stressed that a definiendum outside of its specific usage context is ambiguous between all of its possible definitions."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-31",
"text": "They proposed to first compute the AdaGram vector (Bartunov et al., 2016) for the definiendum, to then disambiguate it using a gating mechanism learned over contextual information, and finally to run a language model over the sequence of definientia embeddings prepended with the disambiguated definiendum embedding."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-32",
"text": "In an attempt to produce a more interpretable model, map the definiendum to a sparse vector representation."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-33",
"text": "Their architecture comprises four modules."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-34",
"text": "The first encodes the context in a sentence embedding, the second converts the definiendum into a sparse vector, the third combines the context embedding and the sparse representation, passing them on to the last module which generates the definition."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-35",
"text": "Related to these works, specifically tackle definition modeling in the context of Chinese-whereas all previous works on definition modeling studied English."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-36",
"text": "In a Transformer-based architecture, they incorporate \"sememes\" as part of the representation of the definiendum to generate definitions."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-37",
"text": "On a more abstract level, definition modeling is related to research on the analysis and evaluation of word embeddings (Levy and Goldberg, 2014a,b; Arora et al., 2018; Batchkarov et al., 2016; Swinger et al., 2018, e.g.) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-38",
"text": "It also relates to other works associating definitions and embeddings, like the \"reverse dictionary task\" (Hill et al., 2016 )-retrieving the definiendum knowing its definition, which can be argued to be the opposite of definition modeling-or works that derive embeddings from definitions (Wang et al., 2015; Tissier et al., 2017; Bosc and Vincent, 2018) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-39",
"text": "3 Definition modeling as a sequence-to-sequence task Gadetsky et al. (2018) remarked that words are often ambiguous or polysemous, and thus generating a correct definition requires that we either use sense-level representations, or that we disambiguate the word embedding of the definiendum."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-40",
"text": "The disambiguation that Gadetsky et al. (2018) proposed was based on a contextual cue-ie."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-41",
"text": "a short text fragment."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-42",
"text": "As notes, the cues in Gadetsky et al.'s (2018) dataset do not necessarily contain the definiendum or even an inflected variant thereof."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-43",
"text": "For instance, one training example disambiguated the word \"fool\" using the cue \"enough horsing around-let's get back to work!\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-44",
"text": "Though the remark that definienda must be disambiguated is pertinent, the more natural formulation of such a setup would be to disambiguate the definiendum using its actual context of occurrence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-45",
"text": "In that respect, the definiendum and the contextual cue would form a linguistically coherent sequence, and thus it would make sense to encode the context together with the definiendum, rather than to merely rectify the definiendum embedding using a contextual cue."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-46",
"text": "Therefore, definition modeling is by its nature a sequence-to-sequence task: mapping contexts of occurrence of definienda to definitions."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-47",
"text": "This remark can be linked to the distributional hypothesis (Harris, 1954) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-48",
"text": "The distributional hypothesis suggests that a word's meaning can be inferred from its context of usage; or, more succinctly, that \"you shall know a word by the company it keeps\" (Firth, 1957) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-49",
"text": "When applied to definition modeling, the hypothesis can be rephrased as follows: the correct definition of a word can only be given when knowing in what linguistic context(s) it occurs."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-50",
"text": "Though different kinds of linguistic contexts have been suggested throughout the literature, we remark here that sentential context may sometimes suffice to guess the meaning of a word that we don't know (Lazaridou et al., 2017) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-51",
"text": "Quoting from the example above, the context \"enough around-let's get back to work!\" sufficiently characterizes the meaning of the omitted verb to allow for an approximate definition for it even if the blank is not filled (Taylor, 1953; Devlin et al., 2018) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-52",
"text": "This reformulation can appear contrary to the original proposal by Noraset et al. (2017) , which conceived definition modeling as a \"word-tosequence task\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-53",
"text": "They argued for an approach related to, though distinct from sequence-to-sequence architectures."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-54",
"text": "Concretely, a specific encoding procedure was applied to the definiendum, so that it could be used as a feature vector during generation."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-55",
"text": "In the simplest case, vector encoding of the definiendum consists in looking up its vector in a vocabulary embedding matrix."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-56",
"text": "We argue that the whole context of a word's usage should be accessible to the generation algorithm rather than a single vector."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-57",
"text": "To take a more specific case of verb definitions, we observe that context explicitly represents argument structure, which is obviously useful when defining the verb."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-58",
"text": "There is no guarantee that a single embedding, even if it be contextualized, would preserve this wealth of information-that is to say, that you can cram all the information pertaining to the syntactic context into a single vector."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-59",
"text": "Despite some key differences, all of the previously proposed architectures we are aware of (Noraset et al., 2017; Gadetsky et al., 2018; followed a pattern similar to sequence-to-sequence models."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-60",
"text": "They all implicitly or explicitly used distinct submodules to encode the definiendum and to generate the definientia."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-61",
"text": "In the case of Noraset et al. (2017) , the encoding was the concatenation of the embedding of the definiendum, a vector representation of its sequence of characters derived from a characterlevel CNN, and its \"hypernym embedding\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-62",
"text": "Gadetsky et al. (2018) used a sigmoid-based gating module to tweak the definiendum embedding."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-63",
"text": "The architecture proposed by is comprised of four modules, only one of which is used as a decoder: the remaining three are meant to convert the definiendum as a sparse embedding, select some of the sparse components of its meaning based on a provided context, and encode it into a representation adequate for the decoder."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-64",
"text": "Aside from theoretical implications, there is another clear gain in considering definition modeling as a sequence-to-sequence task."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-65",
"text": "Recent advances in embedding designs have introduced contextual embeddings (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018) ; and these share the particularity that they are a \"function of the entire sentence\" (Peters et al., 2018) : in other words, vector representations are assigned to tokens rather than to word types, and moreover semantic information about a token can be distributed over other token representations."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-66",
"text": "To extend definition modeling to contextual embeddings therefore requires that we devise architectures able to encode a word in its context; in that respect sequence-to-sequence architectures are a natural choice."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-67",
"text": "A related point is that not all definienda are comprised of a single word: multi-word expressions include multiple tokens, yet receive a single definition."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-68",
"text": "Word embedding architectures generally require a pre-processing step to detect these expressions and merge them into a single token."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-69",
"text": "However, as they come with varying degrees of semantic opacity (Cordeiro et al., 2016) , a definition modeling system would benefit from directly accessing the tokens they are made up from."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-70",
"text": "Therefore, if we are to address the entirety of the language and the entirety of existing embedding architectures in future studies, reformulating definition modeling as a sequence-to-sequence task becomes a necessity."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-72",
"text": "**FORMALIZATION**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-73",
"text": "A sequence-to-sequence formulation of definition modeling can formally be seen as a mapping between contexts of occurrence of definienda and their corresponding definitions."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-74",
"text": "It moreover requires that the definiendum be formally distinguished from the remaining context: otherwise the definition could not be linked to any particular word of the contextual sequence, and thus would need to be equally valid for any word of the contextual sequence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-75",
"text": "We formalize definition modeling as mapping to sequences of definientia from sequences of pairs w 1 , i 1 , . . . , w n , i n , where w k is the k th word in the input and i k \u2208 {0, 1} indicates whether the k th token is to be defined."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-76",
"text": "As only one element of the sequence should be highlighted, we expect the set of all indicators to contain only two elements: the one, i d = 1, to mark the definiendum, the other, i c = 0, to mark the context; this entails that we encode this marking using one bit only."
},
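This pairing of tokens with indicator bits can be illustrated with a short sketch (a hypothetical helper of our own, not part of the paper's code):

```python
# Hypothetical helper illustrating the formalization above: a context is a
# sequence of pairs (w_k, i_k), where exactly one indicator bit is set to 1
# to mark the definiendum.
def make_input(tokens, definiendum_index):
    return [(w, 1 if k == definiendum_index else 0)
            for k, w in enumerate(tokens)]

# Marking "horsing" as the definiendum in its context of occurrence:
pairs = make_input(["enough", "horsing", "around"], 1)
assert sum(i for _, i in pairs) == 1  # one bit only, as required above
```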
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-77",
"text": "2 To treat definition modeling as a sequence-tosequence task, the information from each pair w k , i k has to be integrated into a single repre-sentation marked k :"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-78",
"text": "This marking function can theoretically take any form."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-79",
"text": "Considering that definition modeling uses the embedding of the definiendum w d = e(w d ), in this work we study a multiplicative and an additive mechanism, as they are conceptually the simplest form this marking can take in a vector space."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-80",
"text": "They are given schematically in Figure 1 , and formally defined as:"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-81",
"text": "The last point to take into account is where to set the marking."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-82",
"text": "Two natural choices are to set it either before or after encoded representations were obtained."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-83",
"text": "We can formalize this using either of the following equation, with E the model's encoder:"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-85",
"text": "**MULTIPLICATIVE MARKING: SELECT**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-86",
"text": "The first option we consider is to use scalar multiplication to distinguish the word to define."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-87",
"text": "In such a scenario, the marked token encoding is"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-88",
"text": "As we use bit information as indicators, this form of marking entails that only the representation of the definiendum be preserved and that all other contextual representations are set to 0 = (0, \u00b7 \u00b7 \u00b7 , 0): thus multiplicative marking amounts to selecting just the definiendum embedding and discarding other token embeddings."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-89",
"text": "The contextualized definiendum encoding bears the trace of its context, but detailed information is irreparably lost."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-90",
"text": "Hence, we refer to such an integration mechanism as a SELECT marking of the definiendum."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-91",
"text": "When to apply marking, as introduced by eq. 4, is crucial when using the multiplicative marking scheme SELECT."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-92",
"text": "Should we mark the definiendum before encoding, then only the definiendum embedding is passed into the encoder: the resulting system provides out-of-context definitions, like in Noraset et al. (2017) where the definition is not linked to the context of a word but to its definiendum only."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-93",
"text": "For context to be taken into account under the multiplicative strategy, tokens w k must be encoded and contextualized before integration with the indicator i k ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-94",
"text": "Figure 1a presents the contextual SELECT mechanism visually."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-95",
"text": "It consists in coercing the decoder to attend only to the contextualized representation for the definiendum."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-96",
"text": "To do so, we encode the full context and then select only the encoded representation of the definiendum, dropping the rest of the context, before running the decoder."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-97",
"text": "In the case of the Transformer architecture, this is equivalent to using a multiplicative marking on the encoded representations: vectors that have been zeroed out are ignored during attention and thus cannot influence the behavior of the decoder."
},
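A minimal sketch of the SELECT mechanism described above, with plain Python lists standing in for encoded vectors (our own simplified illustration, not the authors' implementation):

```python
# SELECT marking: multiply each encoded representation by its indicator bit.
# Since indicators are 0 or 1, every context vector is zeroed out and only
# the contextualized definiendum representation survives for the decoder.
def select_mark(encoded, indicators):
    return [[x * i for x in vec] for vec, i in zip(encoded, indicators)]

encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy encoder outputs
marked = select_mark(encoded, [0, 1, 0])        # definiendum at position 1
# Only position 1 is non-zero; the zeroed vectors are the ones ignored
# during attention.
```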
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-98",
"text": "This SELECT approach may seem intuitive and naturally interpretable, as it directly controls what information is passed to the decoder-we carefully select only the contextualized definiendum, thus the only remaining zone of uncertainty would be how exactly contextualization is performed."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-99",
"text": "It also seems to provide a strong and reasonable bias for training the definition generation system."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-100",
"text": "Such an approach, however, is not guaranteed to excel: forcibly omitted context could contain important information that might not be easily incorporated in the definiendum embedding."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-101",
"text": "Being simple and natural, the SELECT approach resembles architectures like that of Gadetsky et al. (2018) and : the full encoder is dedicated to altering the embedding of the definiendum on the basis of its context; in that, the encoder may be seen as a dedicated contextualization sub-module."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-103",
"text": "**ADDITIVE MARKING: ADD**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-104",
"text": "We also study an additive mechanism shown in Figure 1b (henceforth ADD)."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-105",
"text": "It concretely consists in embedding the word w k and its indicator bit i k in the same vector space and adding the corresponding vectors:"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-106",
"text": "In other words, under ADD we distinguish the definiendum by adding a vector D to the definiendum embedding, and another vector C to the remaining context token embeddings; both markers D and C are learned during training."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-107",
"text": "In our implementation, markers are added to the input of the encoder, so that the encoder has access to this information; we leave the question of whether to integrate indicators and words at other points of the encoding process, as suggested in eq. 4, to future work."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-108",
"text": "Additive marking of substantive features has its precedents."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-109",
"text": "For example, BERT embeddings (Devlin et al., 2018) are trained using two sentences at once as input; sentences are distinguished with added markers called \"segment encodings\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-110",
"text": "Tokens from the first sentence are all marked with an added vector seg A , whereas tokens from second sentences are all marked with an added vector seg B ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-111",
"text": "The main difference here is that we only mark one item with the marker D, while all others are marked with C."
},
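The ADD mechanism can likewise be sketched in a few lines. In the paper the markers D and C are learned during training; here they are fixed toy values for illustration only:

```python
# ADD marking: add a marker vector D to the definiendum embedding and a
# marker vector C to every other token embedding. D and C are placeholders
# here; in the approach described above they are learned parameters.
def add_mark(embeddings, indicators, D, C):
    return [[x + (d if i == 1 else c) for x, d, c in zip(vec, D, C)]
            for vec, i in zip(embeddings, indicators)]

embeddings = [[0.1, 0.1], [0.2, 0.2]]
marked = add_mark(embeddings, [1, 0], D=[1.0, 1.0], C=[-1.0, -1.0])
# Every position is preserved (unlike SELECT), but the definiendum is
# shifted by D and the context by C, so the encoder can tell them apart.
```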
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-112",
"text": "This ADD marking is more expressive than the SELECT architecture."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-113",
"text": "Sequence-to-sequence decoders typically employ an attention to the input source (Bahdanau et al., 2014) , which corresponds to a re-weighting of the encoded input sequence based on a similarity between the current state of the decoder (the 'query') and each member of the input sequence (the 'keys')."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-114",
"text": "This re-weighting is normalized with a softmax function, producing a probability distribution over keys."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-115",
"text": "However, both non-contextual definition modeling and the SELECT approach produce singleton encoded sequences: in such scenarios the attention mechanism assigns a single weight of 1 and thus devolves into a simple linear transformation of the value and makes the attention mechanism useless."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-116",
"text": "Using an additive marker, rather than a selective mechanism, will prevent this behavior."
},
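The degenerate behavior just described is easy to verify: a softmax over a single score always yields a weight of 1, whatever the score. A self-contained check, not tied to any particular framework:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Over a singleton sequence, the attention weights are trivially [1.0]:
# the re-weighting no longer depends on query-key similarity at all.
assert softmax([3.7]) == [1.0]
assert softmax([-100.0]) == [1.0]
```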
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-118",
"text": "**EVALUATION**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-119",
"text": "We implement several sequence to sequence models with the Transformer architecture (Vaswani et al., 2017) , building on the OpenNMT library (Klein et al., 2017) with adaptations and modifications when necessary."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-120",
"text": "3 Throughout this work, we use GloVe vectors (Pennington et al., 2014) and freeze weights of all embeddings for a fairer comparison with previous models; words not in GloVe but observed in train or validation data and missing definienda in our test sets were randomly initialized with components drawn from a normal distribution N (0, 1)."
},
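This random initialization can be sketched as follows; the 300-dimensional size matches the GloVe vectors used here, while the explicit seed is our own addition for reproducibility:

```python
import random

# Draw each component of an out-of-vocabulary embedding from N(0, 1),
# matching the initialization described above (dimension assumed to be 300).
def init_oov_vector(dim=300, seed=None):
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

vec = init_oov_vector(seed=0)
assert len(vec) == 300
```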
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-121",
"text": "We train a distinct model for each dataset."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-122",
"text": "We batch examples by 8,192, using gradient accumulation to circumvent GPU limitations."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-123",
"text": "We optimize the network using Adam with \u03b2 1 = 0.99, \u03b2 2 = 0.998, a learning rate of 2, label smoothing of 0.1, Noam exponential decay with 2000 warmup steps, and dropout rate of 0.4."
},
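The Noam decay with 2000 warmup steps can be sketched as below, assuming the usual formulation from Vaswani et al. (2017) scaled by the model dimension; the factor of 2 stands in for the reported learning rate, and the exact scaling used in OpenNMT may differ:

```python
# Noam schedule: the learning rate grows linearly for `warmup` steps, then
# decays proportionally to the inverse square root of the step number.
def noam_lr(step, d_model=300, warmup=2000, factor=2.0):
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate peaks exactly at the warmup boundary.
assert noam_lr(1000) < noam_lr(2000) > noam_lr(8000)
```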
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-124",
"text": "The parameters are initialized using Xavier."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-125",
"text": "Models were trained for up to 120,000 steps with checkpoints at each 1000 steps; we stopped training if perplexity on the validation dataset stopped improving."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-126",
"text": "We report results from checkpoints performing best on validation."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-128",
"text": "**IMPLEMENTATION OF THE NON-CONTEXTUAL DEFINITION MODELING SYSTEM**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-129",
"text": "In non-contextual definition modeling, definienda are mapped directly to definitions."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-130",
"text": "As the source corresponds only to the definiendum, we conjecture that few parameters are required for the encoder."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-131",
"text": "We use 1 layer for the encoder, 6 for the decoder, 300 dimensions per hidden representations and 6 heads for multi-head attention."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-132",
"text": "We do not share vocabularies between the encoder and the decoder: therefore output tokens can only correspond to words attested as definientia."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-133",
"text": "4 The dropout rate and warmup steps number were set using a hyperparameter search on the dataset from Noraset et al. (2017) , during which encoder and decoder vocabulary were merged for computational simplicity and models stopped after 12,000 steps."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-134",
"text": "We first fixed dropout to 0.1 and tested warmup step values between 1000 and 10,000 by increments of 1000, then focused on the most promising span (1000-4000 steps) and exhaustively tested dropout rates from 0.2 to 0.8 by increments of 0.1."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-136",
"text": "**IMPLEMENTATION OF CONTEXTUALIZED DEFINITION MODELING SYSTEMS**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-137",
"text": "To compare the effects of the two integration strategies that we discussed in section 4, we implement both the additive marking approach (ADD, cf. section 4.2) and the alternative 'encode and select' approach (SELECT, cf."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-138",
"text": "section 4.1)."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-139",
"text": "To match with the complex input source, we define encoders with 6 layers; we reemploy the set of hyperparameters previously found for the non-contextual system."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-140",
"text": "Other implementation details, initialization strategies and optimization algorithms are kept the same as described above for the non-contextual version of the model."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-141",
"text": "We stress that the two approaches we compare for contextualizing the definiendum are applicable to almost any sequence-to-sequence neural architecture with an attention mechanism to the input source."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-142",
"text": "5 Here we chose to rely on a Transformerbased architecture (Vaswani et al., 2017) , which has set the state of the art in a wide range of tasks, from language modeling (Dai et al., 2019) to machine translation (Ott et al., 2018) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-143",
"text": "It is therefore expected that the Transformer architecture will also improve performances for definition modeling, if our arguments for treating it as a sequence to sequence task are on the right track."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-145",
"text": "**DATASETS**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-146",
"text": "We train our models on three distinct datasets, which are all borrowed or adapted from previous works on definition modeling."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-147",
"text": "As a consequence, our experiments focus on the English language."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-148",
"text": "The dataset of Noraset et al. (2017) (henceforth D Nor ) maps definienda to their respective definientia, as well as additional information not used here."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-149",
"text": "In the dataset of Gadetsky et al. (2018) (henceforth D Gad ), each example consists of a definiendum, the definientia for one of its meanings and a contextual cue sentence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-150",
"text": "D Nor contains on average shorter definitions than D Gad ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-151",
"text": "Definitions in D Nor have a mean length of 6.6 and a standard deviation of 5.78, whereas those in D Gad have a mean length of 11.01 and a standard deviation of 6.96."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-152",
"text": "stress that the dataset D Gad includes many examples where the definiendum is absent from the associated cue."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-153",
"text": "About half of these cues doe not contain an exact match for the corresponding definiendum, but up to 80% contains either an exact match or an inflected form of the definiendum according to lemmatization by the NLTK toolkit (Loper and Bird, 2002) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-154",
"text": "To cope with this problematic characteristic, we converted the dataset into the word-in-context format assumed by our model by concatenating the definiendum with the cue."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-155",
"text": "To illustrate this, consider the actual input from D Gad comprised of the definiendum \"fool\" and its associated cue \"enough horsing around-let's get back to work!\": to convert this into a single sequence, we simply prepend the definiendum to the cue, which results in the sequence \"fool enough horsing around-let's get back to work!\"."
},
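This conversion amounts to the following (our own sketch of the preprocessing step, with tokenization details simplified):

```python
# Prepend the definiendum to its cue, marking position 0 as the word to be
# defined, so that D_Gad examples fit the word-in-context input format.
def convert_example(definiendum, cue_tokens):
    tokens = [definiendum] + cue_tokens
    indicators = [1] + [0] * len(cue_tokens)
    return tokens, indicators

tokens, indicators = convert_example(
    "fool", ["enough", "horsing", "around", "-", "let's",
             "get", "back", "to", "work", "!"])
assert tokens[0] == "fool" and indicators[0] == 1
```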
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-156",
"text": "Hence the input sequences of D Gad do not constitute linguistically coherent sequences, but it does guarantee that our sequenceto-sequence variants have access to the same input as previous models; therefore the inclusion of this dataset in our experiments is intended mainly for comparison with previous architectures."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-157",
"text": "We also note that this conversion procedure entails that our examples have a very regular structure: the word marked as a definiendum is always the first word in the input sequence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-158",
"text": "Our second strategy was to restrict the dataset by selecting only cues where the definiendum (or its inflected form) is present."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-159",
"text": "The curated dataset (henceforth D Ctx ) contains 78,717 training examples, 9,413 for validation and 9,812 for testing."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-160",
"text": "In each example, the first occurrence of the definiendum is annotated as such."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-161",
"text": "D Ctx thus differs from D Gad in two ways: some definitions have been removed, and the exact citation forms of the definienda are not given."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-162",
"text": "Models trained on D Ctx implicitly need to lemmatize the definiendum, since inflected variants of a given word are to be aligned to a common representation; thus they are not directly comparable with models trained with the citation form of the definiendum that solely use context as a cue-viz."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-163",
"text": "Gadetsky et al. (2018 ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-164",
"text": "All this makes D Ctx harder, but at the same time closer to a realistic application than the other two datasets, since each word appears inflected and in a specific sentential context."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-165",
"text": "For applications of definition modeling, it would only be beneficial to take up these challenges; for example, the output \"monotremes: plural of monotreme\" 6 would not have been self-contained, necessitating a second query for \"monotreme\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-166",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-167",
"text": "**RESULTS**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-168",
"text": "We use perplexity, a standard metric in definition modeling, to evaluate and compare our models."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-169",
"text": "Informally, perplexity assesses the model's confidence in producing the ground-truth output when presented the source input."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-170",
"text": "It is formally defined as the exponentiation of cross-entropy."
},
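Concretely, given the probabilities a model assigns to each ground-truth token, perplexity can be computed as follows (a minimal sketch of the standard definition):

```python
import math

# Perplexity = exp(cross-entropy), i.e. the exponentiated mean negative
# log-probability of the ground-truth tokens under the model.
def perplexity(token_probs):
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model assigning probability 0.5 to every token has perplexity 2:
assert abs(perplexity([0.5, 0.5, 0.5, 0.5]) - 2.0) < 1e-9
```

Lower values are better: a perplexity of 1 means the model is certain of every ground-truth token.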
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-171",
"text": "We do not report BLEU or ROUGE scores due to the fact that an important number of ground-truth definitions are comprised of a single word, in particular in D Nor (\u2248 25%)."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-172",
"text": "Single word outputs can either be assessed as entirely correct or entirely wrong using BLEU or ROUGE."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-173",
"text": "However consider for instance the word \"elation\": that it be defined either as \"mirth\" or \"joy\" should only influence our metric slightly, and not be discounted as a completely wrong prediction. , as they did not report the perplexity of their system and focused on a different dataset; likewise, consider only the Chinese variant of the task."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-174",
"text": "Perplexity measures for Noraset et al. (2017) and Gadetsky et al. (2018) are taken from the authors' respective publications."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-175",
"text": "All our models perform better than previous proposals, by a margin of 4 to 10 points, for a relative improvement of 11-23%."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-176",
"text": "Part of this improvement may be due to our use of Transformerbased architectures (Vaswani et al., 2017) , which is known to perform well on semantic tasks (Radford, 2018; Cer et al., 2018; Devlin et al., 2018; Radford et al., 2019, eg.) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-177",
"text": "Like Gadetsky et al. (2018) , we conclude that disambiguating the definiendum, when done correctly, improves performances: our best performing contex-tual model outranks the non-contextual variant by 5 to 6 points."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-178",
"text": "The marking of the definiendum out of its context (ADD vs. SELECT) also impacts results."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-179",
"text": "Note also that we do not rely on taskspecific external resources (unlike Noraset et al., 2017; or on pre-training (unlike Gadetsky et al., 2018) ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-180",
"text": "Our contextual systems trained on the D Gad dataset used the concatenation of the definiendum and the contextual cue as inputs."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-181",
"text": "The definiendum was always at the start of the training example."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-182",
"text": "This regular structure has shown to be useful for the models' performance: all models perform significantly worse on the more realistic data of D Ctx than on D Gad ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-183",
"text": "The D Ctx dataset is intrinsically harder for other reasons as well: it requires some form of lemmatization in every three out of eight training examples, and contains less data than other datasets, only half as many examples as D Nor , and 20% less than D Gad ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-184",
"text": "The surprisingly poor results of SELECT on the D Ctx dataset may be partially blamed on the absence of a regular structure in D Ctx ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-185",
"text": "Unlike D Gad , where the model must only learn to contextualize the first element of the sequence, in D Ctx the model has to single out the definiendum which may appear anywhere in the sentence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-186",
"text": "Any information stored only in representations of contextual tokens will be lost to the decoders."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-187",
"text": "The SELECT model therefore suffers of a bottleneck, which is highly regular in D Gad and that it may therefore learn to cope with; however predicting where in the input sequence the bottleneck will appear is far from trivial in the D Ctx dataset."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-188",
"text": "We also attempted to retrain this model with various settings of hyperparameters, modifying dropout rate, number of warmup steps, and number of layers in the encoder-but to no avail."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-189",
"text": "An alternative explanation may be that in the case of the D Gad dataset, the regular structure of the input entails that the first positional encoding is used as an additive marking device: only definienda are marked with the positional encoding pos(1), and thus the architecture does not purely embrace a selective approach but a mixed one."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-190",
"text": "In any event, even on the D Gad dataset where the margin is very small, the perplexity of the additive marking approach ADD is better than that of the SELECT model."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-191",
"text": "This lends empirical support to our claim that definition modeling is a nontrivial sequence-to-sequence task, which can be better treated with sequence methods."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-192",
"text": "The stability of the performance improvement over the non-contextual variant on both contextual datasets also highlights that our proposed additive marking is fairly robust, and functions equally well when confronted with somewhat artificial inputs, as in D Gad , or with linguistically coherent sequences, as in D Ctx ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-193",
"text": "A manual analysis of definitions produced by our system reveals issues similar to those discussed by Noraset et al. (2017) , namely self-reference, 7 POS-mismatches, over- and under-specificity, antonymy, and incoherence."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-194",
"text": "Annotating distinct productions from the validation set for the non-contextual model trained on D Nor , we counted 9.9% self-references, 11.6% POS-mismatches, and 1.3% of words defined as their antonyms."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-195",
"text": "We counted a POS-mismatch whenever the definition seemed to fit a part-of-speech other than that of the definiendum, regardless of the meaning of either; cf."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-196",
"text": "Table 2 for examples."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-197",
"text": "For comparison, we annotated the first 1000 productions of the validation set from our ADD model trained on D Ctx ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-198",
"text": "We counted 18.4% POS-mismatches and 4.4% self-referring definitions; examples are shown in Table 3 ."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-199",
"text": "The higher rate of POS-mismatch may be due to the model's difficulty in identifying which word is to be defined, since it is not presented with the definiendum alone: access to the full context may confuse it."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-200",
"text": "On the other hand, the lower number of self-referring definitions may also be linked to this richer, more varied input: this would allow the model not to fall back on simply reusing the definiendum as its own definiens."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-201",
"text": "7 Self-referring definitions are those where a definiendum is used as a definiens for itself; dictionaries are expected to be exempt from such definitions, as readers are assumed not to know the meaning of the definiendum when looking it up."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-202",
"text": "Self-referring definitions highlight that our models equate the meaning of the definiendum to the composed meaning of its definientia."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-203",
"text": "Simply masking the corresponding output embedding might suffice to prevent this specific problem; preliminary experiments in that direction suggest that this may also help decrease perplexity further."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-204",
"text": "As for POS-mismatches, we note that Noraset et al. (2017) reported a much lower rate of 4.29%: we suggest that this may be because they employ a learned character-level convolutional network, which arguably can capture orthography and rudiments of morphology."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-205",
"text": "Adding such a sub-module to our proposed architecture might diminish the number of mistagged definienda."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-206",
"text": "Another possibility would be to pre-train the model, as was done by Gadetsky et al. (2018) : in our case in particular, the encoder could be trained for POS-tagging or lemmatization."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-207",
"text": "Lastly, one important kind of mistake we observed is hallucination."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-208",
"text": "Consider for instance this production by the ADD model trained on D Ctx , for the word \"beta\": \"the twentieth letter of the Greek alphabet (\u03ba), transliterated as 'o'.\"."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-209",
"text": "Nearly everything it contains is factually wrong, though the general semantics are close enough to deceive an unaware reader."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-210",
"text": "8 We conjecture that filtering out hallucinatory productions will be a main challenge for future definition modeling architectures, for two main reasons: firstly, the tools and metrics necessary to assess and handle such hallucinations have yet to be developed; secondly, since the input given to the system consists of word embeddings, research will be faced with the problem of grounding these distributional representations: how can we ensure that \"beta\" is correctly defined as \"the second letter of the Greek alphabet, transliterated as 'b'\", if we only have access to a representation derived from its contexts of usage?"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-211",
"text": "Integration of word embeddings with structured knowledge bases might be needed for accurate treatment of such cases."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-213",
"text": "**ERROR TYPE**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-214",
"text": "Context (definiendum in bold) | Production"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-215",
"text": "POS-mismatch | her major is linguistics | most important or important; Self-reference | he wrote a letter of apology to the hostess | a formal expression of apology"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-217",
"text": "**CONCLUSION**"
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-218",
"text": "We introduced an approach to generating word definitions that allows the model to access rich contextual information about the word token to be defined."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-219",
"text": "Building on the distributional hypothesis, we naturally treat definition generation as a sequence-to-sequence task of mapping the word's context of usage (input sequence) into the context-appropriate definition (output sequence)."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-220",
"text": "We showed that our approach is competitive against a more naive 'contextualize and select' pipeline."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-221",
"text": "This was demonstrated by comparison both to the previous contextualized model by Gadetsky et al. (2018) and to the Transformer-based SELECT variation of our model, which differs from the proposed architecture only in the context encoding pipeline."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-222",
"text": "While our results are encouraging, the existing benchmarks limited our quantitative evaluation to perplexity measurements."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-223",
"text": "A more nuanced semantically driven methodology might be useful in the future to better assess the merits of our system in comparison to alternatives."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-224",
"text": "Our model opens several avenues of future explorations."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-225",
"text": "One could straightforwardly extend it to generate definitions of multiword expressions or phrases, or to analyze vector compositionality models by generating paraphrases for vector representations produced by these algorithms."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-226",
"text": "Another strength of our approach is that it can provide the basis for a standardized benchmark for contextualized and non-contextual embeddings alike: downstream evaluation tasks for embedding systems generally apply exclusively either to non-contextual embeddings (e.g., Gladkova et al., 2016) or to contextual embeddings (e.g., Wang et al., 2019). Redefining definition modeling as a sequence-to-sequence task will allow future work to compare models using contextual and non-contextual embeddings in a unified fashion."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-227",
"text": "Lastly, we also intend to experiment on languages other than English, especially considering that the required resources for our model amount only to a set of pretrained embeddings and a dataset of definitions, both of which are generally simple to obtain."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-228",
"text": "While there is potential for local improvements, our approach has demonstrated its ability to account for contextualized word meaning in a principled way, while training contextualized token encoding and definition generation end-to-end."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-229",
"text": "Our implementation is efficient and fast, building on free open source libraries for deep learning, and shows good empirical results."
},
{
"sent_id": "5177188d88391f08325262dbdefabf-C001-230",
"text": "Our code, trained models, and data will be made available to the community."
}
],
"y": {
"@MOT@": {
"gold_contexts": [
[
"5177188d88391f08325262dbdefabf-C001-22"
]
],
"cite_sentences": [
"5177188d88391f08325262dbdefabf-C001-22"
]
},
"@BACK@": {
"gold_contexts": [
[
"5177188d88391f08325262dbdefabf-C001-39"
],
[
"5177188d88391f08325262dbdefabf-C001-40"
],
[
"5177188d88391f08325262dbdefabf-C001-59"
],
[
"5177188d88391f08325262dbdefabf-C001-149"
]
],
"cite_sentences": [
"5177188d88391f08325262dbdefabf-C001-39",
"5177188d88391f08325262dbdefabf-C001-40",
"5177188d88391f08325262dbdefabf-C001-59",
"5177188d88391f08325262dbdefabf-C001-149"
]
},
"@SIM@": {
"gold_contexts": [
[
"5177188d88391f08325262dbdefabf-C001-101"
],
[
"5177188d88391f08325262dbdefabf-C001-177"
],
[
"5177188d88391f08325262dbdefabf-C001-220",
"5177188d88391f08325262dbdefabf-C001-221"
]
],
"cite_sentences": [
"5177188d88391f08325262dbdefabf-C001-101",
"5177188d88391f08325262dbdefabf-C001-177",
"5177188d88391f08325262dbdefabf-C001-221"
]
},
"@USE@": {
"gold_contexts": [
[
"5177188d88391f08325262dbdefabf-C001-174"
]
],
"cite_sentences": [
"5177188d88391f08325262dbdefabf-C001-174"
]
},
"@DIF@": {
"gold_contexts": [
[
"5177188d88391f08325262dbdefabf-C001-179"
]
],
"cite_sentences": [
"5177188d88391f08325262dbdefabf-C001-179"
]
},
"@FUT@": {
"gold_contexts": [
[
"5177188d88391f08325262dbdefabf-C001-205",
"5177188d88391f08325262dbdefabf-C001-206"
]
],
"cite_sentences": [
"5177188d88391f08325262dbdefabf-C001-206"
]
}
}
},
"ABC_7f2622701e1f6c8492ec627b6ac32b_7": {
"x": [
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-2",
"text": "Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-3",
"text": "However, there has been a longstanding debate whether enhancement should also be carried out on the ASR training data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-4",
"text": "In an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner party data we show that: (i) cleaning up the training data can lead to substantial error rate reductions, and (ii) enhancement in training is advisable as long as enhancement in test is at least as strong as in training."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-5",
"text": "This approach stands in contrast to, and delivers larger gains than, the common strategy reported in the literature of augmenting the training database with additional artificially degraded speech."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-6",
"text": "Together with an acoustic model topology consisting of initial CNN layers followed by factorized TDNN layers, we achieve 41.6 % and 43.2 % WER on the DEV and EVAL test sets, respectively, a new single-system state-of-the-art result on the CHiME-5 data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-7",
"text": "This is an 8 % relative improvement over the best word error rate published so far for a speech recognizer without system combination."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-10",
"text": "Neural networks have outperformed earlier Gaussian Mixture Model (GMM) based acoustic models in terms of modeling power and increased robustness to acoustic distortions."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-11",
"text": "Despite that, speech enhancement has been shown to deliver additional word error rate (WER) improvements, if multi-channel data is available."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-12",
"text": "This is due to its ability to exploit spatial information, which is reflected in the phase differences between microphone channels in the Short Time Fourier Transform (STFT) domain."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-13",
"text": "This information is not accessible to the Automatic Speech Recognition (ASR) system, at least not if it operates on the common log mel spectral or cepstral feature sets."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-14",
"text": "Also, dereverberation algorithms have been shown to consistently improve ASR results, since the temporal dispersion of the signal caused by reverberation is difficult to capture by an ASR acoustic model [1] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-15",
"text": "However, there has been a long debate whether it is advisable to apply speech enhancement on data used for ASR training, because it is generally agreed upon that the recognizer should be exposed to as much acoustic variability as possible during training, as long as this variability matches the test scenario [2] [3] [4] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-16",
"text": "Multi-channel speech enhancement, such as acoustic beamforming (BF) or source separation, would not only reduce the acoustic variability, it would also result in a reduction of the amount of training data by a factor of M , where M is the number of microphones [5] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-17",
"text": "Previous studies have shown the benefit of training an ASR on matching enhanced speech [6, 7] or on jointly training the enhancement and the acoustic model [8] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-18",
"text": "Alternatively, the training data is often artificially increased by adding even more degraded speech to it."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-19",
"text": "For instance, Ko et al. [9] found that adding simulated reverberated speech improves accuracy significantly on several large vocabulary tasks."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-20",
"text": "Similarly, Manohar et al. [10] improved the WER of the baseline CHiME-5 system by 5.5 % relative by augmenting the training data with approx."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-21",
"text": "160 hrs of simulated reverberated speech."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-22",
"text": "However, not only can the generation of new training data be costly and time consuming, the training process itself is also prolonged if the amount of data is increased."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-23",
"text": "In this contribution we advocate for the opposite approach."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-24",
"text": "Although we still believe in the argument that ASR training should see sufficient variability, instead of adding degraded speech to the training data, we clean up the training data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-25",
"text": "We make, however, sure that the remaining acoustic variability is at least as large as on the test data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-26",
"text": "By applying a beamformer to the multi-channel input, we even reduce the amount of training data significantly."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-27",
"text": "Consequently, this leads to cheaper and faster acoustic model training."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-28",
"text": "We perform experiments using data from the CHiME-5 challenge which focuses on distant multi-microphone conversational ASR in real home environments [11] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-29",
"text": "The CHiME-5 data is heavily degraded by reverberation and overlapped speech."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-30",
"text": "As much as 23 % of the time more than one speaker is active at the same time [12] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-31",
"text": "The poor performance of the challenge's baseline system (about 80 % WER) is an indication that ASR training did not work well."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-32",
"text": "Recently, Guided Source Separation (GSS) enhancement on the test data was shown to significantly improve the performance of an acoustic model, which had been trained with a large amount of unprocessed and simulated noisy data [13] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-33",
"text": "GSS is a spatial mixture model based blind source separation approach which exploits the annotation given in the CHiME-5 database for initialization and, in this way, avoids the frequency permutation problem [14] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-34",
"text": "We conjectured that cleaning up the training data would enable a more effective acoustic model training for the CHiME-5 scenario."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-35",
"text": "We have therefore experimented with enhancement algorithms of various strengths, from relatively simple beamforming over single-array GSS to a quite sophisticated multi-array GSS approach, and tested all combinations of training and test data enhancement methods."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-36",
"text": "Furthermore, compared to the initial GSS approach in [14] , we describe here some modifications, which led to improved performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-37",
"text": "We also propose an improved neural acoustic modeling structure compared to the CHiME-5 baseline system described in [10] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-38",
"text": "It consists of initial Convolutional Neural Network (CNN) layers followed by factorized TDNN (TDNN-F) layers, instead of a homogeneous TDNN-F architecture."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-39",
"text": "Using a single acoustic model trained with 308 hrs of training data, which resulted after applying multi-array GSS data cleaning and a three-fold speed perturbation, we achieved a WER of 41.6 % on the development (DEV) and 43.2 % on the evaluation (EVAL) test set of CHiME-5, if the test data is also enhanced with multi-array GSS."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-40",
"text": "This compares very favorably with the recently published topline in [13] , where the single-system best result, i.e., the WER without system combination, was 45.1 % and 47.3 % on DEV and EVAL, respectively, using an augmented training data set of 4500 hrs total."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-41",
"text": "The rest of this paper is structured as follows."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-42",
"text": "Section 2 describes the CHiME-5 corpus, Section 3 briefly presents the guided source separation enhancement method, Section 4 shows the ASR experiments and the results, followed by a discussion in Section 5."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-43",
"text": "Finally, the paper is concluded in Section 6."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-45",
"text": "**CHIME-5 CORPUS DESCRIPTION**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-46",
"text": "The CHiME-5 corpus comprises twenty dinner party recordings (sessions) lasting for approximately 2 hrs each."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-47",
"text": "A session contains the conversation among the four dinner party participants."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-48",
"text": "Recordings were made in kitchen, dining and living room areas with each phase lasting for a minimum of 30 mins."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-49",
"text": "16 dinner parties were used for training, 2 were used for development, and 2 were used for evaluation."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-50",
"text": "There were two types of recording devices collecting CHiME-5 data: distant 4-channel (linear) Microsoft Kinect arrays (referred to as units or 'U') and in-ear Soundman OKM II Classic Studio binaural microphones (referred to as worn microphones or 'W')."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-51",
"text": "Six Kinect arrays were used in total and they were placed such that at least two units were able to capture the acoustic environment in each recording area."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-52",
"text": "Each dinner party participant wore in-ear microphones which were subsequently used to facilitate human audio transcription of the data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-53",
"text": "The devices were not time synchronized during recording."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-54",
"text": "Therefore, the W and the U signals had to be aligned afterwards using a correlation based approach provided by the organizers."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-55",
"text": "Depending on how many arrays were available during test time, the challenge had a single (reference) array and a multiple array track."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-56",
"text": "For more details about the corpus, the reader is referred to [11] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-58",
"text": "**GUIDED SOURCE SEPARATION**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-59",
"text": "GSS enhancement is a blind source separation technique originally proposed in [14] 1 to alleviate the speaker overlap problem in CHiME-5."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-60",
"text": "Given a mixture of reverberated overlapped speech, GSS aims to separate the sources using a pure signal processing approach."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-61",
"text": "An Expectation Maximization (EM) algorithm estimates the parameters of a spatial mixture model and the posterior probabilities of each speaker being active are used for mask based beamforming."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-62",
"text": "An overview block diagram of this enhancement by source separation is depicted in Fig. 1 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-63",
"text": "It follows the approach presented in [13] , which was shown to outperform the baseline version."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-64",
"text": "The system operates in the STFT domain and consists of two stages: (1) a dereverberation stage, and (2) a guided source separation stage."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-65",
"text": "For the sake of simplicity, the overall system is referred to as GSS for the rest of the paper."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-66",
"text": "Regarding the first stage, the multiple-input multiple-output version of the Weighted Prediction Error (WPE) method was used for dereverberation (M inputs and M outputs) [15, 16] . The guided source separation stage uses a spatial mixture model (MM) with five mixture components, one representing each speaker and an additional component representing the noise class."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-67",
"text": "The role of the MM is to support the source extraction component for estimating the target speech."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-68",
"text": "The class affiliations computed in the E-step of the EM algorithm are employed to estimate spatial covariance matrices of target signals and interferences, from which the coefficients of a Minimum Variance Distortionless Response (MVDR) beamformer are computed [18] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-69",
"text": "The reference channel for the beamformer is estimated based on an SNR criterion [19] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-70",
"text": "The beamformer is followed by a postfilter to reduce the remaining speech distortions [20] , which in turn is followed by an additional (optional) masking stage to improve crosstalk suppression."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-71",
"text": "Those masks are also given by the mentioned class affiliations."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-72",
"text": "For the single array (CHiME-5) track, simulations have shown that multiplying the beamformer output with the target speaker mask improves the performance on the U data, but the same approach degrades the performance in the multiple array track [14] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-73",
"text": "This is because the spatial selectivity of a single array is very limited in CHiME-5: the speakers' signals arrive at the array, which is mounted on the wall at some distance, at very similar impinging angles, rendering single array beamforming rather ineffective."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-74",
"text": "Consequently, additional masking has the potential to improve the beamformer performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-75",
"text": "Conversely, the MM estimates are more accurate in the multiple array case since they benefit from a more diverse spatial arrangement of the microphones, and the signal distortions introduced by the additional masking rather degrade the performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-76",
"text": "Consequently, for our experiments we have used the masking approach for the single array track, but not for the multiple array one."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-77",
"text": "GSS exploits the baseline CHiME-5 speaker diarization information available from the transcripts (annotations) to determine when multiple speakers talk simultaneously (see Fig. 2 )."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-78",
"text": "This crosstalk information is then used to guide the parameter estimation of the MM both during EM initialization (posterior masks set to one divided by the number of active speakers for active speakers' frames, and zero for the non-active speakers) and after each E-step (posterior masks are clamped to zero for non-active speakers)."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-79",
"text": "The initialization of the EM for each mixture component is very important for the correct convergence of the algorithm."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-80",
"text": "If the EM initialization is close enough to the final solution, then it is expected that the algorithm will correctly separate the sources and source indices are not permuted across frequency bins."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-81",
"text": "This has a major practical application, since frequency permutation solvers like [21] become obsolete."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-82",
"text": "Temporal context also plays an important role in the EM initialization."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-83",
"text": "Simulations have shown that a large context of 15 seconds left and right of the considered segment improves the mixture model estimation performance significantly for CHiME-5 [14] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-84",
"text": "However, having such a large temporal context may become problematic when the speakers are moving, because the estimated spatial covariance matrix can become outdated due to the movement [13] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-85",
"text": "Alternatively, one can run the EM first with a larger temporal context until convergence, then drop the context and re-run it for some more iterations."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-86",
"text": "As shown later in the paper, this approach did not improve ASR performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-87",
"text": "Therefore, the temporal context was only used for dereverberation and the mixture model parameter estimation, while for the estimation of covariance matrices for beamforming the context was dropped and only the original segment length was considered [13] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-88",
"text": "Another avenue we explored for further source separation improvement was to refine the baseline CHiME-5 annotations using ASR output (see Fig. 1 )."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-89",
"text": "A first-pass decoding using an ASR system is used to predict silence intervals."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-90",
"text": "Then this information is used to adjust the time annotations, which are used in the EM algorithm as described above."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-91",
"text": "When the ASR decoder indicates silence for a speaker, the corresponding class posterior in the MM is forced to zero."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-92",
"text": "Depending on the number of available arrays for CHiME-5, two flavours of GSS enhancement were used in this work."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-93",
"text": "In the single array track, all 4 channels of the array are used as input (M = 4), and the system is referred to as GSS1."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-94",
"text": "In the multi array track, all six arrays are stacked to form a 24 channels super-array (M = 24), and this system is denoted as GSS6."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-95",
"text": "The baseline time synchronization provided by the challenge organizers was sufficient to align the data for GSS6."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-97",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-99",
"text": "**GENERAL CONFIGURATION**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-100",
"text": "Experiments were performed using the CHiME-5 data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-101",
"text": "Distant microphone recordings (U data) during training and/or testing were processed using the speech enhancement methods depicted in Table 1 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-102",
"text": "Speech was either left unprocessed, enhanced using a weighted delay-and-sum beamformer (BFIt) [22] with or without dereverberation (WPE), or processed using the guided source separation (GSS) approach described in Section 3."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-103",
"text": "In Table 1 , the strength of the enhancement increases from top to bottom, i.e., GSS6 signals are much cleaner than the unprocessed ones."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-104",
"text": "The standard CHiME-5 recipes were used to: (i) train GMM-HMM alignment models, (ii) clean up the training data, and (iii) augment the training data using three-fold speed perturbation. (Table 1 residue: None, single/multi; BeamformIt [22], single, BFIt; WPE + BeamformIt [10], single, WPE+BFIt; WPE + GSS1 + BF w/o context [14], single, GSS1; WPE + GSS6 + BF w/o context [14], multi, GSS6.)"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-105",
"text": "The acoustic feature vector consisted of 40-dimensional MFCCs appended with 100-dimensional i-vectors."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-106",
"text": "By default, the acoustic models were trained using the Lattice-Free Maximum Mutual Information (LF-MMI) criterion and a 3-gram language model was used for decoding [11] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-107",
"text": "Discriminative training (DT) [23] and an additional RNN-based language model (RNN-LM) [24] were applied to improve recognition accuracy for the best performing systems."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-109",
"text": "**ACOUSTIC MODEL**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-110",
"text": "The initial baseline system [11] of the CHiME-5 challenge uses a Time Delay Neural Network (TDNN) acoustic model (AM)."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-111",
"text": "However, recently it has been shown that introducing factorized layers into the TDNN architecture facilitates training deeper networks and also improves the ASR performance [25] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-112",
"text": "This architecture has been employed in the new baseline system for the challenge [10] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-113",
"text": "The TDNN-F has 15 layers with a hidden dimension of 1536 and a bottleneck dimension of 160; each layer also has a resnet-style bypass connection from the output of the previous layer, and a \"continuous dropout\" schedule [10] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-114",
"text": "In addition to the TDNN-F, the newly released baseline 3 also uses simulated reverberated speech from worn microphone recordings to augment the training set; it employs front-end speech dereverberation and beamforming (WPE+BFIt), as well as robust i-vector extraction using 2-stage decoding."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-115",
"text": "CNNs have been previously shown to improve ASR robustness [26] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-116",
"text": "Therefore, combining CNN and TDNN-F layers is a promising approach to improve the baseline system of [10] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-117",
"text": "To test this hypothesis, a CNN-TDNNF AM architecture 4 consisting of 6 CNN layers followed by 9 TDNN-F layers was compared against an AM having 15 TDNN-F layers."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-118",
"text": "All TDNN-F layers have the topology described above."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-119",
"text": "ASR results are given in Table 2 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-120",
"text": "The first two rows show that replacing the TDNN-F with the CNN-TDNNF AM yielded more than 2 % absolute WER reduction."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-121",
"text": "We also trained another CNN-TDNNF model using only a small subset (worn + 100k utterances from arrays) of the training data (about 316 hrs in total), which produced slightly better WERs than the baseline TDNN-F trained on a much larger dataset (roughly 1416 hrs in total)."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-122",
"text": "For consistency, 2-stage decoding was used for all results in Table 2 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-123",
"text": "We conclude that the CNN-TDNNF model outperforms the TDNNF model for the CHiME-5 scenario and, therefore, for the remainder of the paper we only report results using the CNN-TDNNF AM."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-125",
"text": "**ENHANCEMENT EFFECTIVENESS FOR ASR TRAINING AND TEST**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-126",
"text": "An extensive set of experiments was performed to measure the WER impact of enhancement on the CHiME-5 training and test data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-127",
"text": "We test enhancement methods of varying strengths, as described in Section 4.1, and the results are depicted in Table 3 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-128",
"text": "In all cases, the (unprocessed) worn dataset was also included for AM training since it was found to improve performance (supporting therefore the argument that data variability helps ASR robustness)."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-129",
"text": "In Table 3 , in each row the recognition accuracy improves monotonically from left to right, i.e., as the enhancement strategy on the test data becomes stronger."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-130",
"text": "Reading each column of the table from top to bottom, one observes that accuracy improves with increasing strength of the enhancement on the training data, but only as long as the enhancement on the training data is not stronger than that on the test data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-131",
"text": "Compared with unprocessed training and test data (None-None), GSS6-GSS6 yields roughly 35 % (24 %) relative WER reduction on the DEV (EVAL) set, and 12 % (11 %) relative WER reduction when compared with the None-GSS6 scenario."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-132",
"text": "Comparing the amount of training data used to train the acoustic models, we observe that it decreases drastically from no enhancement to the GSS6 enhancement."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-134",
"text": "**STATE-OF-THE-ART SINGLE-SYSTEM FOR CHIME-5**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-135",
"text": "To facilitate comparison with the recently published top-line in [13] (H/UPB), we have conducted a more focused set of experiments whose results are depicted in Table 4 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-136",
"text": "As explained in Section 5.1, we opted for [13] instead of [14] as baseline because the former system is stronger."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-137",
"text": "The experiments include refining the GSS enhancement using time annotations from ASR output (GSS w/ ASR), performing discriminative training on top of the AMs trained with LF-MMI and performing RNN LM rescoring."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-138",
"text": "All the above helped further improve ASR performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-139",
"text": "We report performance of our system on both single and multiple array tracks."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-140",
"text": "To have a fair comparison, the results are compared with the single-system performance reported in [13] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-141",
"text": "For the single array track, the proposed system without RNN LM rescoring achieves 16 % (11 %) relative WER reduction on the DEV (EVAL) set when compared with System8 in [13] (row one in Table 4 )."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-142",
"text": "RNN LM rescoring further helps improve the proposed system performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-143",
"text": "For the multi array track, the proposed system without RNN LM rescoring achieved 6 % (7 %) relative WER reduction on the DEV (EVAL) set when compared with System16 in [13] (row six in Table 4 )."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-144",
"text": "We also performed a test using GSS with the oracle alignments (GSS w/ oracle) to assess the potential of time annotation refinement (gray shade lines in Table 4 )."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-145",
"text": "It can be seen that there is some, though not much, room for improvement."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-146",
"text": "Finally, cleaning up the training set not only boosted the recognition performance, but managed to do so using a fraction of the training data in [13] , as shown in Table 5 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-147",
"text": "This translates to significantly faster and cheaper training of acoustic models, which is a major advantage in practice."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-149",
"text": "**DISCUSSION**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-151",
"text": "**TEMPORAL CONTEXT CONFIGURATION FOR GSS**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-152",
"text": "Our experiments have shown that the temporal context of some GSS components has a significant effect on the WER."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-153",
"text": "Two cases are investigated: (i) partially dropping the temporal context for the EM stage, and (ii) dropping the temporal context for beamforming."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-154",
"text": "The evaluation was conducted with an acoustic model trained on unprocessed speech and the enhancement was applied during test only."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-155",
"text": "Results are depicted in Table 6 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-156",
"text": "The first row corresponds to the GSS configuration in [14] while the second one corresponds to the GSS configuration in [13] ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-157",
"text": "The first two rows show that dropping the temporal context when estimating the statistics for beamforming improves ASR accuracy."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-158",
"text": "For the last row, the EM algorithm was run 20 iterations with temporal context, followed by another 10 without context."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-159",
"text": "(Table 6 residue: EM with 20 iterations w/ context and beamforming w/ context [14]: 54.7 (52.3); the same EM with beamforming w/o context [13]: 51.8 (51.6); EM with 20 iterations w/ context plus 10 w/o context and beamforming w/o context: 52.2 (52.5).)"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-160",
"text": "Since the performance decreased, we concluded that the best configuration for the GSS enhancement in the CHiME-5 scenario is to use the full temporal context for the EM stage and to drop it for the beamforming stage."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-161",
"text": "Consequently, we have chosen the system of [13] as the baseline in this study, since it uses the stronger GSS configuration."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-163",
"text": "**ANALYSIS OF SPEAKER OVERLAP EFFECT ON WER ACCURACY**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-164",
"text": "The results presented so far were overall accuracies on the test set of CHiME-5."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-165",
"text": "However, since speaker overlap is a major issue for these data, it is of interest to investigate the methods' performance as a function of the amount of overlapped speech."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-166",
"text": "Employing the original CHiME-5 annotations, the word distribution of overlapped speech was computed for DEV and EVAL sets (silence portions were not filtered out)."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-167",
"text": "The five-bin normalized histogram of the data is plotted in Fig. 3 ."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-168",
"text": "Interestingly, the fraction of words with low speech overlap is significantly higher for the EVAL than for the DEV set, and, conversely, the fraction of words with high speech overlap is considerably lower for the EVAL than for the DEV set."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-169",
"text": "This distribution may explain the difference in performance observed between the DEV and EVAL sets."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-170",
"text": "Based on the distributions in Fig. 3 , the test data was split according to the amount of speaker overlap."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-171",
"text": "Two cases were considered: (a) same enhancement for training and test data (matched case, Table 7 ), and (b) unprocessed training data and enhanced test data (mismatched case, Table 8 )."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-172",
"text": "As expected, the WER increases monotonically as the amount of overlap increases in both scenarios, and the recognition accuracy improves as the enhancement method becomes stronger."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-173",
"text": "Graphical representations of the WER gains (relative to the unprocessed case) in Tables 7 and 8 show that, in the low-overlap regions, there is little gain from the source separation algorithm."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-174",
"text": "Conversely, as the amount of speaker overlap increases, the accuracy gain (relative to None) of the stronger GSS enhancement improves quite significantly."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-175",
"text": "A rather small decrease in accuracy is observed in the mismatched case ( Fig. 5) for GSS1 in the lower overlap regions."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-176",
"text": "As already mentioned in Section 3, this is due to the masking stage."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-177",
"text": "It has previously been observed that using masking for speech enhancement without a cross talker decreases ASR recognition performance."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-178",
"text": "We have also included in Fig. 5 the GSS1 version without masking (GSS w/o Mask), which indeed yields significant accuracy gains on segments with little overlap."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-179",
"text": "However, since the overall accuracy of GSS1 with masking is higher than that of GSS1 without masking, GSS w/o Mask was not included in the previous experiments."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-181",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-182",
"text": "In this paper we performed an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner party data showing that: (i) cleaning up training data can lead to substantial word error rate reduction, and (ii) enhancement in training is advisable as long as enhancement in test is at least as strong as in training."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-183",
"text": "This approach stands in contrast to the common data simulation strategy found in the literature, delivering larger accuracy gains with only a fraction of the training data."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-184",
"text": "Using a CNN-TDNNF acoustic model topology along with GSS enhancement refined with time annotations from ASR, discriminative training and RNN LM rescoring, we achieved a new single-system state-of-the-art result on CHiME-5 of 41.6 % (43.2 %) WER on the development (evaluation) set, an 8 % relative word error rate improvement over the best comparable system reported so far."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-185",
"text": "Fig. 5 : Relative WER gain for the mismatched case vs unprocessed, Table 8 row one (CNN-TDNNF AM trained on unprocessed)."
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-186",
"text": "----------------------------------"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-187",
"text": "**ACKNOWLEDGMENTS**"
},
{
"sent_id": "7f2622701e1f6c8492ec627b6ac32b-C001-188",
"text": "Parts of computational resources required in this study were provided by the Paderborn Center for Parallel Computing."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7f2622701e1f6c8492ec627b6ac32b-C001-32"
]
],
"cite_sentences": [
"7f2622701e1f6c8492ec627b6ac32b-C001-32"
]
},
"@DIF@": {
"gold_contexts": [
[
"7f2622701e1f6c8492ec627b6ac32b-C001-39",
"7f2622701e1f6c8492ec627b6ac32b-C001-40"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-141"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-143"
]
],
"cite_sentences": [
"7f2622701e1f6c8492ec627b6ac32b-C001-40",
"7f2622701e1f6c8492ec627b6ac32b-C001-141",
"7f2622701e1f6c8492ec627b6ac32b-C001-143"
]
},
"@USE@": {
"gold_contexts": [
[
"7f2622701e1f6c8492ec627b6ac32b-C001-63"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-87"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-135"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-136"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-140"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-146"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-155",
"7f2622701e1f6c8492ec627b6ac32b-C001-156"
],
[
"7f2622701e1f6c8492ec627b6ac32b-C001-161"
]
],
"cite_sentences": [
"7f2622701e1f6c8492ec627b6ac32b-C001-63",
"7f2622701e1f6c8492ec627b6ac32b-C001-87",
"7f2622701e1f6c8492ec627b6ac32b-C001-135",
"7f2622701e1f6c8492ec627b6ac32b-C001-136",
"7f2622701e1f6c8492ec627b6ac32b-C001-140",
"7f2622701e1f6c8492ec627b6ac32b-C001-146",
"7f2622701e1f6c8492ec627b6ac32b-C001-156",
"7f2622701e1f6c8492ec627b6ac32b-C001-161"
]
},
"@MOT@": {
"gold_contexts": [
[
"7f2622701e1f6c8492ec627b6ac32b-C001-84"
]
],
"cite_sentences": [
"7f2622701e1f6c8492ec627b6ac32b-C001-84"
]
},
"@EXT@": {
"gold_contexts": [
[
"7f2622701e1f6c8492ec627b6ac32b-C001-87"
]
],
"cite_sentences": [
"7f2622701e1f6c8492ec627b6ac32b-C001-87"
]
}
}
},
"ABC_24506b0aa7a859eb8744e390f9fb60_7": {
"x": [
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-2",
"text": "We present and implement a fourth-order projective dependency parsing algorithm that effectively utilizes both \"grand-sibling\" style and \"tri-sibling\" style interactions of third-order factored parts and \"grand-tri-sibling\" style interactions of fourth-order factored parts for performance enhancement."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-3",
"text": "This algorithm requires O(n^5) time and O(n^4) space."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-4",
"text": "We implement and evaluate the parser on two languages, English and Chinese, achieving state-of-the-art accuracy on both."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-5",
"text": "These results show that a higher-order (\u22654) dependency parser gives performance improvements over all previous lower-order parsers."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-8",
"text": "In recent years, dependency parsing has gained universal interest due to its usefulness in a wide range of applications such as synonym generation (Shinyama et al., 2002) , relation extraction (Nguyen et al., 2009 ) and machine translation (Katz-Brown et al., 2011; Xie et al., 2011) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-9",
"text": "CoNLL-X shared task on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007) made a comparison of many algorithms, and graph-based parsing models have achieved stateof-the-art accuracy for a wide range of languages."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-10",
"text": "Graph-based dependency parsing algorithms usually use the factored representations of dependency trees: a set of small parts with special structures."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-11",
"text": "The types of features that the model can exploit depend on the information included in the factorizations."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-12",
"text": "Several previous works have shown that higher-order parsers utilizing richer contextual information achieve higher accuracy than lower-order ones: Chen et al. (2010) illustrated that a wide range of decision history can lead to significant improvements in accuracy for graph-based dependency parsing models."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-13",
"text": "Meanwhile, several previous works (Carreras, 2007; Koo and Collins, 2010) have shown that grandchild interactions provide important information for dependency parsing."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-14",
"text": "However, the computational cost of the parsing algorithm increases with the need for more expressive factorizations."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-15",
"text": "Consequently, the existing most powerful parser (Koo and Collins, 2010) is limited to third-order parts, which requires O(n^4) time and O(n^3) space."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-16",
"text": "In this paper, we further present a fourth-order parsing algorithm that can utilize richer information by enclosing grand-sibling and tri-sibling parts in a grand-tri-sibling part."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-17",
"text": "Koo and Collins (2010) discussed the possibility that the third-order parsers are extended to fourth-order by increasing vertical context (e.g. from grand-siblings to \"great-grand-siblings\") or horizontal context (e.g. from grand-siblings to \"grand-tri-siblings\"), and Koo (2010) first described this algorithm."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-18",
"text": "In this work, we show that grand-tri-siblings can effectively work."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-19",
"text": "The computational requirements of this algorithm are O(n^5) time and O(n^4) space."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-20",
"text": "To evaluate our parser empirically, we implement and run the proposed parsing algorithm on the Penn WSJ Treebank (Marcus et al., 1993) for English and the Penn Chinese Treebank (Xue et al., 2005) for Chinese, achieving state-of-the-art accuracy on both."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-21",
"text": "A free distribution of our implementation in C++ has been put on the Internet."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-22",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-23",
"text": "**RELATED WORK**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-24",
"text": "There have been several existing graph-based dependency parsing algorithms, which are the backbones of the new fourth-order dependency parser."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-25",
"text": "In this section, we mainly describe four graph-based dependency parsers with different types of factorization."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-26",
"text": "The first-order parser (McDonald et al., 2005 ) decomposes a dependency tree into its individual edges."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-27",
"text": "Eisner (2000) introduced a widely-used dynamic programming algorithm for first-order parsing, which is to parse the left and right dependents of a word independently, and combine them at a later stage."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-28",
"text": "This algorithm introduces two types of dynamic programming structures: complete spans, and incomplete spans (McDonald, 2006) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-29",
"text": "Larger spans are created from two smaller, adjacent spans by recursive combination in a bottom-up procedure."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-30",
"text": "McDonald and Pereira (2006) defined a second-order sibling dependency parser in which interactions between adjacent siblings are allowed."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-31",
"text": "Koo and Collins (2010) proposed an algorithm that factors each dependency tree into a set of grandchild parts."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-32",
"text": "Formally, a grandchild part is a triple of indices (g, s, t) where g is the head of s and s is the head of t. In order to parse this factorization, it is necessary to augment both complete and incomplete spans with grandparent indices."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-33",
"text": "Following Koo and Collins (2010) , we refer to these augmented structures as g-spans."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-34",
"text": "The second-order parser proposed in Carreras (2007) is capable of scoring both sibling and grandchild parts with complexities of O(n^4) time and O(n^3) space."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-35",
"text": "However, the parser suffers from a crucial limitation: it can only evaluate grandchild parts for the outermost grandchildren."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-36",
"text": "Koo and Collins (2010) proposed a third-order grand-sibling parser that decomposes each tree into a set of grand-sibling parts, i.e., parts that combine sibling parts and grandchild parts."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-37",
"text": "This factorization defines all grandchild and sibling parts and still requires O(n^4) time and O(n^3) space."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-38",
"text": "Koo and Collins (2010) also discussed the possibility that the third-order parsers are extended to fourth-order by increasing vertical context or horizontal context and Koo (2010) first described this algorithm."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-39",
"text": "Zhang and McDonald (2012) generalized the Eisner (1996) algorithm to handle arbitrary features over higher-order dependencies."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-40",
"text": "However, their generalized algorithm suffers from quite high time and space complexities; for instance, the parsing time complexity is O(n^5) for a third-order factored model."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-41",
"text": "In order to achieve asymptotic efficiency, cube pruning (Chiang, 2007) is utilized for decoding."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-42",
"text": "Another dominant category of data-driven dependency parsing systems is local-and-greedy transition-based parsing (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004; Attardi, 2006; McDonald and Nivre, 2007), which parameterizes models over transitions from one state to another in an abstract state machine."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-43",
"text": "In these models, dependency trees are constructed by making a series of incremental decisions."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-44",
"text": "Parameters in these models are typically learned using standard classification techniques."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-46",
"text": "**FOURTH-ORDER PARSING ALGORITHM**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-47",
"text": "In this section, we propose our fourth-order dependency parsing algorithm, which factors each dependency tree into a set of grand-tri-sibling parts."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-48",
"text": "Specifically, a grand-tri-sibling is a 5-tuple of indices (g, s, r, m, t) where (s, r, m, t) is a tri-sibling part and (g, s, r, m) and (g, s, m, t) are grand-sibling parts."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-49",
"text": "The algorithm is characterized by introducing a new type of incomplete span structure, grand-sibling-spans or gs-spans, obtained by augmenting incomplete g-spans with a sibling index."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-50",
"text": "Formally, we denote gs-spans as [g, s, m, t]. (Figure 1: The dynamic-programming structures and derivations of the fourth-order grand-tri-sibling parser.)"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-51",
"text": "Symmetric right-headed versions are elided for brevity."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-52",
"text": "(Figure 2: Pseudo-code of the bottom-up chart parser for the fourth-order grand-tri-sibling parsing algorithm.) Figure 1 provides a graphical specification of the fourth-order grand-tri-sibling parsing algorithm."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-53",
"text": "An incomplete gs-span is constructed by combining a smaller incomplete gs-span, representing the next-innermost pair of modifiers, with a sibling g-span."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-54",
"text": "The algorithm resembles the third-order grand-sibling parser, except that the incomplete g-spans are constructed from an incomplete gs-span spanning the same region."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-55",
"text": "We will now describe the fourth-order grand-tri-sibling parsing algorithm in more detail."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-57",
"text": "**FEATURE SPACE**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-58",
"text": "Following previous works (McDonald and Pereira, 2006; Koo and Collins, 2010), the fourth-order parser captures not only the features associated with the corresponding fourth-order grand-tri-sibling parts, but also the features of the relevant lower-order parts that are enclosed in its factorization."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-59",
"text": "The lower-order features (first-order features of dependency parts and second-order features of grandchild and sibling parts) are based on feature sets from previous work (McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-60",
"text": "We added lexicalized versions of several features."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-61",
"text": "For example, second-order grandchild feature set defines lexical trigram features, while previous work only used POS trigram features."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-62",
"text": "Table 1 outlines all feature templates of third-order grand-sibling, third-order tri-sibling, and fourth-order grand-tri-sibling parts."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-63",
"text": "The fourth-order feature set consists of two sets of features."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-64",
"text": "The first set of features is defined to be 5-gram features that is a 5-tuple consisting of five relevant indices using words and POS tags."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-65",
"text": "The second set of features is defined as backed-off features (Koo and Collins, 2010) for grand-tri-sibling part (g, s, r, m, t)-the 4-gram (g, r, m, t), which never exist in any lower-order part."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-66",
"text": "The determination of this feature set is based on on experiments on the development data for both English and Chinese."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-67",
"text": "In section 5.1 we examine the impact of these new features on parsing performance."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-68",
"text": "According to Table 1 , several features in our parser depend on part-of-speech (POS) tags of input sentences."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-69",
"text": "For English, POS tags are automatically assigned by the SVMTool tagger (Gimenez and Marquez, 2004) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-70",
"text": "The accuracy of the SVMTool tagger on PTB is 97.3%; For Chinese, we used gold-standard POS tags in CTB."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-71",
"text": "Following Koo and Collins (2010) , two versions of POS tags are used for any features involve POS: one using is normal POS tags and another is a coarsened version of the POS tags."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-73",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-74",
"text": "The proposed fourth-order dependency parsing algorithm is evaluated on the Penn English Treebank (PTB 3.0) (Marcus et al., 1993) and the Penn Chinese Treebank (CTB 5.0)."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-75",
"text": "For English, the PTB data is prepared by using the standard split: sections 2-21 are used for training, section 22 is for development, and section 23 for test."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-76",
"text": "For Chinese, we adopt the identical training/validation/testing data split and experimental set-up as Zhang and Clark (2009) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-77",
"text": "Dependencies are extracted by using Penn2Malt 3 tool."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-78",
"text": "Parsing accuracy is measured with unlabeled attachment score (UAS): the percentage of words with the correct head, and the percentage of complete matches (CM)."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-81",
"text": "The k-best version of the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003; Crammer et al., 2006; McDonald, 2006) for the max-margin models (Taskar et al., 2003) is chosen for parameter estimation of our parsing model, In practice, we set k = 10 and exclude the sentences containing more than 100 words in both the training data sets of English and Chinese in all experiments."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-82",
"text": "Table 2 : The effect of different types of features on the development sets for English and Chinese."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-84",
"text": "**DEVELOPMENT EXPERIMENTS**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-85",
"text": "In this section, we dissect the contributions of each type of features."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-86",
"text": "Table 2 shows the effect of different types of features on the development data sets for English and Chinese."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-87",
"text": "Each row in Table 2 uses a super set of features than the previous one."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-88",
"text": "Third-order grand-sibling parser is used as the baseline, and third-order tri-sibling, 5-gram grand-tri-sibling and 4-gram backed-off feature templates in Table 1 are incrementally added."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-89",
"text": "All systems use our proposed fourth-order parsing algorithm."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-90",
"text": "Since the only difference between systems is the set of features used, we can analyze the improvement from additional features."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-91",
"text": "From Table 2 , we can see that each of the following parser capturing a group of new feature templates makes improvement on parsing performance over the previous one."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-92",
"text": "Thus, we can conclude that the improvements come from the factorization's ability of capturing richer features which contains more context information."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-93",
"text": "The parser with all these features achieves UAS of 93.77% and CM of 50.82% on PTB and UAS of 87.74%, CM of 39.23% on CTB."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-95",
"text": "**RESULTS AND ANALYSIS**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-96",
"text": "Our parser obtains UAS of 93.4% and CM 50.3% of on PTB, and UAS of 87.4%, CM of 36.8% on CTB."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-97",
"text": "Both of the results are state-of-the-art performance on these two treebanks."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-98",
"text": "Table 3 illustrates the UAS and CM of the fourth-order parser on PTB, together with some relevant results from related work."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-99",
"text": "We compare our method to first-order and secondorder sibling dependency parsers (McDonald and Pereira, 2006) , and two third-order graphbased parsers (Koo and Collins, 2010) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-100",
"text": "Additionally, we compare to a state-of-the-art graphbased parser (Zhang and McDonald, 2012) as well as a state-of-the-art transition-based parser (Zhang and Nivre, 2011) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-101",
"text": "Our experimental results show an improvement in performance over the results in Zhang and Nivre (2011) , which are based on a transition-based dependency parser with rich non-local features."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-102",
"text": "Our results are also better than the results of the two third-order graph-based dependency parsing models in Koo and Collins (2010) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-103",
"text": "Moreover, our algorithm achieves better parsing performance than the generalized higher-order parser with cubepruning (Zhang and McDonald, 2012) , which is the state-of-the-art graph-based dependency parser so far."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-104",
"text": "The models marked \u2020 or \u2021 are not directly comparable to our work."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-105",
"text": "The models marked \u2020 use semi-supervised methods with large amount of unlabeled data, and those marked \u2021 utilize phrase-structure annotations, whiling our parser obtains results competitive with these works."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-106",
"text": "All three models marked \u2020 or \u2021 are based on the Carreras (2007) parser, which might be replaced by our fourth-order parser to get an even better performance."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-107",
"text": "Next, we turn to the impact of our fourth-order parser on Chinese."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-108",
"text": "Table 4 shows the comparative results for Chinese."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-109",
"text": "Here we compare our method to an implement of the third-order grand-sibling parser -whose parsing performance on CTB is not reported in Koo and Collins (2010) , and the dynamic programming transition-based parser of Huang and Sagae (2010) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-110",
"text": "Additionally, we compare to the state-of-the-art graph-based dependency parser (Zhang and McDonald, 2012) as well as a state-of-the-art transition-based parser (Zhang and Nivre, 2011) ."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-111",
"text": "The results indicates that our parser achieved significant improvement of the previous systems on this data set."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-112",
"text": "The parsing model of Zhang and Clark (2009) , which is marked \u2021, also depends on phrase-structure annotations."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-113",
"text": "So it cannot compare with ours directly, even through our results are better."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-115",
"text": "**PARSER**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-117",
"text": "**CONCLUSION**"
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-118",
"text": "We have presented an even higher-order projective dependency parsing algorithm that can evaluate the fourth-order sub-structures of grand-tri-siblings."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-119",
"text": "This algorithm achieves stage-ofthe-art performance on both PTB and CTB, which demonstrates that the fourth-order grandtri-sibling features have important contribution to dependency parsing."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-120",
"text": "A wide range of further research involving the fourth-order parsing algorithm is available."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-121",
"text": "One idea would be to identify the highest n for which the information of nth-order part still improves parsing performance."
},
{
"sent_id": "24506b0aa7a859eb8744e390f9fb60-C001-122",
"text": "Moreover, as the fourth-order parser has achieved state-of-theart accuracy on standard parsing benchmarks, many NLP tasks may benefit from it."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"24506b0aa7a859eb8744e390f9fb60-C001-13"
]
],
"cite_sentences": [
"24506b0aa7a859eb8744e390f9fb60-C001-13"
]
},
"@MOT@": {
"gold_contexts": [
[
"24506b0aa7a859eb8744e390f9fb60-C001-13",
"24506b0aa7a859eb8744e390f9fb60-C001-14",
"24506b0aa7a859eb8744e390f9fb60-C001-15"
]
],
"cite_sentences": [
"24506b0aa7a859eb8744e390f9fb60-C001-13",
"24506b0aa7a859eb8744e390f9fb60-C001-15"
]
},
"@USE@": {
"gold_contexts": [
[
"24506b0aa7a859eb8744e390f9fb60-C001-33"
],
[
"24506b0aa7a859eb8744e390f9fb60-C001-58"
],
[
"24506b0aa7a859eb8744e390f9fb60-C001-65"
],
[
"24506b0aa7a859eb8744e390f9fb60-C001-71"
],
[
"24506b0aa7a859eb8744e390f9fb60-C001-99"
]
],
"cite_sentences": [
"24506b0aa7a859eb8744e390f9fb60-C001-33",
"24506b0aa7a859eb8744e390f9fb60-C001-58",
"24506b0aa7a859eb8744e390f9fb60-C001-65",
"24506b0aa7a859eb8744e390f9fb60-C001-71",
"24506b0aa7a859eb8744e390f9fb60-C001-99"
]
},
"@DIF@": {
"gold_contexts": [
[
"24506b0aa7a859eb8744e390f9fb60-C001-102"
],
[
"24506b0aa7a859eb8744e390f9fb60-C001-109"
]
],
"cite_sentences": [
"24506b0aa7a859eb8744e390f9fb60-C001-102",
"24506b0aa7a859eb8744e390f9fb60-C001-109"
]
}
}
},
"ABC_4f1a48dc79b9a099783d7e63741883_7": {
"x": [
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-2",
"text": "Identifying peer-review helpfulness is an important task for improving the quality of feedback that students receive from their peers."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-3",
"text": "As a first step towards enhancing existing peerreview systems with new functionality based on helpfulness detection, we examine whether standard product review analysis techniques also apply to our new context of peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-4",
"text": "In addition, we investigate the utility of incorporating additional specialized features tailored to peer review."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-5",
"text": "Our preliminary results show that the structural features, review unigrams and meta-data combined are useful in modeling the helpfulness of both peer reviews and product reviews, while peer-review specific auxiliary features can further improve helpfulness prediction."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-8",
"text": "Peer reviewing of student writing has been widely used in various academic fields."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-9",
"text": "While existing web-based peer-review systems largely save instructors effort in setting up peer-review assignments and managing document assignment, there still remains the problem that the quality of peer reviews is often poor (Nelson and Schunn, 2009 )."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-10",
"text": "Thus to enhance the effectiveness of existing peer-review systems, we propose to automatically predict the helpfulness of peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-11",
"text": "In this paper, we examine prior techniques that have been used to successfully rank helpfulness for product reviews, and adapt them to the peer-review domain."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-12",
"text": "In particular, we use an SVM regression algorithm to predict the helpfulness of peer reviews based on generic linguistic features automatically mined from peer reviews and students' papers, plus specialized features based on existing knowledge about peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-13",
"text": "We not only demonstrate that prior techniques from product reviews can be successfully tailored to peer reviews, but also show the importance of peer-review specific features."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-14",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-15",
"text": "**RELATED WORK**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-16",
"text": "Prior studies of peer review in the Natural Language Processing field have not focused on helpfulness prediction, but instead have been concerned with issues such as highlighting key sentences in papers (Sandor and Vorndran, 2009) , detecting important feedback features in reviews (Cho, 2008; , and adapting peer-review assignment (Garcia, 2010) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-17",
"text": "However, given some similarity between peer reviews and other review types, we hypothesize that techniques used to predict review helpfulness in other domains can also be applied to peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-18",
"text": "Kim et al. (2006) used regression to predict the helpfulness ranking of product reviews based on various classes of linguistic features."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-19",
"text": "Ghose and Ipeirotis (2010) further examined the socio-economic impact of product reviews using a similar approach and suggested the usefulness of subjectivity analysis."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-20",
"text": "Another study (Liu et al., 2008) of movie reviews showed that helpfulness depends on reviewers' expertise, their writing style, and the timeliness of the review."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-21",
"text": "Tsur and Rappoport (2009) proposed RevRank to select the most helpful book reviews in an unsupervised fashion based on review lexicons."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-22",
"text": "However, studies of Amazon's product reviews also show that the per- Meta-data MET the overall ratings of papers assigned by reviewers, and the absolute difference between the rating and the average score given by all reviewers."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-23",
"text": "Table 1 : Generic features motivated by related work of product reviews (Kim et al., 2006) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-24",
"text": "ceived helpfulness of a review depends not only on its review content, but also on social effects such as product qualities, and individual bias in the presence of mixed opinion distribution (Danescu-NiculescuMizil et al., 2009 )."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-25",
"text": "Nonetheless, several properties distinguish our corpus of peer reviews from other types of reviews: 1) The helpfulness of our peer reviews is directly rated using a discrete scale from one to five instead of being defined as a function of binary votes (e.g. the percentage of \"helpful\" votes (Kim et al., 2006) ); 2) Peer reviews frequently refer to the related students' papers, thus review analysis needs to take into account paper topics; 3) Within the context of education, peer-review helpfulness often has a writing specific semantics, e.g. improving revision likelihood; 4) In general, peer-review corpora collected from classrooms are of a much smaller size compared to online product reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-26",
"text": "To tailor existing techniques to peer reviews, we will thus propose new specialized features to address these issues."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-28",
"text": "**DATA AND FEATURES**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-29",
"text": "In this study, we use a previously annotated peerreview corpus (Nelson and Schunn, 2009; Patchan et al., 2009 ), collected using a freely available webbased peer-review system (Cho and Schunn, 2007) in an introductory college history class."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-30",
"text": "The corpus consists of 16 papers (about six pages each) and 267 reviews (varying from twenty words to about two hundred words)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-31",
"text": "Two experts (a writing instructor and a content instructor) (Patchan et al., 2009) were asked to rate the helpfulness of each peer review on a scale from one to five (Pearson correlation r = 0.425, p < 0.01)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-32",
"text": "For our study, we consider the average ratings given by the two experts (which roughly follow a normal distribution) as the gold standard of review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-33",
"text": "Two example rated peer reviews (shown verbatim) follow:"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-34",
"text": "A helpful peer review of average-rating 5:"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-35",
"text": "The support and explanation of the ideas could use some work. broading the explanations to include all groups could be useful."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-36",
"text": "My concerns come from some of the claims that are put forth."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-37",
"text": "Page 2 says that the 13th amendment ended the war."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-38",
"text": "is this true? was there no more fighting or problems once this amendment was added? ..."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-39",
"text": "The arguments were sorted up into paragraphs, keeping the area of interest clear, but be careful about bringing up new things at the end and then simply leaving them there without elaboration (ie black sterilization at the end of the paragraph)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-40",
"text": "An unhelpful peer review of average-rating 1:"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-41",
"text": "Your paper and its main points are easy to find and to follow."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-42",
"text": "As shown in Table 1 , we first mine generic linguistic features from reviews and papers based on the results of syntactic analysis of the texts, aiming to replicate the feature sets used by Kim et al. (2006) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-43",
"text": "While structural, lexical and syntactic features are created in the same way as suggested in their paper, we adapt the semantic and meta-data features to peer reviews by converting the mentions of product properties to mentions of the history topics and by using paper ratings assigned by peers instead of product scores."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-45",
"text": "In addition, the following specialized features are motivated by an empirical study in cognitive science (Nelson and Schunn, 2009 ), which suggests that students' revision likelihood is significantly correlated with certain feedback features, and by our prior work for detecting these cognitive science constructs automatically:"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-46",
"text": "Cognitive-science features (cogS): For a given review, cognitive-science constructs that are significantly correlated with review implementation likelihood are manually coded for each idea unit (Nelson and Schunn, 2009 ) within the review."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-47",
"text": "Note, however, that peer-review helpfulness is rated for the whole review, which can include multiple idea units."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-48",
"text": "2 Therefore in our study, we calculate the distribution of feedbackType values (praise, problem, and summary) (kappa = .92), the percentage of problems that have problem localization -the presence of information indicating where the problem is localized in the related paper-(kappa = .69), and the percentage of problems that have a solutionthe presence of a solution addressing the problem mentioned in the review-(kappa = .79) to model peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-49",
"text": "These kappa values (Nelson and Schunn, 2009) were calculated from a subset of the corpus for evaluating the reliability of human annotations 3 . Consider the example of the helpful review presented in Section 3 which was manually separated into two idea units (each presented in a separate paragraph)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-50",
"text": "As both ideas are coded as problem with the presence of problem localization and solution, the cognitive-science features of this review are praise%=0, problem%=1, summary%=0, localization%=1, and solution%=1."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-51",
"text": "Lexical category features (LEX2): Ten categories of keyword lexicons developed for automatically detecting the previously manually annotated feedback types ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-52",
"text": "The categories are learned in a semi-supervised way based on syntactic and semantic functions, such as suggestion dents' papers using topic signature (Lin and Hovy, 2000) software kindly provided by Annie Louis."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-53",
"text": "Positive and negative sentiment words are extracted from the General Inquirer Dictionaries (http://www.wjh.harvard.edu/ inquirer/homecat.htm)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-54",
"text": "2 Details of different granularity levels of annotation can be found in (Nelson and Schunn, 2009) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-55",
"text": "3 These annotators are not the same experts who rated the peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-56",
"text": "modal verbs (e.g. should, must, might, could, need), negations (e.g. not, don't, doesn't), positive and negative words, and so on."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-57",
"text": "We first manually created a list of words that were specified as signal words for annotating feedbackType and problem localization in the coding manual; then we supplemented the list with words selected by a decision tree model learned using a Bag-of-Words representation of the peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-58",
"text": "These categories will also be helpful for reducing the feature space size as discussed below."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-59",
"text": "Localization features (LOC): Five features developed in our prior work for automatically identifying the manually coded problem localization tags, such as the percentage of problems in reviews that could be matched with a localization pattern (e.g. \"on page 5\", \"the section about\"), the percentage of sentences in which topic words exist between the subject and object, etc."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-61",
"text": "**EXPERIMENT AND RESULTS**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-62",
"text": "Following Kim et al. (2006) , we train our helpfulness model using SVM regression with a radial basis function kernel provided by SVM light (Joachims, 1999) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-63",
"text": "We first evaluate each feature type in isolation to investigate its predictive power of peerreview helpfulness; we then examine them together in various combinations to find the most useful feature set for modeling peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-64",
"text": "Performance is evaluated in 10-fold cross validation of our 267 peer reviews by predicting the absolute helpfulness scores (with Pearson correlation coefficient r) as well as by predicting helpfulness ranking (with Spearman rank correlation coefficient r s )."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-65",
"text": "Although predicted helpfulness ranking could be directly used to compare the helpfulness of a given set of reviews, predicting helpfulness rating is desirable in practice to compare helpfulness between existing reviews and new written ones without reranking all previously ranked reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-66",
"text": "Results are presented regarding the generic features and the specialized features respectively, with 95% confidence bounds."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-68",
"text": "**PERFORMANCE OF GENERIC FEATURES**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-69",
"text": "Evaluation of the generic features is presented in Table 2 , showing that all classes except syntactic (SYN) and meta-data (MET) features are sig-nificantly correlated with both helpfulness rating (r) and helpfulness ranking (r s )."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-70",
"text": "Structural features (bolded) achieve the highest Pearson (0.60) and Spearman correlation coefficients (0.59) (although within the significant correlations, the difference among coefficients are insignificant)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-71",
"text": "Note that in isolation, MET (paper ratings) are not significantly correlated with peer-review helpfulness, which is different from prior findings of product reviews (Kim et al., 2006) where product scores are significantly correlated with product-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-72",
"text": "However, when combined with other features, MET does appear to add value (last row)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-73",
"text": "When comparing the performance between predicting helpfulness ratings versus ranking, we observe r \u2248 r s consistently for our peer reviews, while Kim et al. (2006) reported r < r s for product reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-74",
"text": "4 Finally, we observed a similar feature redundancy effect as Kim et al. (2006) did, in that simply combining all features does not improve the model's performance."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-75",
"text": "Interestingly, our best feature combination (last row) is the same as theirs."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-76",
"text": "In sum our results verify our hypothesis that the effectiveness of generic features can be transferred to our peerreview domain for predicting review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-78",
"text": "**ANALYSIS OF THE SPECIALIZED FEATURES**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-79",
"text": "Evaluation of the specialized features is shown in Table 3 , where all features examined are signifi- 4 The best performing single feature type reported (Kim et al., 2006) was review unigrams: r = 0.398 and rs = 0.593."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-80",
"text": "cantly correlated with both helpfulness rating and ranking."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-81",
"text": "When evaluated in isolation, although specialized features have weaker correlation coefficients ([0.43, 0.51] ) than the best generic features, these differences are not significant, and the specialized features have the potential advantage of being theory-based."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-82",
"text": "The use of features related to meaningful dimensions of writing has contributed to validity and greater acceptability in the related area of automated essay scoring (Attali and Burstein, 2006) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-83",
"text": "When combined with some generic features, the specialized features improve the model's performance in terms of both r and r s compared to the best performance in Section 4.1 (the baseline)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-84",
"text": "Though the improvement is not significant yet, we think it still interesting to investigate the potential trend to understand how specialized features capture additional information of peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-85",
"text": "Therefore, the following analysis is also presented (based on the absolute mean values), where we start from the baseline feature set, and gradually expand it by adding our new specialized features: 1) We first replace the raw lexical unigram features (UGR) with lexical category features (LEX2), which slightly improves the performance before rounding to the significant digits shown in row 5."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-86",
"text": "Note that the categories not only substantially abstract lexical information from the reviews, but also carry simple syntactic and semantic information."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-87",
"text": "2) We then add one semantic class -topic words (row 6), which enhances the performance further."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-88",
"text": "Semantic features did not help when working with generic lexical features in Section 4.1 (second to last row in Table 2 ), but they can be successfully combined with the lexical category features and further improve the performance as indicated here."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-89",
"text": "3) When cognitive-science and localization features are introduced, the prediction becomes even more accurate, which reaches a Pearson correlation of 0.67 and a Spearman correlation of 0.67 (Table 3 , last row)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-91",
"text": "**DISCUSSION**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-92",
"text": "Despite the difference between peer reviews and other types of reviews as discussed in Section 2, our work demonstrates that many generic linguistic features are also effective in predicting peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-93",
"text": "The model's performance can be alter- natively achieved and further improved by adding auxiliary features tailored to peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-94",
"text": "These specialized features not only introduce domain expertise, but also capture linguistic information at an abstracted level, which can help avoid the risk of over-fitting."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-95",
"text": "Given only 267 peer reviews in our case compared to more than ten thousand product reviews (Kim et al., 2006) , this is an important consideration."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-96",
"text": "Though our absolute quantitative results are not directly comparable to the results of Kim et al. (2006) , we indirectly compared them by analyzing the utility of features in isolation and combined."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-97",
"text": "While STR+UGR+MET is found as the best combination of generic features for both types of reviews, the best individual feature type is different (review unigrams work best for product reviews; structural features work best for peer reviews)."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-98",
"text": "More importantly, meta-data, which are found to significantly affect the perceived helpfulness of product reviews (Kim et al., 2006; Danescu-Niculescu-Mizil et al., 2009) , have no predictive power for peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-99",
"text": "Perhaps because the paper grades and other helpfulness ratings are not visible to the reviewers, we have less of a social dimension for predicting the helpfulness of peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-100",
"text": "We also found that SVM regression does not favor ranking over predicting helpfulness as in (Kim et al., 2006) ."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-102",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-103",
"text": "The contribution of our work is three-fold: 1) Our work successfully demonstrates that techniques used in predicting product review helpfulness ranking can be effectively adapted to the domain of peer reviews, with minor modifications to the semantic and metadata features."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-104",
"text": "2) Our qualitative comparison shows that the utility of generic features (e.g. meta-data features) in predicting review helpfulness varies between different review types."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-105",
"text": "3) We further show that prediction performance could be improved by incorporating specialized features that capture helpfulness information specific to peer reviews."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-106",
"text": "In the future, we would like to replace the manually coded peer-review specialized features (cogS) with their automatic predictions, since we have already shown in our prior work that some important cognitive-science constructs can be successfully identified automatically."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-107",
"text": "5 Also, it is interesting to observe that the average helpfulness ratings assigned by experts (used as the gold standard in this study) differ from those given by students."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-108",
"text": "Prior work on this corpus has already shown that feedback features of review comments differ not only between students and experts, but also between the writing and the content experts (Patchan et al., 2009 )."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-109",
"text": "While Patchan et al. (2009) focused on the review comments, we hypothesize that there is also a difference in perceived peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-110",
"text": "Therefore, we are planning to investigate the impact of these different helpfulness ratings on the utilities of features used in modeling peer-review helpfulness."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-111",
"text": "Finally, we would like to integrate our helpfulness model into a web-based peer-review system to improve the quality of both peer reviews and paper revisions."
},
{
"sent_id": "4f1a48dc79b9a099783d7e63741883-C001-1",
"text": "**ABSTRACT**"
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"4f1a48dc79b9a099783d7e63741883-C001-25"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-71"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-73"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-94",
"4f1a48dc79b9a099783d7e63741883-C001-95"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-100"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-98"
]
],
"cite_sentences": [
"4f1a48dc79b9a099783d7e63741883-C001-25",
"4f1a48dc79b9a099783d7e63741883-C001-71",
"4f1a48dc79b9a099783d7e63741883-C001-73",
"4f1a48dc79b9a099783d7e63741883-C001-95",
"4f1a48dc79b9a099783d7e63741883-C001-100",
"4f1a48dc79b9a099783d7e63741883-C001-98"
]
},
"@USE@": {
"gold_contexts": [
[
"4f1a48dc79b9a099783d7e63741883-C001-42"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-62"
],
[
"4f1a48dc79b9a099783d7e63741883-C001-96"
]
],
"cite_sentences": [
"4f1a48dc79b9a099783d7e63741883-C001-42",
"4f1a48dc79b9a099783d7e63741883-C001-62",
"4f1a48dc79b9a099783d7e63741883-C001-96"
]
},
"@SIM@": {
"gold_contexts": [
[
"4f1a48dc79b9a099783d7e63741883-C001-74"
]
],
"cite_sentences": [
"4f1a48dc79b9a099783d7e63741883-C001-74"
]
},
"@MOT@": {
"gold_contexts": [
[
"4f1a48dc79b9a099783d7e63741883-C001-94",
"4f1a48dc79b9a099783d7e63741883-C001-95"
]
],
"cite_sentences": [
"4f1a48dc79b9a099783d7e63741883-C001-95"
]
}
}
},
"ABC_206b65ee4e69a01e8a0892dc0f2b30_7": {
"x": [
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-2",
"text": "Relation extraction suffers from a performance loss when a model is applied to out-of-domain data."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-3",
"text": "This has fostered the development of domain adaptation techniques for relation extraction."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-4",
"text": "This paper evaluates word embeddings and clustering on adapting feature-based relation extraction systems."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-5",
"text": "We systematically explore various ways to apply word embeddings and show the best adaptation improvement by combining word cluster and word embedding information."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-6",
"text": "Finally, we demonstrate the effectiveness of regularization for the adaptability of relation extractors."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-9",
"text": "The goal of Relation Extraction (RE) is to detect and classify relation mentions between entity pairs into predefined relation types such as Employment or Citizenship relationships."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-10",
"text": "Recent research in this area, whether feature-based (Kambhatla, 2004; Boschee et al., 2005; Zhou et al., 2005; Grishman et al., 2005; Jiang and Zhai, 2007a; Chan and Roth, 2010; Sun et al., 2011) or kernelbased (Zelenko et al., 2003; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhang et al., 2006; Qian et al., 2008; Nguyen et al., 2009) , attempts to improve the RE performance by enriching the feature sets from multiple sentence analyses and knowledge resources."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-11",
"text": "The fundamental assumption of these supervised systems is that the training data and the data to which the systems are applied are sampled independently and identically from the same distribution."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-12",
"text": "When there is a mismatch between data distributions, the RE performance of these systems tends to degrade dramatically (Plank and Moschitti, 2013) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-13",
"text": "This is where we need to resort to domain adaptation techniques (DA) to adapt a model trained on one domain (the source domain) into a new model which can perform well on new domains (the target domains)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-14",
"text": "The consequences of linguistic variation between training and testing data on NLP tools have been studied extensively in the last couple of years for various NLP tasks such as Part-of-Speech tagging (Blitzer et al., 2006; Huang and Yates, 2010; Schnabel and Sch\u00fctze, 2014) , named entity recognition (Daum\u00e9 III, 2007) and sentiment analysis (Blitzer et al., 2007; Daum\u00e9 III, 2007; Daum\u00e9 III et al., 2010; Blitzer et al., 2011) , etc."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-15",
"text": "Unfortunately, there is very little work on domain adaptation for RE."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-16",
"text": "The only study explicitly targeting this problem so far is by Plank and Moschitti (2013) who find that the out-of-domain performance of kernel-based relation extractors can be improved by embedding semantic similarity information generated from word clustering and latent semantic analysis (LSA) into syntactic tree kernels."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-17",
"text": "Although this idea is interesting, it suffers from two major limitations: + It does not incorporate word cluster information at different levels of granularity."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-18",
"text": "In fact, Plank and Moschitti (2013) only use the 10-bit cluster prefix in their study."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-19",
"text": "We will demonstrate later that the adaptability of relation extractors can benefit significantly from the addition of word cluster features at various granularities."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-20",
"text": "+ It is unclear if this approach can encode realvalued features of words (such as word embeddings (Mnih and Hinton, 2007; Collobert and Weston, 2008) ) effectively."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-21",
"text": "As the real-valued features are able to capture latent yet useful properties of words, the augmentation of lexical terms with these features is desirable to provide a more general representation, potentially helping relation extractors perform more robustly across domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-22",
"text": "In this work, we propose to avoid these limitations by applying a feature-based approach for RE which allows us to integrate various word features of generalization into a single system more natu-rally and effectively."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-23",
"text": "The application of word representations such as word clusters in domain adaptation of RE (Plank and Moschitti, 2013 ) is motivated by its successes in semi-supervised methods (Chan and Roth, 2010; Sun et al., 2011) where word representations help to reduce data-sparseness of lexical information in the training data."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-24",
"text": "In DA terms, since the vocabularies of the source and target domains are usually different, word representations would mitigate the lexical sparsity by providing general features of words that are shared across domains, hence bridge the gap between domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-25",
"text": "The underlying hypothesis here is that the absence of lexical target-domain features in the source domain can be compensated by these general features to improve RE performance on the target domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-26",
"text": "We extend this motivation by further evaluating word embeddings (Bengio et al., 2001; Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Turian et al., 2010) on feature-based methods to adapt RE systems to new domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-27",
"text": "We explore the embedding-based features in a principled way and demonstrate that word embedding itself is also an effective representation for domain adaptation of RE."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-28",
"text": "More importantly, we show empirically that word embeddings and word clusters capture different information and their combination would further improve the adaptability of relation extractors."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-30",
"text": "**REGULARIZATION**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-31",
"text": "Given the more general representations provided by word representations above, how can we learn a relation extractor from the labeled source domain data that generalizes well to new domains?"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-32",
"text": "In traditional machine learning where the challenge is to utilize the training data to make predictions on unseen data points (generated from the same distribution as the training data), the classifier with a good generalization performance is the one that not only fits the training data, but also avoids ovefitting over it."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-33",
"text": "This is often obtained via regularization methods to penalize complexity of classifiers."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-34",
"text": "Exploiting the shared interest in generalization performance with traditional machine learning, in domain adaptation for RE, we would prefer the relation extractor that fits the source domain data, but also circumvents the overfitting problem over this source domain 1 so that it could generalize well on new domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-35",
"text": "Eventually, regularization methods can be considered naturally as a simple yet general technique to cope with DA problems."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-36",
"text": "Following Plank and Moschitti (2013) , we assume that we only have labeled data in a single source domain but no labeled as well as unlabeled target data."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-37",
"text": "Moreover, we consider the singlesystem DA setting where we construct a single system able to work robustly with different but related domains (multiple target domains)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-38",
"text": "This setting differs from most previous studies (Blitzer et al., 2006) on DA which have attempted to design a specialized system for every specific target domain."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-39",
"text": "In our view, although this setting is more challenging, it is more practical for RE."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-40",
"text": "In fact, this setting can benefit considerably from our general approach of applying word representations and regularization."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-41",
"text": "Finally, due to this setting, the best way to set up the regularization parameter is to impose the same regularization parameter on every feature rather than a skewed regularization (Jiang and Zhai, 2007b) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-43",
"text": "**RELATED WORK**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-44",
"text": "Although word embeddings have been successfully employed in many NLP tasks (Collobert and Weston, 2008; Turian et al., 2010; Maas and Ng, 2010) , the application of word embeddings in RE is very recent."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-64",
"text": "We apply the same feature set as Sun et al. (2011) but remove the entity and mention type information 2 ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-45",
"text": "Kuksa et al. (2010) propose an abstraction-augmented string kernel for bio-relation extraction via word embeddings."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-46",
"text": "In the surge of deep learning, Socher et al. (2012) and Khashabi (2013) use pre-trained word embeddings as input for Matrix-Vector Recursive Neural Networks (MV-RNN) to learn compositional structures for RE."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-47",
"text": "However, none of these works evaluate word embeddings for domain adaptation of RE which is our main focus in this paper."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-48",
"text": "Regarding domain adaptation, in representation learning, Blitzer et al. (2006) propose structural correspondence learning (SCL) while Huang and Yates (2010) attempt to learn a multi-dimensional feature representation."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-49",
"text": "Unfortunately, these methods require unlabeled target domain data which are unavailable in our single-system setting of DA."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-50",
"text": "Daum\u00e9 III (2007) proposes an easy adaptation framework (EA) which is later extended to a semisupervised version (EA++) to incorporate unla-beled data (Daum\u00e9 III et al., 2010) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-51",
"text": "In terms of word embeddings for DA, recently, Xiao and Guo (2013) present a log-bilinear language adaptation framework for sequential labeling tasks."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-52",
"text": "However, these methods assume some labeled data in target domains and are thus not applicable in our setting of unsupervised DA."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-53",
"text": "Above all, we move one step further by evaluating the effectiveness of word embeddings on domain adaptation for RE which is very different from the principal topic of sequence labeling in the previous research."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-55",
"text": "**WORD REPRESENTATIONS**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-56",
"text": "We consider two types of word representations and use them as additional features in our DA system, namely Brown word clustering (Brown et al., 1992) and word embeddings (Bengio et al., 2001) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-57",
"text": "While word clusters can be recognized as an one-hot vector representation over a small vocabulary, word embeddings are dense, lowdimensional, and real-valued vectors (distributed representations)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-58",
"text": "Each dimension of the word embeddings expresses a latent feature of the words, hopefully reflecting useful semantic and syntactic regularities (Turian et al., 2010) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-59",
"text": "We investigate word embeddings induced by two typical language models: Collobert and Weston (2008) embeddings (C&W) (Collobert and Weston, 2008; Turian et al., 2010) and Hierarchical log-bilinear embeddings (HLBL) (Mnih and Hinton, 2007; Mnih and Hinton, 2009; Turian et al., 2010) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-60",
"text": "5 Feature Set 5.1 Baseline Feature Set Sun et al. (2011) utilize the full feature set from (Zhou et al., 2005) plus some additional features and achieve the state-of-the-art feature-based RE system."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-61",
"text": "Unfortunately, this feature set includes the human-annotated (gold-standard) information on entity and mention types which is often missing or noisy in reality (Plank and Moschitti, 2013) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-62",
"text": "This issue becomes more serious in our setting of single-system DA where we have a single source domain with multiple dissimilar target domains and an automatic system able to recognize entity and mention types very well in different domains may not be available."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-63",
"text": "Therefore, following the settings of Plank and Moschitti (2013) , we will only assume entity boundaries and not rely on the gold standard information in the experiments."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-66",
"text": "**LEXICAL FEATURE AUGMENTATION**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-67",
"text": "While Sun et al. (2011) show that adding word clusters to the heads of the two mentions is the most effective way to improve the generalization accuracy, the right lexical features into which word embeddings should be introduced to obtain the best adaptability improvement are unexplored."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-68",
"text": "Also, which dimensionality of which word embedding should we use with which lexical features?"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-69",
"text": "In order to answer these questions, following Sun et al. (2011) , we first group lexical features into 4 groups and rank their importance based on linguistic intuition and illustrations of the contributions of different lexical features from various featurebased RE systems."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-70",
"text": "After that, we evaluate the effectiveness of these lexical feature groups for word embedding augmentation individually and incrementally according to the rank of importance."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-71",
"text": "For each of these group combinations, we assess the system performance with different numbers of dimensions for both C&W and HLBL word embeddings."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-72",
"text": "Let M1 and M2 be the first and second mentions in the relation."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-73",
"text": "to facilitate system comparison later."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-74",
"text": "We evaluate C&W word embeddings with 25, 50 and 100 dimensions as well as HLBL word embeddings with 50 and 100 dimensions that are introduced in Turian et al. (2010) and can be downloaded here 4 ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-75",
"text": "The fact that we utilize the large, general and unbiased resources generated from the previous works for evaluation not only helps to verify the effectiveness of the resources across different tasks and settings but also supports our setting of single-system DA."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-76",
"text": "We use the ACE 2005 corpus for DA experiments (as in Plank and Moschitti (2013) )."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-77",
"text": "It involves 6 relation types and 6 domains: broadcast news (bn), newswire (nw), broadcast conversation (bc), telephone conversation (cts), weblogs (wl) and usenet (un)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-78",
"text": "We follow the standard practices on ACE (Plank and Moschitti, 2013) and use news (the union of bn and nw) as the source domain and bc, cts and wl as our target domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-79",
"text": "We take half of bc as the only target development set, and use the remaining data and domains for testing purposes (as they are small already)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-80",
"text": "As noted in Plank and Moschitti (2013) , the distributions of relations as well as the vocabularies of the domains are quite different."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-82",
"text": "**EVALUATION OF WORD EMBEDDING FEATURES**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-83",
"text": "We investigate the effectiveness of word embeddings on lexical features by following the procedure described in Section 5.2."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-84",
"text": "We test our system on two scenarios: In-domain: the system is trained and evaluated on the source domain (bn+nw, 5-fold cross validation); Out-of-domain: the system is trained on the source domain and evaluated on the target development set of bc (bc dev)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-104",
"text": "However, the performance order across domains of the two baselines are the same."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-105",
"text": "Besides, the baseline performance is improved over all target domains when the system is enriched with word cluster features of the 10 prefix length only (row 2)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-106",
"text": "(ii): Over all the target domains, the performance of the system augmented with word cluster features of various granularities (row 3) is superior to that when only cluster features for the prefix length 10 are added (row 2)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-107",
"text": "This is significant (at confidence level \u2265 95%) for domains bc and wl and verifies our assumption that various granularities for word cluster features are more effective than a single granularity for domain adaptation of RE."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-108",
"text": "(iii): Row 4 shows that word embedding itself is also very useful for domain adaptation in RE since it improves the baseline system for all the target domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-109",
"text": "(iv): In row 5, we see that the addition of both word cluster and word embedding features improves the system further and results in the best performance over all target domains (this is significant with confidence level \u2265 95% in domains bc and wl)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-110",
"text": "The result suggests that word embeddings seem to capture different information from word clusters and their combination would be effective to generalize relation extractors across domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-111",
"text": "However, in domain cts, the improvement that word embeddings provide for word clusters is modest."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-112",
"text": "This is because the RCV1 corpus used to induce the word embeddings (Turian et al., 2010) does not cover spoken language words in cts very well."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-113",
"text": "(v): Finally, the in-domain performance is also improved consistently demonstrating the robustness of word representations (Plank and Moschitti, 2013) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-115",
"text": "**DOMAIN ADAPTATION WITH REGULARIZATION**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-116",
"text": "All the experiments we have conducted so far do not apply regularization for training."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-117",
"text": "In this section, in order to evaluate the effect of regularization on the generalization capacity of relation extractors across domains, we replicate all the experiments in Section 6.3 but apply regularization when relation extractors are trained 6 ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-118",
"text": "All the improvements over the baseline in Table 4 are significant at confidence level \u2265 95%."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-119",
"text": "For this experiment, every statement in (ii), (iii), (iv) and (v) of Section 6.3 also holds."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-120",
"text": "More importantly, the performance in every cell of Table 4 is significantly better than the corresponding cell in Table 3 (5% or better gain in F measure, a significant improvement at confidence level \u2265 95%)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-121",
"text": "This demonstrates the effectiveness of regularization for RE in general and for domain adaptation of RE specifically."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-85",
"text": "Table 2 presents the F measures of this experiment 5 (the 4 http://metaoptimize.com/projects/ wordreprs/ 5 All the in-domain improvement in rows 2, 6, 7 of Table 2 are significant at confidence levels \u2265 95%."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-86",
"text": "suffix ED in lexical group names is to indicate the embedding features)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-87",
"text": "From the tables, we find that for C&W and HLBL embeddings of 50 and 100 dimensions, the most effective way to introduce word embeddings is to add embeddings to the heads of the two mentions (row 2; both in-domain and out-of-domain) although it is less pronounced for HLBL embedding with 50 dimensions."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-88",
"text": "Interestingly, for C&W embedding with 25 dimensions, adding the embedding to both heads and words of the two mentions (row 6) performs the best for both in-domain and out-of-domain scenarios."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-89",
"text": "This is new compared to the word cluster features where the heads of the two mentions are always the best places for augmentation (Sun et al., 2011) ."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-90",
"text": "It suggests that a suitable amount of embeddings for words in the mentions might be useful for the augmentation of the heads and inspires further exploration."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-91",
"text": "Introducing embeddings to words of mentions alone has mild impact while it is generally a bad idea to augment chunk heads and words in the contexts."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-92",
"text": "Comparing C&W and HLBL embeddings is somewhat more complicated."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-93",
"text": "For both in-domain and out-of-domain settings with different numbers of dimensions, C&W embedding outperforms HLBL embedding when only the heads of the mentions are augmented, while the negative impact of HLBL embedding on chunk heads and context words appears less severe than that of C&W."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-94",
"text": "Regarding the incremental addition of features (rows 6, 7, 8), C&W is better for out-of-domain performance when 50 dimensions are used, whereas HLBL (with both 50 and 100 dimensions) is more effective for the in-domain setting."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-95",
"text": "For the next experiments, we will apply the C&W embedding of 50 dimensions to the heads of the mentions for its best out-of-domain performance."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-97",
"text": "**DOMAIN ADAPTATION WITH WORD EMBEDDINGS**"
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-98",
"text": "This section examines the effectiveness of word representations for RE across domains."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-99",
"text": "We evaluate word cluster and embedding (denoted by ED) features by adding them individually as well as simultaneously into the baseline feature set."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-100",
"text": "For word clusters, we experiment with two possibilities: (i) only using a single prefix length of 10 (as Plank and Moschitti (2013) did) (denoted by WC10) and (ii) applying multiple prefix lengths of 4, 6, 8, 10 together with the full string (denoted by WC)."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-101",
"text": "(i): The baseline system achieves a performance of 51.4% within its own domain while the performance on target domains bc, cts, wl drops to 49.7%, 41.5% and 36.6% respectively."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-102",
"text": "Our baseline performance is worse than that of Plank and Moschitti (2013) only on the target domain cts and better in the other cases."
},
{
"sent_id": "206b65ee4e69a01e8a0892dc0f2b30-C001-103",
"text": "This might be explained by the difference between our baseline feature set and the feature set underlying their kernel-based system."
}
],
"y": {
"@MOT@": {
"gold_contexts": [
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-11",
"206b65ee4e69a01e8a0892dc0f2b30-C001-12"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-17",
"206b65ee4e69a01e8a0892dc0f2b30-C001-18"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-61",
"206b65ee4e69a01e8a0892dc0f2b30-C001-63"
]
],
"cite_sentences": [
"206b65ee4e69a01e8a0892dc0f2b30-C001-12",
"206b65ee4e69a01e8a0892dc0f2b30-C001-18",
"206b65ee4e69a01e8a0892dc0f2b30-C001-61",
"206b65ee4e69a01e8a0892dc0f2b30-C001-63"
]
},
"@BACK@": {
"gold_contexts": [
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-16"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-17",
"206b65ee4e69a01e8a0892dc0f2b30-C001-18"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-23"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-80"
]
],
"cite_sentences": [
"206b65ee4e69a01e8a0892dc0f2b30-C001-16",
"206b65ee4e69a01e8a0892dc0f2b30-C001-18",
"206b65ee4e69a01e8a0892dc0f2b30-C001-23",
"206b65ee4e69a01e8a0892dc0f2b30-C001-80"
]
},
"@USE@": {
"gold_contexts": [
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-36"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-63"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-76"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-78"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-100"
]
],
"cite_sentences": [
"206b65ee4e69a01e8a0892dc0f2b30-C001-36",
"206b65ee4e69a01e8a0892dc0f2b30-C001-63",
"206b65ee4e69a01e8a0892dc0f2b30-C001-76",
"206b65ee4e69a01e8a0892dc0f2b30-C001-78",
"206b65ee4e69a01e8a0892dc0f2b30-C001-100"
]
},
"@DIF@": {
"gold_contexts": [
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-102"
],
[
"206b65ee4e69a01e8a0892dc0f2b30-C001-113"
]
],
"cite_sentences": [
"206b65ee4e69a01e8a0892dc0f2b30-C001-102",
"206b65ee4e69a01e8a0892dc0f2b30-C001-113"
]
}
}
},
"ABC_91e869971f139d90e36f73b1089877_7": {
"x": [
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-2",
"text": "BLEU is widely considered to be an informative metric for text-to-text generation, including Text Simplification (TS)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-3",
"text": "TS includes both lexical and structural aspects."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-4",
"text": "In this paper we show that BLEU is not suitable for the evaluation of sentence splitting, the major structural simplification operation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-5",
"text": "We manually compiled a sentence splitting gold standard corpus containing multiple structural paraphrases, and performed a correlation analysis with human judgments."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-6",
"text": "We find low or no correlation between BLEU and the grammaticality and meaning preservation parameters where sentence splitting is involved."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-7",
"text": "Moreover, BLEU often negatively correlates with simplicity, essentially penalizing simpler sentences."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-10",
"text": "BLEU (Papineni et al., 2002 ) is an n-grambased evaluation metric, widely used for Machine Translation (MT) evaluation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-11",
"text": "BLEU has also been applied to monolingual translation tasks, such as grammatical error correction (Park and Levy, 2011) , summarization (Graham, 2015) and text simplification (Narayan and Gardent, 2014; Stajner et al., 2015; Xu et al., 2016) , i.e. the rewriting of a sentence as one or more simpler sentences."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-12",
"text": "Along with the application of parallel corpora and MT techniques for TS (e.g., Zhu et al., 2010; Wubben et al., 2012; Narayan and Gardent, 2014) , BLEU became the main automatic metric for TS, despite its deficiencies (see \u00a72)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-13",
"text": "Indeed, focusing on lexical simplification, Xu et al. (2016) argued that BLEU gives high scores to sentences that are close or even identical to the input, especially when multiple references are used."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-14",
"text": "In their experiments, BLEU failed to predict simplicity, but obtained a higher correlation with grammaticality and meaning preservation, relative to the SARI metric they proposed."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-15",
"text": "In this paper, we further explore the applicability of BLEU for TS evaluation, examining BLEU's informativeness where sentence splitting is involved."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-16",
"text": "Sentence splitting, namely the rewriting of a single sentence as multiple sentences while preserving its meaning, is the main structural simplification operation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-17",
"text": "It has been shown useful for MT preprocessing (Chandrasekar et al., 1996; Mishra et al., 2014; Li and Nenkova, 2015) and human comprehension (Mason and Kendall, 1979; Williams et al., 2003) , independently from other lexical and structural simplification operations."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-41",
"text": "In particular, Callison-Burch et al. (2006) showed that BLEU may not correlate in some cases with human judgments since a huge number of potential translations have the same BLEU score, and that correlation decreases when translation quality is low."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-42",
"text": "Some of the reported shortcomings are relevant to monolingual translation, such as the impossibility to capture synonyms and paraphrases that are not in the reference set, or the uniform weighting of words."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-43",
"text": "[...] structural operations are involved (Nisioi et al., 2017; Sulem et al., 2018b)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-45",
"text": "**BLEU IN TS.**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-46",
"text": "While BLEU is standardly used for TS evaluation (e.g., Xu et al., 2016; Nisioi et al., 2017; Zhang and Lapata, 2017; Ma and Sun, 2017), only a few works have tested its correlation with human judgments."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-47",
"text": "Using 20 source sentences from the PWKP test corpus (Zhu et al., 2010) with 5 simplified sentences for each of them, Wubben et al. (2012) reported positive correlation of BLEU with simplicity ratings, but no correlation with adequacy."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-48",
"text": "T-BLEU (\u0160tajner et al., 2014), a variant of BLEU which uses lower n-grams when no overlapping 4-grams are found, was tested on outputs that applied only structural modifications to the source."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-49",
"text": "It was found to have moderate positive correlation for meaning preservation, and positive but low correlation for grammaticality."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-50",
"text": "Correlation with simplicity was not considered in this experiment."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-51",
"text": "Xu et al. (2016) focused on lexical simplification, finding that BLEU obtains reasonable correlation for grammaticality and meaning preservation but fails to capture simplicity, even when multiple references are used."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-52",
"text": "To our knowledge, no previous work has examined the behavior of BLEU on sentence splitting, which we investigate here using a manually compiled gold standard."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-54",
"text": "**GOLD-STANDARD SPLITTING CORPUS**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-55",
"text": "In order to investigate the effect of correctly splitting sentences on the automatic metric scores, we build a parallel corpus, where each sentence is modified by 4 annotators, according to specific sentence splitting guidelines."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-56",
"text": "We use the complex side of the test corpus of Xu et al. (2016) ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-57",
"text": "While Narayan et al. (2017) recently proposed the semi-automatically compiled WEB-SPLIT dataset for training automatic sentence splitting systems, here we generate a completely manual corpus, without a priori splitting points and without presupposing that all sentences should be split."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-58",
"text": "This corpus enriches the set of references focused on lexical operations that were collected by Xu et al. (2016) for the same source sentences and can also be used as an out-of-domain test set for Split-and-Rephrase (Narayan et al., 2017) ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-59",
"text": "We use two sets of guidelines."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-60",
"text": "In Set 1, annotators are required to split the original as much as possible, while preserving the sentence's grammaticality, fluency and meaning."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-61",
"text": "The guidelines include two sentence splitting examples."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-62",
"text": "In Set 2, annotators are encouraged to split only in cases where it simplifies the original sentence."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-63",
"text": "That is, simplicity is implicit in Set 1 and explicit in Set 2."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-64",
"text": "In both sets, the annotators are instructed to leave the source unchanged if splitting violates grammaticality, fluency or meaning preservation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-65",
"text": "Each set of guidelines is used by two annotators, with native or native-like proficiency in English."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-66",
"text": "The obtained corpora are denoted by HSplit1, HSplit2 (for Set 1), and HSplit3 and HSplit4 (for Set 2), each containing 359 sentences."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-67",
"text": "Table 1 presents statistics for the corpora."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-68",
"text": "Both in terms of the number of splits per sentence (# Sents) and in terms of the proportion of input sentences that have been split (SplitSents), we observe that the average difference within each set is significantly greater than the average difference between the sets."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-69",
"text": "This suggests that the number of splits is less affected by the explicit mention of simplicity than by the inter-annotator variability."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-70",
"text": "#Sents denotes the average number of sentences in the output."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-71",
"text": "SplitSents denotes the proportion of input sentences that have been split."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-72",
"text": "The last row presents the average scores of the 4 HSplit corpora."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-74",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-76",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-77",
"text": "Metrics."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-78",
"text": "In addition to BLEU, we also experiment with (1) iBLEU (Sun and Zhou, 2012), which was recently used for TS (Xu et al., 2016) and which takes into account the BLEU scores of the output against the input and against the references; (2) the Flesch-Kincaid Grade Level (FK; Kincaid et al., 1975), computed at the system level, which estimates the readability of the text, with a lower value indicating higher [footnote 4: Examples are taken from Siddharthan (2006).]"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-79",
"text": "Footnote 5: Examples are not provided in the case of Set 2 so as not to give an a priori notion of simplicity."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-80",
"text": "The complete guidelines are found in the supplementary material."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-81",
"text": "Footnote 6: Wilcoxon's signed-rank test, p = 1.6 \u00b7 10^-5 for #Sents and p = 0.002 for SplitSents."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-82",
"text": "Footnote 7: System-level BLEU scores are computed using the multi-bleu Moses support tool."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-83",
"text": "Sentence-level BLEU scores are computed using NLTK (Loper and Bird, 2002)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-84",
"text": "readability; (3) SARI (Xu et al., 2016), which compares the n-grams of the system output with those of the input and the human references, separately evaluating the quality of words that are added, deleted and kept by the systems."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-85",
"text": "For completeness, we also experiment with the negative Levenshtein distance to the source (-LD_SC), which serves as a measure of conservatism."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-86",
"text": "We explore two settings."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-87",
"text": "In one (\"Standard Reference Setting\", \u00a74.2), we use two sets of references: the Simple Wikipedia reference (yielding BLEU-1ref and iBLEU-1ref), and 8 references obtained by crowdsourcing by Xu et al. (2016) (yielding BLEU-8ref, iBLEU-8ref and SARI-8ref)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-88",
"text": "In the other (\"HSplit as Reference Setting\", \u00a74.3), we use HSplit as the reference set."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-89",
"text": "Systems."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-90",
"text": "For \"Standard Reference Setting\", we consider both a case where evaluated systems do not perform any splittings on the test set (\"Systems/Corpora without Splits\"), and one where we evaluate these systems, along with the HSplit corpus, used in the role of system outputs (\"All Systems/Corpora\")."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-91",
"text": "Systems include six MT-based simplification systems, among them the state-of-the-art neural TS system of Nisioi et al. (2017) in four variants: default settings or initialization with word2vec, and for each, both the highest-ranked and the fourth-ranked hypotheses in the beam."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-92",
"text": "We further include Moses (Koehn et al., 2007) and SBMT-SARI (Xu et al., 2016), a syntax-based MT system tuned against SARI, as well as the identity function (outputs identical to inputs)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-93",
"text": "The case which evaluates outputs with sentence splitting additionally includes the four HSplit corpora and the HSplit average scores."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-94",
"text": "For \"HSplit as Reference Setting\", we consider the outputs of six simplification systems whose main simplification operation is sentence splitting: DSS, DSS m , SEMoses, SEMoses m , SEMoses LM and SEMoses m LM , taken from (Sulem et al., 2018b) ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-95",
"text": "Human Evaluation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-96",
"text": "We use the evaluation benchmark provided by Sulem et al. (2018b), including system outputs and human evaluation scores corresponding to the first 70 sentences of the test corpus of Xu et al. (2016), and extend it to apply to HSplit as well."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-97",
"text": "The evaluation of HSplit is carried out by 3 in-house native English annotators, who rated the different input-output pairs for the different systems according to 4 parameters: Grammaticality (G), Meaning preservation (M), Simplicity (S) and Structural Simplicity (StS)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-98",
"text": "G and M are measured using a 1 to 5 scale."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-99",
"text": "A -2 to +2 scale is used for measuring simplicity and structural simplicity."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-100",
"text": "For computing the inter-annotator agreement of the whole benchmark (including the system outputs and the HSplit corpora), we follow Pavlick and Tetreault (2016) and randomly select, for each sentence, one annotator's rating to be the rating of Annotator 1 and the rounded average rating of the two other annotators to be the rating of Annotator 2."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-101",
"text": "We then compute weighted quadratic \u03ba (Cohen, 1968) between Annotator 1 and 2."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-102",
"text": "Repeating this process 1000 times, the obtained medians and 95% confidence intervals are 0.42 \u00b1 0.002 for G, 0.77 \u00b1 0.001 for M and 0.59 \u00b1 0.002 for S and StS."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-104",
"text": "**RESULTS WITH STANDARD REFERENCE SETTING**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-105",
"text": "Description of the Human Evaluation Scores."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-106",
"text": "The human evaluation scores for each parameter are obtained by averaging over the 3 annotators."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-107",
"text": "The scores at the system level are obtained by averaging over the 70 sentences."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-108",
"text": "In the \"All systems/corpora\" case of the \"Standard Reference Setting\", where 12 systems/corpora are considered, the range of the average G scores at the system level is from 3.71 to 4.80 (\u03c3 = 0.29)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-109",
"text": "For M, this max-min difference between the systems is 1.23 (\u03c3=0.40)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-110",
"text": "For S and StS, the differences are 0.53 (\u03c3 = 0.17) and 0.65 (\u03c3 = 0.20)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-111",
"text": "At the sentence level, considering 840 sentences (70 for each of the system/corpora), the G and M scores vary from 1 to 5 (\u03c3 equals 0.69 and 0.85 respectively), and the S and StS scores from -1 to 2 (\u03c3 equals 0.53 and 0.50)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-112",
"text": "In the \"Systems/corpora without Splits\" case of the \"Standard Reference Setting\", where 7 systems/corpora are considered, the max-min differences at the system level are 1.09 (\u03c3 = 0.36) and 1.23 (\u03c3 = 0.47) for G and M respectively."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-113",
"text": "For S and StS, the differences are 0.45 and 0.49 (\u03c3 = 0.18)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-114",
"text": "At the sentence level, considering 490 sentences (70 for each of the system/corpora), the G and M scores vary from 1 to 5 (\u03c3 equals 0.78 and 1.01 respectively), and the S and StS scores from -1 to 2 (\u03c3 equals 0.51 and 0.46)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-115",
"text": "Comparing HSplit to Identity."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-116",
"text": "Comparing the BLEU score on the input (the identity function) and on the HSplit corpora, we observe that the former yields much higher BLEU scores."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-117",
"text": "Indeed, BLEU-1ref obtains 59.85 for the input and 43.90 for the HSplit corpora (averaged over the 4 HSplit corpora)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-118",
"text": "BLEU-8ref obtains 94.63 for the input and 73.03 for HSplit."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-119",
"text": "The high scores obtained for Identity, also observed by Xu et al. (2016), indicate that BLEU is not a good predictor of simplicity relative to the input."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-120",
"text": "The drop in the BLEU scores for HSplit is not reflected by the human evaluation scores for grammaticality (4.43 for AvgHSplit vs. 4.80 for Identity) and meaning preservation (4.70 vs. 5.00), where the decrease between Identity and HSplit is much more limited."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-121",
"text": "To examine these tendencies in more detail, we compute the correlations between the automatic metrics and the human evaluation scores."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-122",
"text": "They are described in the following paragraph."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-123",
"text": "Correlation with Human Evaluation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-124",
"text": "The system-level Spearman correlations between the rankings of the automatic metrics and the human judgments (see \u00a74.1) are presented in Table 2 ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-125",
"text": "We find that in all cases BLEU and iBLEU negatively correlate with S and StS, indicating that they fail to capture simplicity and structural simplicity."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-126",
"text": "Where gold standard splits are evaluated as well, BLEU's and iBLEU's failure to capture StS is even more pronounced."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-127",
"text": "Moreover, BLEU's correlation with G and M in this case disappears."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-128",
"text": "In fact, BLEU's correlation with M in this case is considerably lower than that of -LD_SC and its correlation with G is comparable, suggesting BLEU is inadequate even as a measure of G and M if splitting is involved."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-129",
"text": "We examine the possibility that BLEU mostly acts as a measure of conservatism, and compute the Spearman correlation between -LD_SC and BLEU."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-130",
"text": "The high correlations we obtain between the metrics indicate that this may be the case."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-131",
"text": "Specifically, BLEU-1ref obtains correlations of 0.86 (p = 7 \u00d7 10^-3) without splits and of 0.52 (p = 0.04) where splitting is involved."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-132",
"text": "BLEU-8ref obtains 0.82 (p = 0.01) and 0.55 (p = 0.03)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-133",
"text": "SARI obtains positive correlations with S, of 0.52 (without splits) and 0.26 (all systems/corpora), but correlates with StS in neither setting."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-134",
"text": "This may stem from SARI's focus on lexical, rather than structural TS."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-135",
"text": "Similar trends are observed in the sentence-level correlation for S, StS and M, whereas G sometimes benefits at the sentence level from including HSplit in the evaluation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-136",
"text": "For G and M, the correlation with BLEU is lower than its correlation with -LD_SC in both cases."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-137",
"text": "Table 3 : Sentence-level Spearman correlation (and p-values) between the automatic metrics and the human ratings for \"HSplit as Reference Setting\"."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-138",
"text": "* p < 10^-5."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-140",
"text": "**RESULTS WITH HSPLIT AS REFERENCE SETTING**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-141",
"text": "We turn to examining whether BLEU may be adapted to address sentence splitting, if provided with references that include splittings."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-142",
"text": "Description of the Human Evaluation Scores."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-143",
"text": "In the \"HSplit as Reference Setting\", where 6 systems are considered, the max-min difference at the system level is 0.16 (\u03c3 = 0.06) for G, 0.37 for M (\u03c3 = 0.15), and 0.41 for S and StS (\u03c3 equals 0.20 and 0.19 respectively)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-144",
"text": "At the sentence level, considering 420 sentences (70 for each of the systems), the G and M scores vary from 1 to 5 (\u03c3 equals 0.99 and 0.88 respectively), and the S and StS scores from -2 to 2 (\u03c3 equals 0.63)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-145",
"text": "Correlation with Human Evaluation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-146",
"text": "On the system-level Spearman correlation between BLEU and human judgments, we find that while correlation with G is high (0.57, p = 0.1), it is low for M (0.11, p = 0.4), and negative for S (-0.70, p = 0.06) and StS (-0.60, p = 0.1)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-147",
"text": "Sentence-level correlations of BLEU and iBLEU are positive, but they are lower than those obtained by -LD_SC."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-148",
"text": "See Table 3 ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-149",
"text": "To recap, results in this section demonstrate that even when evaluated against references that focus on sentence splitting, BLEU fails to capture the simplicity and structural simplicity of the output."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-151",
"text": "**CONCLUSION**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-152",
"text": "In this paper we argued that BLEU is not suitable for TS evaluation, showing that (1) BLEU negatively correlates with simplicity, and that (2) even as a measure of grammaticality or meaning preservation it is comparable to, or worse than, -LD_SC, which requires no references."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-153",
"text": "Our findings suggest that BLEU should not be used for the evaluation of TS in general and sentence splitting in particular, and motivate the development of alternative methods for structural TS evaluation, such as (Sulem et al., 2018a) ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-18",
"text": "Sentence splitting is performed by many TS systems (Zhu et al., 2010; Woodsend and Lapata, 2011; Siddharthan and Angrosh, 2014; Gardent, 2014, 2016) ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-19",
"text": "For example, 63% and 80% of the test sentences are split by the systems of Woodsend and Lapata (2011) and Zhu et al. (2010) , respectively (Narayan and Gardent, 2016) ."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-20",
"text": "Sentence splitting is also the focus of the recently proposed Split-and-Rephrase sub-task (Narayan et al., 2017; Aharoni and Goldberg, 2018), in which the automatic metric used is BLEU."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-21",
"text": "For exploring the effect of sentence splitting on BLEU scores, we compile a human-generated gold standard sentence splitting corpus, HSplit, which will also be useful for future studies of splitting in TS, and perform correlation analyses with human judgments."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-22",
"text": "We consider two reference sets."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-23",
"text": "First, we experiment with the most common set, proposed by Xu et al. (2016) , evaluating a variety of system outputs, as well as HSplit."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-24",
"text": "The references in this setting explicitly emphasize lexical operations, and do not contain splitting or content deletion."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-25",
"text": "Second, we experiment with HSplit as the reference set, evaluating systems that focus on sentence splitting."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-26",
"text": "The first setting allows assessing whether BLEU with the standard reference set is a reliable metric on systems that perform splitting."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-27",
"text": "The second allows assessing whether BLEU can be adapted to evaluate splitting, given a reference set so oriented."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-28",
"text": "We find that BLEU is often negatively correlated with simplicity, even when evaluating outputs without splitting, and that when evaluating outputs with splitting, it is less reliable than a simple measure of similarity to the source ( \u00a74.2)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-29",
"text": "Moreover, we show that BLEU cannot be adapted to assess sentence splitting, even where the reference set focuses on this operation ( \u00a74.3)."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-30",
"text": "We conclude that BLEU is not informative and is often misleading for TS evaluation and for the related Split and Rephrase task."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-32",
"text": "**RELATED WORK**"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-33",
"text": "The BLEU Metric."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-34",
"text": "BLEU (Papineni et al., 2002) is reference-based, where the use of multiple references is used to address cross-reference variation."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-35",
"text": "To address changes in word order, BLEU uses n-gram precision, modified to eliminate repetitions across the references."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-36",
"text": "A brevity term penalizes overly short sentences."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-37",
"text": "Formally:"
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-38",
"text": "where BP is the brevity penalty term, p n are the modified precisions, and w n are the corresponding weights, which are usually uniform in practice."
},
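The terms just defined (modified n-gram precisions p_n, uniform weights w_n, and the brevity penalty BP) can be made concrete with a short sketch. This is an illustrative implementation, not code from the paper; the function names and the closest-reference-length tie-breaking rule are our own assumptions:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All n-grams of the token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Candidate n-gram counts are clipped by the maximum count observed
    # in any single reference, eliminating credit for repetitions.
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    # BLEU = BP * exp(sum_n w_n log p_n), with uniform weights w_n = 1/N.
    weights = [1.0 / max_n] * max_n
    ps = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(ps) == 0.0:
        return 0.0
    # Brevity penalty: overly short candidates are penalized relative
    # to the reference whose length is closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, ps)))
```

A candidate identical to a reference scores 1.0; a candidate sharing no n-grams with any reference scores 0.0.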
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-39",
"text": "The experiments of Papineni et al. (2002) showed that BLEU correlates with human judgments in the ranking of five English-to-Chinese MT systems and that it can distinguish human and machine translations."
},
{
"sent_id": "91e869971f139d90e36f73b1089877-C001-40",
"text": "Although BLEU is widely used in MT, several works have pointed out its shortcomings (e.g., Koehn and Monz, 2006) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"91e869971f139d90e36f73b1089877-C001-11"
],
[
"91e869971f139d90e36f73b1089877-C001-13"
],
[
"91e869971f139d90e36f73b1089877-C001-78"
]
],
"cite_sentences": [
"91e869971f139d90e36f73b1089877-C001-11",
"91e869971f139d90e36f73b1089877-C001-13",
"91e869971f139d90e36f73b1089877-C001-78"
]
},
"@USE@": {
"gold_contexts": [
[
"91e869971f139d90e36f73b1089877-C001-23"
],
[
"91e869971f139d90e36f73b1089877-C001-56"
],
[
"91e869971f139d90e36f73b1089877-C001-92"
],
[
"91e869971f139d90e36f73b1089877-C001-87"
],
[
"91e869971f139d90e36f73b1089877-C001-96"
]
],
"cite_sentences": [
"91e869971f139d90e36f73b1089877-C001-23",
"91e869971f139d90e36f73b1089877-C001-56",
"91e869971f139d90e36f73b1089877-C001-92",
"91e869971f139d90e36f73b1089877-C001-87",
"91e869971f139d90e36f73b1089877-C001-96"
]
},
"@MOT@": {
"gold_contexts": [
[
"91e869971f139d90e36f73b1089877-C001-46"
]
],
"cite_sentences": [
"91e869971f139d90e36f73b1089877-C001-46"
]
},
"@EXT@": {
"gold_contexts": [
[
"91e869971f139d90e36f73b1089877-C001-57",
"91e869971f139d90e36f73b1089877-C001-58"
]
],
"cite_sentences": [
"91e869971f139d90e36f73b1089877-C001-58"
]
},
"@SIM@": {
"gold_contexts": [
[
"91e869971f139d90e36f73b1089877-C001-119"
]
],
"cite_sentences": [
"91e869971f139d90e36f73b1089877-C001-119"
]
}
}
},
"ABC_53ed85f4bfa634656062ad6ba342d2_7": {
"x": [
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-109",
"text": "**COMPLETE MATCH (CM):**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-2",
"text": "This paper presents a deterministic dependency parser based on memory-based learning, which parses English text in linear time."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-3",
"text": "When trained and evaluated on the Wall Street Journal section of the Penn Treebank, the parser achieves a maximum attachment score of 87.1%."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-4",
"text": "Unlike most previous systems, the parser produces labeled dependency graphs, using as arc labels a combination of bracket labels and grammatical role labels taken from the Penn Treebank II annotation scheme."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-5",
"text": "The best overall accuracy obtained for identifying both the correct head and the correct arc label is 86.0%, when restricted to grammatical role labels (7 labels), and 84.4% for the maximum set (50 labels)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-8",
"text": "There has been a steadily increasing interest in syntactic parsing based on dependency analysis in recent years."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-9",
"text": "One important reason seems to be that dependency parsing offers a good compromise between the conflicting demands of analysis depth, on the one hand, and robustness and efficiency, on the other."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-10",
"text": "Thus, whereas a complete dependency structure provides a fully disambiguated analysis of a sentence, this analysis is typically less complex than in frameworks based on constituent analysis and can therefore often be computed deterministically with reasonable accuracy."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-11",
"text": "Deterministic methods for dependency parsing have now been applied to a variety of languages, including Japanese (Kudo and Matsumoto, 2000) , English (Yamada and Matsumoto, 2003) , Turkish (Oflazer, 2003) , and Swedish (Nivre et al., 2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-12",
"text": "For English, the interest in dependency parsing has been weaker than for other languages."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-13",
"text": "To some extent, this can probably be explained by the strong tradition of constituent analysis in Anglo-American linguistics, but this trend has been reinforced by the fact that the major treebank of American English, the Penn Treebank (Marcus et al., 1993) , is annotated primarily with constituent analysis."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-81",
"text": "For more information about the different parameters and settings, see Daelemans et al. (2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-14",
"text": "On the other hand, the best available parsers trained on the Penn Treebank, those of Collins (1997) and Charniak (2000) , use statistical models for disambiguation that make crucial use of dependency relations."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-15",
"text": "Moreover, the deterministic dependency parser of Yamada and Matsumoto (2003) , when trained on the Penn Treebank, gives a dependency accuracy that is almost as good as that of Collins (1997) and Charniak (2000) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-16",
"text": "The parser described in this paper is similar to that of Yamada and Matsumoto (2003) in that it uses a deterministic parsing algorithm in combination with a classifier induced from a treebank."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-17",
"text": "However, there are also important differences between the two approaches."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-18",
"text": "First of all, whereas Yamada and Matsumoto employs a strict bottom-up algorithm (essentially shift-reduce parsing) with multiple passes over the input, the present parser uses the algorithm proposed in Nivre (2003) , which combines bottomup and top-down processing in a single pass in order to achieve incrementality."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-19",
"text": "This also means that the time complexity of the algorithm used here is linear in the size of the input, while the algorithm of Yamada and Matsumoto is quadratic in the worst case."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-20",
"text": "Another difference is that Yamada and Matsumoto use support vector machines (Vapnik, 1995) , while we instead rely on memory-based learning (Daelemans, 1999) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-21",
"text": "Most importantly, however, the parser presented in this paper constructs labeled dependency graphs, i.e. dependency graphs where arcs are labeled with dependency types."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-22",
"text": "As far as we know, this makes it different from all previous systems for dependency parsing applied to the Penn Treebank (Eisner, 1996; Yamada and Matsumoto, 2003) , although there are systems that extract labeled grammatical relations based on shallow parsing, e.g. Buchholz (2002) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-23",
"text": "The fact that we are working with labeled dependency graphs is also one of the motivations for choosing memory-based learning over support vector machines, since we require a multi-class classifier."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-24",
"text": "Even though it is possible to use SVM for multi-class classification, this can get cumbersome when the number of classes is large."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-25",
"text": "The parsing methodology investigated here has previously been applied to Swedish, where promising results were obtained with a relatively small treebank (approximately 5000 sentences for training), resulting in an attachment score of 84.7% and a labeled accuracy of 80.6% (Nivre et al., 2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-26",
"text": "1 However, since there are no comparable results available for Swedish, it is difficult to assess the significance of these findings, which is one of the reasons why we want to apply the method to a benchmark corpus such as the the Penn Treebank, even though the annotation in this corpus is not ideal for labeled dependency parsing."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-27",
"text": "The paper is structured as follows."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-28",
"text": "Section 2 describes the parsing algorithm, while section 3 explains how memory-based learning is used to guide the parser."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-29",
"text": "Experimental results are reported in section 4, and conclusions are stated in section 5."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-31",
"text": "**DETERMINISTIC DEPENDENCY PARSING**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-32",
"text": "In dependency parsing the goal of the parsing process is to construct a labeled dependency graph of the kind depicted in Figure 1 ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-33",
"text": "In formal terms, we define dependency graphs as follows:"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-34",
"text": "1. Let R = {r 1 , . . . , r m } be the set of permissible dependency types (arc labels)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-36",
"text": "**A DEPENDENCY GRAPH FOR A STRING OF WORDS**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-83",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-37",
"text": "(a) W is the set of nodes, i.e. word tokens in the input string, (b) A is a set of labeled arcs (w i , r, w j ) (w i , w j \u2208 W , r \u2208 R), (c) for every w j \u2208 W , there is at most one arc (w i , r, w j ) \u2208 A."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-38",
"text": "3."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-39",
"text": "A graph D = (W, A) is well-formed iff it is acyclic, projective and connected."
},
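The well-formedness conditions above (single head, acyclicity, projectivity, connectedness) can be checked mechanically. A minimal sketch, assuming arcs are triples `(head, label, dependent)` over words numbered 1..n; the function name and representation are illustrative, not the paper's:

```python
def is_well_formed(n_words, arcs):
    # Condition (c): every word has at most one head.
    heads = {}
    for h, _, d in arcs:
        if d in heads:
            return False
        heads[d] = h
    # Acyclicity: following head chains must never revisit a word.
    for w in range(1, n_words + 1):
        seen = set()
        while w in heads:
            if w in seen:
                return False
            seen.add(w)
            w = heads[w]
    # Projectivity (for this sketch): no two arcs may cross.
    spans = [tuple(sorted((h, d))) for h, _, d in arcs]
    for i, (a, b) in enumerate(spans):
        for c, d in spans[i + 1:]:
            if a < c < b < d or c < a < d < b:
                return False
    # Connectedness: a single-headed acyclic graph over n words is
    # connected exactly when it has n - 1 arcs (i.e. it is one tree).
    return len(arcs) == n_words - 1
```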
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-40",
"text": "For a more detailed discussion of dependency graphs and well-formedness conditions, the reader is referred to Nivre (2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-41",
"text": "The parsing algorithm used here was first defined for unlabeled dependency parsing in Nivre (2003) and subsequently extended to labeled graphs in Nivre et al. (2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-42",
"text": "Parser configurations are represented by triples S, I, A , where S is the stack (represented as a list), I is the list of (remaining) input tokens, and A is the (current) arc relation for the dependency graph."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-43",
"text": "(Since in a dependency graph the set of nodes is given by the input tokens, only the arcs need to be represented explicitly.) Given an input string W , the parser is initialized to nil, W, \u2205 2 and terminates when it reaches a configuration S, nil, A (for any list S and set of arcs A)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-44",
"text": "The input string W is accepted if the dependency graph D = (W, A) given at termination is well-formed; otherwise W is rejected."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-45",
"text": "Given an arbitrary configuration of the parser, there are four possible transitions to the next configuration (where t is the token on top of the stack, n is the next input token, w is any word, and r, r \u2208 R):"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-46",
"text": "there is no arc (w, r, t) \u2208 A, extend A with (n, r , t) and pop the stack, giving the configuration S,n|I,A\u222a{(n, r , t)} ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-48",
"text": "**RIGHT-ARC:**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-49",
"text": "In a configuration t|S,n|I,A , if there is no arc (w, r, n) \u2208 A, extend A with (t, r , n) and push n onto the stack, giving the configuration n|t|S,I,A\u222a{(t, r , n)} ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-50",
"text": "After initialization, the parser is guaranteed to terminate after at most 2n transitions, given an input string of length n (Nivre, 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-51",
"text": "Moreover, the parser always constructs a dependency graph that is acyclic and projective."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-52",
"text": "This means that the dependency graph given at termination is well-formed if and only if it is connected (Nivre, 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-53",
"text": "Otherwise, it is a set of connected components, each of which is a well-formed dependency graph for a substring of the original input."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-54",
"text": "The transition system defined above is nondeterministic in itself, since several transitions can often be applied in a given configuration."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-55",
"text": "To construct deterministic parsers based on this system, we use classifiers trained on treebank data in order to predict the next transition (and dependency type) given the current configuration of the parser."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-56",
"text": "In this way, our approach can be seen as a form of history-based parsing (Black et al., 1992; Magerman, 1995) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-57",
"text": "In the experiments reported here, we use memory-based learning to train our classifiers."
},
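The classifier-guided, single-pass parsing loop described above might be sketched as follows. This is a hypothetical rendering, not the paper's implementation: `guide` is an assumed stand-in for the trained classifier, returning one of the four actions ("LA", "RA", "RE", "SH") plus a dependency type where applicable:

```python
def parse(tokens, guide):
    # Deterministic one-pass parse in the style of Nivre (2003).
    # Arcs are (head, label, dependent) triples.
    stack, buf, arcs = [], list(tokens), []
    headed = set()  # words that already have a head
    while buf:
        action = guide(stack, buf, arcs)
        if action[0] == "LA" and stack and stack[-1] not in headed:
            # Left-Arc: next input token becomes head of the stack top.
            t = stack.pop()
            arcs.append((buf[0], action[1], t))
            headed.add(t)
        elif action[0] == "RA" and stack and buf[0] not in headed:
            # Right-Arc: stack top becomes head of next input; push it.
            arcs.append((stack[-1], action[1], buf[0]))
            headed.add(buf[0])
            stack.append(buf.pop(0))
        elif action[0] == "RE" and stack and stack[-1] in headed:
            # Reduce: pop a word that already has a head.
            stack.pop()
        else:
            # Shift (also the fallback when a transition is inapplicable).
            stack.append(buf.pop(0))
    return arcs
```

Each iteration either consumes an input token or pops the stack, which is consistent with the 2n-transition bound cited above.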
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-59",
"text": "**MEMORY-BASED LEARNING**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-60",
"text": "Memory-based learning and problem solving is based on two fundamental principles: learning is the simple storage of experiences in memory, and solving a new problem is achieved by reusing solutions from similar previously solved problems (Daelemans, 1999) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-61",
"text": "It is inspired by the nearest neighbor approach in statistical pattern recognition and artificial intelligence (Fix and Hodges, 1952) , as well as the analogical modeling approach in linguistics (Skousen, 1989; Skousen, 1992) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-62",
"text": "In machine learning terms, it can be characterized as a lazy learning method, since it defers processing of input until needed and processes input by combining stored data (Aha, 1997) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-63",
"text": "Memory-based learning has been successfully applied to a number of problems in natural language processing, such as grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking (Daelemans et al., 2002) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-64",
"text": "Previous work on memory-based learning for deterministic parsing includes Veenstra and Daelemans (2000) and Nivre et al. (2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-65",
"text": "For the experiments reported in this paper, we have used the software package TiMBL (Tilburg Memory Based Learner), which provides a variety of metrics, algorithms, and extra functions on top of the classical k nearest neighbor classification kernel, such as value distance metrics and distance weighted class voting (Daelemans et al., 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-66",
"text": "The function we want to approximate is a mapping f from configurations to parser actions, where each action consists of a transition and (except for Shift and Reduce) a dependency type:"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-67",
"text": "Here Config is the set of all configurations and R is the set of dependency types."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-68",
"text": "In order to make the problem tractable, we approximate f with a functionf whose domain is a finite space of parser states, which are abstractions over configurations."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-69",
"text": "For this purpose we define a number of features that can be used to define different models of parser state."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-70",
"text": "Figure 2 illustrates the features that are used to define parser states in the present study."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-71",
"text": "The two central elements in any configuration are the token on top of the stack (T) and the next input token (N), the tokens which may be connected by a dependency arc in the next configuration."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-72",
"text": "For these tokens, we consider both the word form (T.LEX, N.LEX) and the part-of-speech (T.POS, N.POS), as assigned by an automatic part-of-speech tagger in a preprocessing phase."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-73",
"text": "Next, we consider a selection of dependencies that may be present in the current arc relation, namely those linking T to its head (TH) and its leftmost and rightmost dependent (TL, TR), and that linking N to its leftmost dependent (NL), 3 considering both the dependency type (arc label) and the part-of-speech of the head or dependent."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-74",
"text": "Finally, we use a lookahead of three tokens, considering only their parts-of-speech."
},
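The features just listed (T.LEX, T.POS, N.LEX, N.POS, the TH/TL/TR/NL context, and the three-token POS lookahead) could be rendered as a feature tuple roughly like this. The helper names, dictionary representation, and the `_` placeholder for absent values are assumptions for illustration:

```python
def extract_state(stack, buf, arcs, pos, lex):
    # Model-1-style state features for a configuration; `pos` and `lex`
    # map token ids to part-of-speech tag and word form respectively.
    NA = "_"  # placeholder for absent values
    t = stack[-1] if stack else None
    n = buf[0] if buf else None
    def head_of(w):
        return next((h for h, _, d in arcs if d == w), None)
    def label_to(w):
        return next((r for _, r, d in arcs if d == w), NA)
    def leftmost_dep(w):
        ds = [d for h, _, d in arcs if h == w]
        return min(ds) if ds else None
    def rightmost_dep(w):
        ds = [d for h, _, d in arcs if h == w]
        return max(ds) if ds else None
    th, tl, tr, nl = head_of(t), leftmost_dep(t), rightmost_dep(t), leftmost_dep(n)
    # Lookahead of three tokens beyond N, parts-of-speech only.
    look = [pos.get(w, NA) for w in buf[1:4]] + [NA] * max(0, 3 - len(buf[1:4]))
    return (
        lex.get(t, NA), pos.get(t, NA),   # T.LEX, T.POS
        lex.get(n, NA), pos.get(n, NA),   # N.LEX, N.POS
        label_to(t), pos.get(th, NA),     # TH: dep type of T, POS of its head
        pos.get(tl, NA), pos.get(tr, NA), # TL, TR parts-of-speech
        pos.get(nl, NA),                  # NL part-of-speech
        *look,                            # three-token POS lookahead
    )
```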
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-75",
"text": "We have experimented with two different state models, one that incorporates all the features depicted in Figure 2 (Model 1), and one that excludes the parts-of-speech of TH, TL, TR, NL (Model 2)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-76",
"text": "Models similar to model 2 have been found to work well for datasets with a rich annotation of dependency types, such as the Swedish dependency treebank derived from Einarsson (1976) , where the extra part-of-speech features are largely redundant (Nivre et al., 2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-77",
"text": "Model 1 can be expected to work better for datasets with less informative dependency annotation, such as dependency trees extracted from the Penn Treebank, where the extra part-of-speech features may compensate for the lack of information in arc labels."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-78",
"text": "The learning algorithm used is the IB1 algorithm (Aha et al., 1991 ) with k = 5, i.e. classification based on 5 nearest neighbors."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-79",
"text": "4 Distances are measured using the modified value difference metric (MVDM) (Stanfill and Waltz, 1986; Cost and Salzberg, 1993) for instances with a frequency of at least 3 (and the simple overlap metric otherwise), and classification is based on distance weighted class voting with inverse distance weighting (Dudani, 1976) ."
},
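A minimal sketch of the classification scheme just described: k = 5 nearest neighbours with inverse distance weighted class voting, using only the simple overlap metric (MVDM and the frequency threshold are omitted here for brevity; `knn_classify` is an assumed name, not TiMBL's API):

```python
from collections import defaultdict

def knn_classify(instance, training, k=5, eps=1e-6):
    # IB1-style memory-based classification: store all training
    # instances, classify by the k most similar ones.
    def overlap(a, b):
        # Simple overlap metric: number of mismatching feature values.
        return sum(1 for x, y in zip(a, b) if x != y)
    neighbours = sorted(training, key=lambda ex: overlap(instance, ex[0]))[:k]
    # Inverse distance weighted class voting (Dudani-style).
    votes = defaultdict(float)
    for feats, label in neighbours:
        votes[label] += 1.0 / (overlap(instance, feats) + eps)
    return max(votes, key=votes.get)
```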
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-80",
"text": "These settings are the result of extensive experiments partially reported in Nivre et al. (2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-84",
"text": "The data set used for experimental evaluation is the standard data set from the Wall Street Journal section of the Penn Treebank, with sections 2-21 used for training and section 23 for testing (Collins, 1999; Charniak, 2000) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-85",
"text": "The data has been converted to dependency trees using head rules (Magerman, 1995; Collins, 1996) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-86",
"text": "We are grateful to Yamada and Matsumoto for letting us use their rule set, which is a slight modification of the rules used by Collins (1999) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-87",
"text": "This permits us to make exact comparisons with the parser of Yamada and Matsumoto (2003) , but also the parsers of Collins (1997) and Charniak (2000) , which are evaluated on the same data set in Yamada and Matsumoto (2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-88",
"text": "One problem that we had to face is that the standard conversion of phrase structure trees to dependency trees gives unlabeled dependency trees, whereas our parser requires labeled trees."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-89",
"text": "Since the annotation scheme of the Penn Treebank does not include dependency types, there is no straightforward way to derive such labels."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-90",
"text": "We have therefore experimented with two different sets of labels, none of which corresponds to dependency types in a strict sense."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-91",
"text": "The first set consists of the function tags for grammatical roles according to the Penn II annotation guidelines (Bies et al., 1995) ; we call this set G. The second set consists of the ordinary bracket labels (S, NP, VP, etc.), combined with function tags for grammatical roles, giving composite labels such as NP-SBJ; we call this set B. We assign labels to arcs by letting each (non-root) word that heads a phrase P in the original phrase structure have its incoming edge labeled with the label of P (modulo the set of labels used)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-92",
"text": "In both sets, we also include a default label DEP for arcs that would not otherwise get a label."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-93",
"text": "This gives a total of 7 labels in the G set and 50 labels in the B set."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-94",
"text": "Figure 1 shows a converted dependency tree using the B labels; in the corresponding tree with G labels NP-SBJ would be replaced by SBJ, ADVP and VP by DEP."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-95",
"text": "We use the following metrics for evaluation:"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-97",
"text": "**UNLABELED ATTACHMENT SCORE (UAS):**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-98",
"text": "The proportion of words that are assigned the correct head (or no head if the word is a root) (Eisner, 1996; ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-100",
"text": "**LABELED ATTACHMENT SCORE (LAS):**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-101",
"text": "The proportion of words that are assigned the correct head and dependency type (or no head if the word is a root) (Nivre et al., 2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-103",
"text": "**DEPENDENCY ACCURACY (DA):**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-104",
"text": "The proportion of non-root words that are assigned the correct head (Yamada and Matsumoto, 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-106",
"text": "**ROOT ACCURACY (RA):**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-107",
"text": "The proportion of root words that are analyzed as such (Yamada and Matsumoto, 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-110",
"text": "The proportion of sentences whose unlabeled dependency structure is completely correct (Yamada and Matsumoto, 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-111",
"text": "All metrics except CM are calculated as mean scores per word, and punctuation tokens are consistently excluded."
},
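For a single sentence, the five metrics defined above could be computed as follows; over a corpus, UAS/LAS/DA/RA are averaged per word and CM per sentence, as the text notes. The dictionary representation (word id to `(head, label)`, head 0 for roots) is an assumption:

```python
def evaluate(gold, pred, punct=frozenset()):
    # gold/pred: word id -> (head, label), head 0 marking a root;
    # punct: word ids of punctuation tokens, excluded throughout.
    words = [w for w in gold if w not in punct]
    # UAS: correct head (or correct rootness); LAS: head and label.
    uas = sum(pred[w][0] == gold[w][0] for w in words) / len(words)
    las = sum(pred[w] == gold[w] for w in words) / len(words)
    nonroot = [w for w in words if gold[w][0] != 0]
    roots = [w for w in words if gold[w][0] == 0]
    da = sum(pred[w][0] == gold[w][0] for w in nonroot) / len(nonroot)
    ra = sum(pred[w][0] == 0 for w in roots) / len(roots)
    # CM: the whole unlabeled structure is correct.
    cm = all(pred[w][0] == gold[w][0] for w in words)
    return {"UAS": uas, "LAS": las, "DA": da, "RA": ra, "CM": cm}
```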
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-112",
"text": "Table 1 shows the attachment score, both unlabeled and labeled, for the two different state models with the two different label sets."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-113",
"text": "First of all, we see that Model 1 gives better accuracy than Model 2 with the smaller label set G, which confirms our expectations that the added part-of-speech features are helpful when the dependency labels are less informative."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-114",
"text": "Conversely, we see that Model 2 outperforms Model 1 with the larger label set B, which is consistent with the hypothesis that part-of-speech features become redundant as dependency labels get more informative."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-115",
"text": "It is interesting to note that this effect holds even in the case where the dependency labels are mostly derived from phrase structure categories."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-116",
"text": "We can also see that the unlabeled attachment score improves, for both models, when the set of dependency labels is extended."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-117",
"text": "On the other hand, the labeled attachment score drops, but it must be remembered that these scores are not really comparable, since the number of classes in the classification problem increases from 7 to 50 as we move from the G set to the B set."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-118",
"text": "Therefore, we have also included the labeled attachment score restricted to the G set for the parser using the B set (BG), and we see then that the attachment score improves, especially for Model 2. (All differences are significant beyond the .01 level; McNemar's test.) Table 2 shows the dependency accuracy, root accuracy and complete match scores for our best parser (Model 2 with label set B) in comparison with Collins (1997) (Model 3) , Charniak (2000) , and Yamada and Matsumoto (2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-119",
"text": "5 It is clear that, with respect to unlabeled accuracy, our parser does not quite reach state-of-the-art performance, even if we limit the competition to deterministic methods such as that of Yamada and Matsumoto (2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-120",
"text": "We believe that there are mainly three reasons for this."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-121",
"text": "First of all, the part-of-speech tagger used for preprocessing in our experiments has a lower accuracy than the one used by Yamada and Matsumoto (2003) (96.1% vs. 97.1%) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-122",
"text": "Although this is not a very interesting explanation, it undoubtedly accounts for part of the difference."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-123",
"text": "Secondly, since 5 The information in the first three rows is taken directly from Yamada and Matsumoto (2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-124",
"text": "our parser makes crucial use of dependency type information in predicting the next action of the parser, it is very likely that it suffers from the lack of real dependency labels in the converted treebank."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-125",
"text": "Indirect support for this assumption can be gained from previous experiments with Swedish data, where almost the same accuracy (85% unlabeled attachment score) has been achieved with a treebank which is much smaller but which contains proper dependency annotation (Nivre et al., 2004) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-126",
"text": "A third important factor is the relatively low root accuracy of our parser, which may reflect a weakness in the one-pass parsing strategy with respect to the global structure of complex sentences."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-127",
"text": "It is noteworthy that our parser has lower root accuracy than dependency accuracy, whereas the inverse holds for all the other parsers."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-128",
"text": "The problem becomes even more visible when we consider the dependency and root accuracy for sentences of different lengths, as shown in Table 3 ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-129",
"text": "Here we see that for really short sentences (up to 10 words) root accuracy is indeed higher than dependency accuracy, but while dependency accuracy degrades gracefully with sentence length, the root accuracy drops more drastically (which also very clearly affects the complete match score)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-130",
"text": "This may be taken to suggest that some kind of preprocessing in the form of clausing may help to improve overall accuracy."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-131",
"text": "Turning finally to the assessment of labeled dependency accuracy, we are not aware of any strictly comparable results for the given data set, but Buchholz (2002) reports a labeled accuracy of 72.6% for the assignment of grammatical relations using a cascade of memory-based processors."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-132",
"text": "This can be compared with a labeled attachment score of 84.4% for Model 2 with our B set, which is of about the same size as the set used by Buchholz, although the labels are not the same."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-133",
"text": "In another study, Blaheta and Charniak (2000) report an F-measure of 98.9% for the assignment of Penn Treebank grammatical role labels (our G set) to phrases that were correctly parsed by the parser described in Charniak (2000) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-134",
"text": "If null labels (corresponding to our DEP labels) are excluded, the F-score drops to 95.7%."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-135",
"text": "The corresponding F-measures for our best parser (Model 2, BG) are 99.0% and 94.7%."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-136",
"text": "For the larger B set, our best parser achieves an F-measure of 96.9% (DEP labels included), which can be compared with 97.0% for a similar (but larger) set of labels in Collins (1999) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-137",
"text": "Although none of the previous results on labeling accuracy is strictly comparable to ours, it nevertheless seems fair to conclude that the labeling accuracy of the present parser is close to the state of the art, even if its capacity to derive correct structures is not (Yamada and Matsumoto, 2003)."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-138",
"text": "----------------------------------"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-139",
"text": "**CONCLUSION**"
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-140",
"text": "This paper has explored the application of a data-driven dependency parser to English text, using data from the Penn Treebank."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-141",
"text": "The parser is deterministic and uses a linear-time parsing algorithm, guided by memory-based classifiers, to construct labeled dependency structures incrementally in one pass over the input."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-142",
"text": "Given the difficulty of extracting labeled dependencies from a phrase structure treebank with limited functional annotation, the accuracy attained is fairly respectable. And although the structural accuracy falls short of the best available parsers, the labeling accuracy appears to be competitive."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-143",
"text": "The most important weakness is the limited accuracy in identifying the root node of a sentence, especially for longer sentences."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-144",
"text": "We conjecture that an improvement in this area could lead to a boost in overall performance."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-145",
"text": "Another important issue to investigate further is the influence of different kinds of arc labels, and in particular labels that are based on a proper dependency grammar."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-146",
"text": "In the future, we therefore want to perform more experiments with genuine dependency treebanks like the Prague Dependency Treebank (Hajic, 1998) and the Danish Dependency Treebank (Kromann, 2003) ."
},
{
"sent_id": "53ed85f4bfa634656062ad6ba342d2-C001-147",
"text": "We also want to apply dependency-based evaluation schemes such as the ones proposed by Lin (1998) and Carroll et al. (1998) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"53ed85f4bfa634656062ad6ba342d2-C001-11"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-15"
]
],
"cite_sentences": [
"53ed85f4bfa634656062ad6ba342d2-C001-11",
"53ed85f4bfa634656062ad6ba342d2-C001-15"
]
},
"@USE@": {
"gold_contexts": [
[
"53ed85f4bfa634656062ad6ba342d2-C001-16"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-86",
"53ed85f4bfa634656062ad6ba342d2-C001-87"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-104",
"53ed85f4bfa634656062ad6ba342d2-C001-107",
"53ed85f4bfa634656062ad6ba342d2-C001-110",
"53ed85f4bfa634656062ad6ba342d2-C001-95"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-118"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-123",
"53ed85f4bfa634656062ad6ba342d2-C001-124"
]
],
"cite_sentences": [
"53ed85f4bfa634656062ad6ba342d2-C001-16",
"53ed85f4bfa634656062ad6ba342d2-C001-87",
"53ed85f4bfa634656062ad6ba342d2-C001-104",
"53ed85f4bfa634656062ad6ba342d2-C001-107",
"53ed85f4bfa634656062ad6ba342d2-C001-110",
"53ed85f4bfa634656062ad6ba342d2-C001-118",
"53ed85f4bfa634656062ad6ba342d2-C001-123"
]
},
"@SIM@": {
"gold_contexts": [
[
"53ed85f4bfa634656062ad6ba342d2-C001-16"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-104",
"53ed85f4bfa634656062ad6ba342d2-C001-107",
"53ed85f4bfa634656062ad6ba342d2-C001-110",
"53ed85f4bfa634656062ad6ba342d2-C001-95"
]
],
"cite_sentences": [
"53ed85f4bfa634656062ad6ba342d2-C001-16",
"53ed85f4bfa634656062ad6ba342d2-C001-104",
"53ed85f4bfa634656062ad6ba342d2-C001-107",
"53ed85f4bfa634656062ad6ba342d2-C001-110"
]
},
"@DIF@": {
"gold_contexts": [
[
"53ed85f4bfa634656062ad6ba342d2-C001-16",
"53ed85f4bfa634656062ad6ba342d2-C001-17",
"53ed85f4bfa634656062ad6ba342d2-C001-18"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-22"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-119"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-121"
],
[
"53ed85f4bfa634656062ad6ba342d2-C001-137"
]
],
"cite_sentences": [
"53ed85f4bfa634656062ad6ba342d2-C001-16",
"53ed85f4bfa634656062ad6ba342d2-C001-22",
"53ed85f4bfa634656062ad6ba342d2-C001-119",
"53ed85f4bfa634656062ad6ba342d2-C001-121",
"53ed85f4bfa634656062ad6ba342d2-C001-137"
]
},
"@EXT@": {
"gold_contexts": [
[
"53ed85f4bfa634656062ad6ba342d2-C001-16",
"53ed85f4bfa634656062ad6ba342d2-C001-17",
"53ed85f4bfa634656062ad6ba342d2-C001-18"
]
],
"cite_sentences": [
"53ed85f4bfa634656062ad6ba342d2-C001-16"
]
}
}
},
"ABC_bb74dd634a8fc5cdb2f4f3294b6bc5_7": {
"x": [
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-2",
"text": "Lifelong machine learning is a novel machine learning paradigm that continually accumulates knowledge during learning."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-3",
"text": "Its knowledge extraction and reuse abilities enable lifelong machine learning to solve related problems."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-4",
"text": "Traditional approaches such as Na\u00efve Bayes and some neural-network-based approaches aim only to achieve the best performance on a single task."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-5",
"text": "Unlike them, the lifelong machine learning approach in this paper focuses on how to accumulate knowledge during learning and leverage it for further tasks."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-6",
"text": "Meanwhile, the demand for labeled training data is also significantly decreased through knowledge reuse."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-7",
"text": "This paper suggests that the aim of lifelong learning is to use less labeled data and lower computational cost to achieve performance as good as or even better than supervised learning."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-8",
"text": "\u2022 Computing methodologies \u2192 Theory of mind."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-11",
"text": "Over the past 30 years, machine learning has achieved significant development."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-12",
"text": "However, we are still in an era of \"Weak AI\" rather than \"Strong AI\"."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-13",
"text": "Current machine learning algorithms only know how to solve a specific problem, but have no idea what to do when they encounter related problems."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-14",
"text": "Hence, lifelong machine learning (referred to below simply as lifelong learning or \"LML\") [8] was proposed to solve an infinite sequence of related tasks through knowledge accumulation and reuse."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-15",
"text": "For related problems, an integrated model with knowledge reuse can decrease the cost of sample annotation."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-16",
"text": "For instance, in sentiment classification we need to predict the sentiment (positive or negative) of a sentence or a document."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-17",
"text": "For different sentiment classification tasks, traditional approaches need to train an independent model on each domain to obtain the best performance."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-18",
"text": "Hence, for each domain we need to collect labeled data for supervised learning."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-19",
"text": "In this way, the algorithm will never know how to solve a problem without new labeled data."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-20",
"text": "This is typical \"weak AI\"."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-21",
"text": "To achieve the goal of \"strong AI\", we need to change our learning goal to truly understanding the sentiment of words."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-22",
"text": "This means that the algorithm should know how each word influences the sentiment of a document in different tasks."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-23",
"text": "If we can achieve this learning goal, the algorithms will be able to solve new tasks without being taught."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-24",
"text": "Zhiyuan Chen et al."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-25",
"text": "[2] proposed an approach to get closer to this goal."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-26",
"text": "They made big progress, but supervised learning is still needed."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-27",
"text": "Guangyi Lv et al."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-28",
"text": "[4] extended the work of [2] with a neural-network-based approach."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-29",
"text": "However, supervised learning is still necessary under their setting, and a huge volume of labeled data is required."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-30",
"text": "Hence, this paper aims to decrease the usage of labeled data while maintaining performance."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-32",
"text": "**LIFELONG MACHINE LEARNING**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-33",
"text": "It was first called lifelong machine learning in 1995 by Thrun [7, 9] ."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-34",
"text": "Efficient Lifelong Machine Learning (ELLA) [6] was proposed by Ruvolo and Eaton."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-35",
"text": "Compared with multi-task learning [1] , ELLA is much more efficient."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-36",
"text": "Zhiyuan Chen et al."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-37",
"text": "[2] improved sentiment classification by incorporating knowledge."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-38",
"text": "The objective function was modified with two penalty terms corresponding to previous tasks."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-39",
"text": "The knowledge system contains the following components:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-41",
"text": "**COMPONENTS OF LML**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-42",
"text": "\u2022 Knowledge Base (KB): The Knowledge Base [2] is mainly used to maintain previous knowledge."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-43",
"text": "Based on the type of knowledge, it could be divided as Past Information Store (PIS), Meta-Knowledge Miner (MKM) and Meta-Knowledge Store (MKS)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-44",
"text": "\u2022 Knowledge Reasoner (KR): The knowledge reasoner is designed to generate new knowledge from the archived knowledge by logical inference."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-45",
"text": "A strict logical design is required, so most LML algorithms lack this component."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-46",
"text": "\u2022 Knowledge-Base Learner (KBL): The Knowledge-Based Learner [2] aims to retrieve and transfer previous knowledge to the current task."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-47",
"text": "Hence, it contains two parts: a task knowledge miner and a learner."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-48",
"text": "The miner seeks and determines which knowledge could be reused, and the learner transfers such knowledge to the current task."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-50",
"text": "**SENTIMENT CLASSIFICATION**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-51",
"text": "Hong et al. [3] discussed that the NLP field is the most suitable for lifelong machine learning research because its knowledge is easy to extract and easy for humans to understand."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-52",
"text": "The classical previous paper [2] chose sentiment classification as the learning target because it can be regarded as one large task as well as a group of related sub-tasks in different domains."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-53",
"text": "Although these sub-tasks are related to each other, a model trained on only a single sub-task is unable to perform well on the remaining sub-tasks."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-54",
"text": "This requires that the algorithm know when the knowledge can be used and when it cannot, since the distribution of each sub-task is different."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-55",
"text": "Knowing this, an algorithm can be called \"lifelong\" because it is able to transfer previous knowledge to new tasks to improve performance."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-56",
"text": "Although deep learning has already been applied to sentiment classification, it still cannot leverage past knowledge well."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-57",
"text": "This is because the complexity of neural networks limits researchers' ability to define and extract knowledge from the data."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-58",
"text": "As in the previous work [2] , this paper also uses Na\u00efve Bayes, since the knowledge can be represented by probabilities."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-59",
"text": "In this way, we need to know the probability of each word appearing in positive or negative content."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-60",
"text": "We also need to be aware that some words may only have sentiment polarity in specific domains (equivalent to tasks in this paper)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-61",
"text": "\"Lifelong Sentiment Classification\" (\"LSC\" below) [2] records in which domains a word has sentiment orientation."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-62",
"text": "If a word always has sentiment polarity, or has significant polarity in the current domain, it is assigned a higher weight than other words."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-63",
"text": "This approach contains a knowledge transfer operation and a knowledge validation operation."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-65",
"text": "**CONTRIBUTION OF THIS PAPER**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-66",
"text": "Although LSC [2] already proposed a lifelong approach, it aims only to improve classification accuracy."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-67",
"text": "It is still under the supervised learning setting and is unable to deliver explicit knowledge to guide further learning."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-68",
"text": "Based on LSC, this paper advances lifelong learning in sentiment classification and has two main contributions:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-69",
"text": "\u2022 An improved lifelong learning paradigm is proposed to solve the sentiment classification problem under an unsupervised learning setting with previous knowledge."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-70",
"text": "\u2022 We introduce a novel approach to discover and store the words with sentiment polarity for reuse."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-72",
"text": "**SENTIMENT POLARITY WORDS 4.1 NA\u00cfVE BAYESIAN TEXT CLASSIFICATION**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-73",
"text": "In this paper, we determine whether a word has sentiment polarity by calculating the probability that it appears in positive or negative content (a sentence or document)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-74",
"text": "If a word has a high probability of sentiment polarity, it will also lead the document to have a higher sentiment probability according to the Na\u00efve Bayesian (NB) formula."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-75",
"text": "Hence, determining the words with polarity is the key to predicting sentiment."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-76",
"text": "The Na\u00efve Bayesian (NB) classifier [5] calculates the probability of each word w in a document d and then predicts the sentiment polarity (positive or negative)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-77",
"text": "We use the same formulas below as in LSC [2] ."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-78",
"text": "P(w|c_j) is the probability that a word appears in a class: P(w|c_j) = (\u03bb + N_{c_j,w}) / (\u03bb|V| + \u03a3_{v\u2208V} N_{c_j,v})."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-79",
"text": "Where c_j is either positive (+) or negative (-) sentiment polarity."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-80",
"text": "N_{c_j,w} is the frequency of a word w in documents of class c_j."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-81",
"text": "|V| is the size of the vocabulary V, and \u03bb (0 \u2a7d \u03bb \u2a7d 1) is used for smoothing (set to 1 for Laplace smoothing in this paper)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-82",
"text": "Given a document, we can calculate its probability for the different classes by: P(c_j|d_i) \u221d P(c_j) \u03a0_{w\u2208d_i} P(w|c_j)^{n_{w,d_i}}."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-83",
"text": "Where d_i is the given document and n_{w,d_i} is the frequency with which word w appears in the document."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-84",
"text": "To predict the class of a document, we only need to calculate P(c_+|d_i) \u2212 P(c_-|d_i)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-85",
"text": "If the difference is larger than 0, the document should be predicted as having positive polarity:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-86",
"text": "As we only need to know whether P(c_+|d_i) \u2212 P(c_-|d_i) is larger than 0, the formula can be simplified to:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-88",
"text": "**DISCOVER WORDS WITH SENTIMENT POLARITY**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-89",
"text": "Ideally, if we know P(c_+), P(c_-) and P(w|c_j) for all words, we can predict the sentiment polarity of all documents."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-90",
"text": "However, the above three key quantities differ across domains."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-91",
"text": "LSC [2] proposed a possible way to calculate P(w|c_j), but it uses all words, which carries a high risk of overfitting."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-92",
"text": "As we know, not all words have sentiment polarity, e.g. \"a\" and \"one\","
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-93",
"text": "while some words always have polarity, such as \"good\", \"hate\" and \"excellent\"."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-94",
"text": "In addition, some words only have sentiment polarity in specific domains."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-95",
"text": "For example, \"tough\" in diamond reviews indicates that the diamond is of good quality, while in the food domain it means hard to chew."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-96",
"text": "Hence, two things are needed in order to achieve the goal of lifelong learning."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-97",
"text": "We need to find the words that always have sentiment polarity, and be careful with those words that only show polarity in specific domains."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-99",
"text": "**LIFELONG SEMI-SUPERVISED LEARNING FOR SENTIMENT CLASSIFICATION**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-100",
"text": "Although LSC [2] considered the differences among domains, it is still a typical supervised learning approach."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-101",
"text": "In this paper, we propose to learn in two stages:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-102",
"text": "(1) Initial Learning Stage: explore a basic set of sentiment words."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-103",
"text": "After that, the model should be able to classify a new domain with good performance."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-104",
"text": "(2) Self-study Stage: use the knowledge accumulated in the initial stage to handle new domains, and also fine-tune and consolidate the knowledge generated in the initial learning stage."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-106",
"text": "**INITIAL LEARNING STAGE**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-107",
"text": "In this stage, we need to train the model to remember some sentiment polarity words."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-108",
"text": "This requires us to find the words with sentiment polarity in each domain."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-109",
"text": "We need to answer two questions here:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-110",
"text": "(1) How do we determine the polarity of a word? (2) How many domains do we need for the initial learning stage?"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-111",
"text": "For the first question, we need to find which words mainly appear in positive or negative documents."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-112",
"text": "This means that for a word w with positive polarity, P(+|w) >> P(\u2212|w) or P(+|w) >> P(+)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-113",
"text": "In this paper, we will use O(w) = P(+|w)/P(+) to represent the polarity."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-114",
"text": "This is because P(c_j|w)/P(c_j) is easy to extend to multiclass classification problems."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-115",
"text": "According to the Bayesian formula, P(+|w)/P(+) = P(w|+)/P(w)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-117",
"text": "**SELF-STUDY STAGE**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-118",
"text": "In this stage, our main task is to explore which words have polarity."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-119",
"text": "We will mainly use these words to make predictions in new domains and assign pseudo-labels to them."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-120",
"text": "With the pseudo-labels, we are able to discover new words with polarity."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-121",
"text": "The following is the procedure for self-study:"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-122",
"text": "(1) Use the sentiment words accumulated from previous tasks to predict a new domain, then assign the prediction results as pseudo-labels."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-123",
"text": "(2) Use the reviews and pseudo-labels of the new domain as new training data to train the Na\u00efve Bayes model."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-124",
"text": "(3) Update the sentiment-word knowledge base."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-126",
"text": "**EXPERIMENT 6.1 DATASETS**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-127",
"text": "In the experiment, we use the same datasets as LSC [2] ."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-128",
"text": "They contain reviews from 20 domains crawled from Amazon.com, and each domain has 1,000 reviews (the distribution of positive and negative reviews is imbalanced)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-130",
"text": "**WORD POLARITY ANALYSIS**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-131",
"text": "To answer the first question of the initial learning stage, we need to know which words actually influence sentiment classification."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-132",
"text": "First, we calculate P(w|+) and P(w|\u2212) for each word."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-133",
"text": "Then, we define the polarity degree by O(w) = P(w|+)/P(w)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-134",
"text": "Finally, we choose only a specific percentage of words for prediction and observe whether the performance decreases."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-135",
"text": "In addition, we only consider words that appear at least 5 times per domain on average."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-136",
"text": "This is because we did not delete symbols and numbers from the data, and these characters may be noise in the training data."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-137",
"text": "We first sorted the words and symbols (no pre-processing was applied to the corpus in this paper) by the polarity O(w), and then kept a specific percentage of them, ranging from all words down to only 10%."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-138",
"text": "From Table 1 we can see that using no less than 30% of the words obtains the best average result."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-139",
"text": "This means that most words and symbols do not have an obvious sentiment orientation."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-140",
"text": "Hence, we keep only 30% of the words for the Na\u00efve Bayes model and even obtain a better F1 score."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-141",
"text": "Although the performance decreases on a single domain, better global performance can be achieved with only the sentiment words."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-142",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-143",
"text": "**REQUIREMENT FOR THE INITIAL LEARNING**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-144",
"text": "For the second question of the initial learning stage, the answer depends on the tasks."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-145",
"text": "In practice, all of the available labeled data should certainly be used for training."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-146",
"text": "The only question to consider is how much labeled data meets the minimum requirement."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-147",
"text": "For this sentiment classification task, one domain is absolutely insufficient."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-148",
"text": "Based on the experimental results, the initial learning stage needs at least two domains, and achieves much better performance with three domains."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-149",
"text": "Adding more domains does not significantly influence the performance."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-150",
"text": "Hence, three domains are enough for this task."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-151",
"text": "For different tasks, two labeled domains are necessary."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-152",
"text": "We suggest continuing to collect labeled domains until the performance on new domains becomes steady."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-154",
"text": "**SELF-STUDY LEARNING**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-155",
"text": "In the self-study learning stage, the learning is designed under the unsupervised learning setting."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-156",
"text": "In this stage, there is no labeled data."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-157",
"text": "Instead, we use the model generated in the initial learning stage to predict each domain and assign pseudo-labels to it."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-158",
"text": "After that, the model regards the pseudo-labels as real labels and continues training on the new domain."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-159",
"text": "With this method, the self-study stage can learn new domains well without any labeled data."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-160",
"text": "Table 2 shows the F1 scores of three models on 17 domains."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-161",
"text": "The first three domains were used for the initial learning stage. We use the macro-F1 score because the datasets are imbalanced and it reflects our performance on the minority classes."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-162",
"text": "We compared our model (Semi-Unsupervised Learning, SU-LML for short) with a Na\u00efve Bayes model trained only on the first three (source) domains (NB-S) and a Na\u00efve Bayes model trained on each domain with labels using 5-fold cross-validation (NB-T)."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-163",
"text": "We can see that our approach is significantly better than the other two approaches."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-164",
"text": "It even performs better than NB-T, a typical supervised learning baseline."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-165",
"text": "Figure 2 shows the results more clearly."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-166",
"text": "The comparisons to LSC and neural-based lifelong learning [4]"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-168",
"text": "**KNOWLEDGE GENERATED DURING LEARNING**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-169",
"text": "A further important contribution of this paper is that we discovered which words carry sentiment polarity."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-170",
"text": "For each word found with sentiment polarity, we increase its polarity score by one."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-171",
"text": "In addition, we add an extra score between 0 and 1 based on the O(w) rank."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-172",
"text": "From Table 3, we can see that most of the top words carry negative emotion, and most of them make sense."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-174",
"text": "**CONCLUSION AND OUTLOOK**"
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-175",
"text": "We proposed a semi-unsupervised lifelong sentiment classification approach in this paper."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-176",
"text": "It can accumulate knowledge from previous learning and then turn to self-study."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-177",
"text": "Only a small amount of labeled data is required in our approach, so it is well suited to industry scenarios."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-178",
"text": "Its performance even exceeds that of supervised learning, which shows that the knowledge reuse of lifelong learning is useful."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-179",
"text": "Although we only show binary classification here, the idea is also suitable for multi-class classification."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-180",
"text": "This approach can be used for all text classification tasks, not only sentiment classification."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-181",
"text": "Our model classifies documents using knowledge of the sentiment polarity of words, the same approach we human beings use."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-182",
"text": "We show that focusing on the goal behind learning tasks is more meaningful than merely finding a solution."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-183",
"text": "Understanding the words is more important than solving a sentiment classification task."
},
{
"sent_id": "bb74dd634a8fc5cdb2f4f3294b6bc5-C001-184",
"text": "We should learn the knowledge and skills for all tasks rather than a solution for a single task."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-23",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-24",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-25"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-27",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-28"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-36",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-37"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-42"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-46"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-52"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-61"
]
],
"cite_sentences": [
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-25",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-28",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-37",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-42",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-46",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-52",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-61"
]
},
"@MOT@": {
"gold_contexts": [
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-52",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-53"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-66",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-67"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-92",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-93",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-96",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-97"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-100"
]
],
"cite_sentences": [
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-52",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-66",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-100"
]
},
"@USE@": {
"gold_contexts": [
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-58"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-77"
],
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-127"
]
],
"cite_sentences": [
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-58",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-77",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-127"
]
},
"@EXT@": {
"gold_contexts": [
[
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-66",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-68",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-69",
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-70"
]
],
"cite_sentences": [
"bb74dd634a8fc5cdb2f4f3294b6bc5-C001-66"
]
}
}
},
"ABC_f54235664f013f0fec918222be9198_7": {
"x": [
{
"sent_id": "f54235664f013f0fec918222be9198-C001-117",
"text": "Table 4 ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-2",
"text": "How far can we get with unsupervised parsing if we make our training corpus several orders of magnitude larger than has hitherto been attempted?"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-3",
"text": "We present a new algorithm for unsupervised parsing using an all-subtrees model, termed U-DOP*, which parses directly with packed forests of all binary trees."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-4",
"text": "We train both on Penn's WSJ data and on the (much larger) NANC corpus, showing that U-DOP* outperforms a treebank-PCFG on the standard WSJ test set."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-5",
"text": "While U-DOP* performs worse than state-of-the-art supervised parsers on hand-annotated sentences, we show that the model outperforms supervised parsers when evaluated as a language model in syntax-based machine translation on Europarl."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-6",
"text": "We argue that supervised parsers miss the fluidity between constituents and non-constituents and that in the field of syntax-based language modeling the end of supervised parsing has come in sight."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-9",
"text": "A major challenge in natural language parsing is the unsupervised induction of syntactic structure."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-10",
"text": "While most parsing methods are currently supervised or semi-supervised (McClosky et al. 2006; Henderson 2004; Steedman et al. 2003) , they depend on hand-annotated data which are difficult to come by and which exist only for a few languages."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-11",
"text": "Unsupervised parsing methods are becoming increasingly important since they operate with raw, unlabeled data of which unlimited quantities are available."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-12",
"text": "There has been a resurgence of interest in unsupervised parsing during the last few years."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-13",
"text": "Where van Zaanen (2000) and Clark (2001) induced unlabeled phrase structure for small domains like the ATIS, obtaining around 40% unlabeled f-score, Klein and Manning (2002) report 71.1% f-score on Penn WSJ part-of-speech strings \u2264 10 words (WSJ10) using a constituent-context model called CCM."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-14",
"text": "Klein and Manning (2004) further show that a hybrid approach which combines constituency and dependency models, yields 77.6% f-score on WSJ10."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-15",
"text": "While Klein and Manning's approach may be described as an \"all-substrings\" approach to unsupervised parsing, an even richer model consists of an \"all-subtrees\" approach to unsupervised parsing, called U-DOP (Bod 2006) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-16",
"text": "U-DOP initially assigns all unlabeled binary trees to a training set, efficiently stored in a packed forest, and next trains subtrees thereof on a heldout corpus, either by taking their relative frequencies, or by iteratively training the subtree parameters using the EM algorithm (referred to as \"UML-DOP\")."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-17",
"text": "The main advantage of an all-subtrees approach seems to be the direct inclusion of discontiguous context that is not captured by (linear) substrings."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-18",
"text": "Discontiguous context is important not only for learning structural dependencies but also for learning a variety of noncontiguous constructions such as nearest \u2026 to\u2026 or take \u2026 by surprise."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-19",
"text": "Bod (2006) reports 82.9% unlabeled f-score on the same WSJ10 as used by Klein and Manning (2002, 2004)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-20",
"text": "Unfortunately, his experiments heavily depend on a priori sampling of subtrees, and the model becomes highly inefficient if larger corpora are used or longer sentences are included."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-21",
"text": "In this paper we will also test an alternative model for unsupervised all-subtrees parsing, termed U-DOP*, which is based on the DOP* estimator by Zollmann and Sima'an (2005), and which computes the shortest derivations for sentences from a held-out corpus using all subtrees from all trees from an extraction corpus."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-22",
"text": "While we do not achieve as high an f-score as the UML-DOP model in Bod (2006) , we will show that U-DOP* can operate without subtree sampling, and that the model can be trained on corpora that are two orders of magnitude larger than in Bod (2006) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-23",
"text": "We will extend our experiments to 4 million sentences from the NANC corpus (Graff 1995) , showing that an f-score of 70.7% can be obtained on the standard Penn WSJ test set by means of unsupervised parsing."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-24",
"text": "Moreover, U-DOP* can be directly put to use in bootstrapping structures for concrete applications such as syntax-based machine translation and speech recognition."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-25",
"text": "We show that U-DOP* outperforms the supervised DOP model if tested on the German-English Europarl corpus in a syntax-based MT system."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-26",
"text": "In the following, we first explain the DOP* estimator and discuss how it can be extended to unsupervised parsing."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-27",
"text": "In section 3, we discuss how a PCFG reduction for supervised DOP can be applied to packed parse forests."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-28",
"text": "In section 4, we will go into an experimental evaluation of U-DOP* on annotated corpora, while in section 5 we will evaluate U-DOP* on unlabeled corpora in an MT application."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-92",
"text": "We used the technique in Bod (1998, 2000) to include 'unknown' subtrees."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-29",
"text": "2 From DOP* to U-DOP* DOP* is a modification of the DOP model in Bod (1998) that results in a statistically consistent estimator and in an efficient training procedure (Zollmann and Sima'an 2005) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-30",
"text": "DOP* uses the all-subtrees idea from DOP: given a treebank, take all subtrees, regardless of size, to form a stochastic tree-substitution grammar (STSG)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-31",
"text": "Since a parse tree of a sentence may be generated by several (leftmost) derivations, the probability of a tree is the sum of the probabilities of the derivations producing that tree."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-32",
"text": "The probability of a derivation is the product of the subtree probabilities."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-33",
"text": "The original DOP model in Bod (1998) takes the occurrence frequencies of the subtrees in the trees normalized by their root frequencies as subtree parameters."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-34",
"text": "While efficient algorithms have been developed for this DOP model by converting it into a PCFG reduction (Goodman 2003 ), DOP's estimator was shown to be inconsistent by Johnson (2002) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-35",
"text": "That is, even with unlimited training data, DOP's estimator is not guaranteed to converge to the correct distribution."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-36",
"text": "Zollmann and Sima'an (2005) developed a statistically consistent estimator for DOP which is based on the assumption that maximizing the joint probability of the parses in a treebank can be approximated by maximizing the joint probability of their shortest derivations (i.e. the derivations consisting of the fewest subtrees)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-37",
"text": "This assumption is in consonance with the principle of simplicity, but there are also empirical reasons for the shortest derivation assumption: in Bod (2003) and Hearne and Way (2006) , it is shown that DOP models that select the preferred parse of a test sentence using the shortest derivation criterion perform very well."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-38",
"text": "On the basis of this shortest-derivation assumption, Zollmann and Sima'an come up with a model that uses held-out estimation: the training corpus is randomly split into two parts proportional to a fixed ratio: an extraction corpus EC and a held-out corpus HC."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-39",
"text": "Applied to DOP, held-out estimation would mean to extract fragments from the trees in EC and to assign their weights such that the likelihood of HC is maximized."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-40",
"text": "If we combine their estimation method with Goodman's reduction of DOP, Zollmann and Sima'an's procedure operates as follows:"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-41",
"text": "(1) Divide a treebank into an EC and HC (2) Convert the subtrees from EC into a PCFG reduction (3) Compute the shortest derivations for the sentences in HC (by simply assigning each subtree equal weight and applying Viterbi 1-best) (4) From those shortest derivations, extract the subtrees and their relative frequencies in HC to form an STSG. Zollmann and Sima'an show that the resulting estimator is consistent."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-42",
"text": "But equally important is the fact that this new DOP* model does not suffer from a decrease in parse accuracy if larger subtrees are included, whereas the original DOP model needs to be redressed by a correction factor to maintain this property (Bod 2003) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-43",
"text": "Moreover, DOP*'s estimation procedure is very efficient, while the EM training procedure for UML-DOP proposed in Bod (2006) is particularly time consuming and can only operate by randomly sampling trees."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-44",
"text": "Given the advantages of DOP*, we will generalize this model in the current paper to unsupervised parsing."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-45",
"text": "We will use the same all-subtrees methodology as in Bod (2006), but now applying the efficient and consistent DOP*-based estimator."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-46",
"text": "The resulting model, which we will call U-DOP*, roughly operates as follows:"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-47",
"text": "(1) Divide a corpus into an EC and HC (2) Assign all unlabeled binary trees to the sentences in EC, and store them in a shared parse forest (3) Convert the subtrees from the parse forests into a compact PCFG reduction (see next section) (4) Compute the shortest derivations for the sentences in HC (as in DOP*) (5) From those shortest derivations, extract the subtrees and their relative frequencies in HC to form an STSG (6) Use the STSG to compute the most probable parse trees for new test data by means of Viterbi n-best (see next section)"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-48",
"text": "We will use this U-DOP* model to investigate our main research question: how far can we get with unsupervised parsing if we make our training corpus several orders of magnitude larger than has hitherto been attempted?"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-50",
"text": "**CONVERTING SHARED PARSE FORESTS INTO PCFG REDUCTIONS**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-51",
"text": "The main computational problem is how to deal with the immense number of subtrees in U-DOP*. There already exists an efficient supervised algorithm that parses a sentence by means of all subtrees from a treebank."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-52",
"text": "This algorithm was extensively described in Goodman (2003) and converts a DOP-based STSG into a compact PCFG reduction that generates eight rules for each node in the treebank."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-53",
"text": "The reduction is based on the following idea: every node in every treebank tree is assigned a unique number which is called its address."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-54",
"text": "The notation A@k denotes the node at address k where A is the nonterminal labeling that node."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-55",
"text": "A new nonterminal is created for each node in the training data."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-56",
"text": "This nonterminal is called A_k."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-57",
"text": "Let a_j represent the number of subtrees headed by the node A@j, and let a represent the number of subtrees headed by nodes with nonterminal A, that is a = \u03a3_j a_j."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-58",
"text": "Then there is a PCFG with the following property: for every subtree in the training corpus headed by A, the grammar will generate an isomorphic subderivation."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-59",
"text": "For example, for a node (A@j (B@k, C@l)), the following eight PCFG rules in figure 1 are generated, where the number following a rule is its weight."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-60",
"text": "By simple induction it can be shown that this construction produces PCFG derivations isomorphic to DOP derivations (Goodman 2003: 130-133) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-61",
"text": "The PCFG reduction is linear in the number of nodes in the corpus."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-62",
"text": "While Goodman's reduction method was developed for supervised DOP where each training sentence is annotated with exactly one tree, the method can be generalized to a corpus where each sentence is annotated with all possible binary trees (labeled with the generalized category X), as long as we represent these trees by a shared parse forest."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-63",
"text": "A shared parse forest can be obtained by adding pointers from each node in the chart (or tabular diagram) to the nodes that caused it to be placed in the chart."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-64",
"text": "Such a forest can be represented in cubic space and time (see Billot and Lang 1989) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-65",
"text": "Then, instead of assigning a unique address to each node in each tree, as done by the PCFG reduction for supervised DOP, we now assign a unique address to each node in each parse forest for each sentence."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-66",
"text": "However, the same node may be part of more than one tree."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-67",
"text": "A shared parse forest is an AND-OR graph where AND-nodes correspond to the usual parse tree nodes, while OR-nodes correspond to distinct subtrees occurring in the same context."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-68",
"text": "The total number of nodes is cubic in sentence length n. This means that there are O(n^3) nodes that receive a unique address as described above, to which our PCFG reduction is then applied."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-69",
"text": "This is a huge reduction compared to Bod (2006) where the number of subtrees of all trees increases with the Catalan number, and only ad hoc sampling could make the method work."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-91",
"text": "The training for U-DOP* consisted in the computation of the shortest derivations for the HC from which the subtrees and their relative frequencies were extracted."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-70",
"text": "Since U-DOP* computes the shortest derivations (in the training phase) by combining subtrees from unlabeled binary trees, the PCFG reduction in figure 1 can be represented as in figure 2 , where X refers to the generalized category while B and C either refer to part-of-speech categories or are equivalent to X. The equal weights follow from the fact that the shortest derivation is equivalent to the most probable derivation if all subtrees are assigned equal probability (see Bod 2000; Goodman 2003) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-71",
"text": "Figure 2."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-72",
"text": "PCFG reduction for U-DOP*. Once we have parsed HC with the shortest derivations by the PCFG reduction in figure 2, we extract the subtrees from HC to form an STSG."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-73",
"text": "The number of subtrees in the shortest derivations is linear in the number of nodes (see Zollmann and Sima'an 2005, theorem 5.2) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-74",
"text": "This means that U-DOP* results in an STSG which is much more succinct than previous DOP-based STSGs."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-75",
"text": "Moreover, as in Bod (1998, 2000), we use an extension of Good-Turing to smooth the subtrees and to deal with 'unknown' subtrees."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-76",
"text": "Note that the direct conversion of parse forests into a PCFG reduction also allows us to efficiently implement the maximum likelihood extension of U-DOP known as UML-DOP (Bod 2006) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-77",
"text": "This can be accomplished by training the PCFG reduction on the held-out corpus HC by means of the expectation-maximization algorithm, where the weights in figure 1 are taken as initial parameters."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-78",
"text": "Both U-DOP*'s and UML-DOP's estimators are known to be statistically consistent. But while U-DOP*'s training phase merely consists of the computation of the shortest derivations and the extraction of subtrees, UML-DOP involves iterative training of the parameters."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-79",
"text": "Once we have extracted the STSG, we compute the most probable parse for new sentences by Viterbi n-best, summing up the probabilities of derivations resulting in the same tree (the exact computation of the most probable parse is NP-hard; see Sima'an 1996)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-80",
"text": "We have incorporated the technique by Huang and Chiang (2005) into our implementation which allows for efficient Viterbi n-best parsing."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-82",
"text": "**EVALUATION ON HAND-ANNOTATED CORPORA**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-83",
"text": "To evaluate U-DOP* against UML-DOP and other unsupervised parsing models, we started out with three corpora that are also used in Klein and Manning (2002, 2004) and Bod (2006): Penn's WSJ10, which contains 7422 sentences \u2264 10 words after removing empty elements and punctuation, and the German NEGRA10 corpus and the Chinese Treebank CTB10, both containing 2200+ sentences \u2264 10 words after removing punctuation."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-84",
"text": "As with most other unsupervised parsing models, we train and test on p-o-s strings rather than on word strings."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-85",
"text": "The extension to word strings is straightforward as there exist highly accurate unsupervised part-of-speech taggers (e.g. Sch\u00fctze 1995) which can be directly combined with unsupervised parsers, but for the moment we will stick to p-o-s strings (we will come back to word strings in section 5)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-86",
"text": "Each corpus was divided into 10 training/test set splits of 90%/10% (n-fold testing), and each training set was randomly divided into two equal parts, that serve as EC and HC and vice versa."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-87",
"text": "We used the same evaluation metrics for unlabeled precision (UP) and unlabeled recall (UR) as in Klein and Manning (2002, 2004)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-88",
"text": "The two metrics of UP and UR are combined by the unlabeled f-score F1 = 2*UP*UR/(UP+UR)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-89",
"text": "All trees in the test set were binarized beforehand, in the same way as in Bod (2006) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-90",
"text": "For UML-DOP the decrease in cross-entropy became negligible after at most 18 iterations."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-93",
"text": "Table 1 shows the f-scores for U-DOP* and UML-DOP against the f-scores for U-DOP reported in Bod (2006), the CCM model in Klein and Manning (2002), and the DMV dependency model in Klein and Manning (2004). It should be kept in mind that an exact comparison can only be made between U-DOP* and UML-DOP in table 1, since these two models were tested on 90%/10% splits, while the other models were applied to the full WSJ10, NEGRA10 and CTB10 corpora."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-94",
"text": "Table 1 shows that U-DOP* performs worse than UML-DOP in all cases, although the differences are small and were statistically significant only for WSJ10 under paired t-testing."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-95",
"text": "As explained above, the main advantage of U-DOP* over UML-DOP is that it works with a more succinct grammar extracted from the shortest derivations of HC."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-96",
"text": "Table 2 shows the size of the grammar (number of rules or subtrees) of the two models for resp."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-97",
"text": "Penn WSJ10, the entire Penn WSJ and the first 2 million sentences from the NANC (North American News Text) corpus which contains a total of approximately 24 million sentences from different news sources."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-98",
"text": "Table 2 ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-99",
"text": "Grammar size of U-DOP* and UML-DOP for WSJ10 (7.7K sentences), WSJ (50K sentences) and the first 2,000K sentences from NANC."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-101",
"text": "**MODEL**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-102",
"text": "Note that while U-DOP* is about two orders of magnitude smaller than UML-DOP for the WSJ10, it is almost three orders of magnitude smaller for the first 2 million sentences of the NANC corpus."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-103",
"text": "Thus even if U-DOP* does not give the highest f-score in table 1, it is more apt to be trained on larger data sets."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-104",
"text": "In fact, a well-known advantage of unsupervised methods over supervised methods is the availability of almost unlimited amounts of text, as exploited by McClosky et al. (2006) in improving a supervised parser by self-training."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-105",
"text": "In our experiments below we will start by mixing subsets from the NANC's WSJ data with Penn's WSJ data."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-106",
"text": "Next, we will do the same with 2 million sentences from the LA Times in the NANC corpus, and finally we will mix all data together for inducing a U-DOP* model."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-107",
"text": "From Penn's WSJ, we only use sections 2 to 21 for training (just as in supervised parsing) and section 23 (\u2264100 words) for testing, so as to compare our unsupervised results with some binarized supervised parsers."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-108",
"text": "The NANC data was first split into sentences by means of a simple discriminative model."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-109",
"text": "It was next p-o-s tagged with the TnT tagger (Brants 2000), which was trained on the Penn Treebank so that the same tag set was used."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-110",
"text": "Next, we added subsets of increasing size from the NANC p-o-s strings to the 40,000 Penn WSJ p-o-s strings."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-111",
"text": "Each time the resulting corpus was split into two halves, and the shortest derivations were computed for one half by using the PCFG reduction from the other half and vice versa."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-112",
"text": "The resulting trees were used for extracting an STSG which in turn was used to parse section 23 of Penn's WSJ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-113",
"text": "Table 3 indicates that there is a monotonic increase in f-score on the WSJ test set when NANC text is added to our training data in both cases, independent of whether the sentences come from the WSJ domain or the LA Times domain."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-114",
"text": "Although the effect of adding LA Times data is weaker than adding WSJ data, it is noteworthy that the unsupervised induction of trees from the LA Times domain still improves the f-score even if the test data are from a different domain."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-115",
"text": "We also investigated the effect of adding the LA Times data to the total mix of Penn's WSJ and NANC's WSJ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-116",
"text": "Table 4 shows the results of this experiment, where the baseline of 0 sentences thus starts with the 2,040k sentences from the combined Penn-NANC WSJ data."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-118",
"text": "Results of U-DOP* on section 23 from Penn's WSJ by mixing sentences from the combined Penn-NANC WSJ with additions from NANC's LA Times."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-119",
"text": "As seen in table 4, the f-score continues to increase even when adding LA Times data to the large combined set of Penn-NANC WSJ sentences."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-120",
"text": "The highest f-score is obtained by adding 2,000k sentences, resulting in a total training set of 4,040k sentences."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-121",
"text": "We believe that our result is quite promising for the future of unsupervised parsing."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-122",
"text": "In putting our best f-score in table 4 into perspective, it should be kept in mind that the gold standard trees from Penn-WSJ section 23 were binarized."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-123",
"text": "It is well known that such a binarization has a negative effect on the f-score."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-124",
"text": "Bod (2006) reports that an unbinarized treebank grammar achieves an average 72.3% f-score on WSJ sentences \u2264 40 words, while the binarized version achieves only 64.6% f-score."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-125",
"text": "To compare U-DOP*'s results against some supervised parsers, we additionally evaluated a PCFG treebank grammar and the supervised DOP* parser using the same test set."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-126",
"text": "For these supervised parsers, we employed the standard training set, i.e. Penn's WSJ sections 2-21, but used only the p-o-s strings, as we did for our unsupervised U-DOP* model."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-127",
"text": "Table 5."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-128",
"text": "Comparison between the (best version of) U-DOP*, the supervised treebank PCFG and the supervised DOP* for section 23 of Penn's WSJ. As seen in Table 5, U-DOP* outperforms the binarized treebank PCFG on the WSJ test set."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-129",
"text": "While a similar result was obtained in Bod (2006), the absolute difference between unsupervised parsing and the treebank grammar there was extremely small (1.8%), whereas the difference in Table 5 is 7.2%, corresponding to a 19.7% error reduction."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-130",
"text": "Our f-score remains behind the supervised version of DOP*, but the gap narrows as more training data is added to U-DOP*."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-132",
"text": "**EVALUATION ON UNLABELED CORPORA IN A PRACTICAL APPLICATION**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-133",
"text": "Our experiments so far have shown that despite the addition of large amounts of unlabeled training data, U-DOP* is still outperformed by the supervised DOP* model when tested on hand-annotated corpora like the Penn Treebank. Yet it is well known that any evaluation on hand-annotated corpora unreasonably favors supervised parsers."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-134",
"text": "There is thus a quest for designing an evaluation scheme that is independent of annotations."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-135",
"text": "One way to go would be to compare supervised and unsupervised parsers as a syntax-based language model in a practical application such as machine translation (MT) or speech recognition."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-136",
"text": "In Bod (2007) , we compared U-DOP* and DOP* in a syntax-based MT system known as Data-Oriented Translation or DOT (Poutsma 2000; Groves et al. 2004 )."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-137",
"text": "The DOT model starts with a bilingual treebank where each tree pair constitutes an example translation and where translationally equivalent constituents are linked."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-138",
"text": "Similar to DOP, the DOT model uses all linked subtree pairs from the bilingual treebank to form an STSG of linked subtrees, which are used to compute the most probable translation of a target sentence given a source sentence (see Hearne and Way 2006) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-139",
"text": "What we did in Bod (2007) was to let both DOP* and U-DOP* compute the best trees directly for the word strings in the German-English Europarl corpus (Koehn 2005), which contains about 750,000 sentence pairs."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-140",
"text": "Differently from U-DOP*, DOP* needed to be trained on annotated data, for which we used respectively the Negra and the Penn treebank."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-141",
"text": "Of course, it is well-known that a supervised parser's f-score decreases if it is transferred to another domain: for example, the (non-binarized) WSJ-trained DOP model in Bod (2003) decreases from around 91% to 85.5% f-score if tested on the Brown corpus."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-142",
"text": "Yet, this score is still considerably higher than the accuracy obtained by the unsupervised U-DOP model, which achieves 67.6% unlabeled f-score on Brown sentences."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-143",
"text": "Our main question of interest is to what extent this difference in accuracy on hand-annotated corpora carries over when tested in the context of a concrete application like MT."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-144",
"text": "This is not a trivial question, since U-DOP* learns 'constituents' for word sequences such as Ich m\u00f6chte (\"I would like to\") and There are (Bod 2007) , which are usually hand-annotated as non-constituents."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-145",
"text": "While U-DOP* is punished for this 'incorrect' prediction if evaluated on the Penn Treebank, it may be rewarded for this prediction if evaluated in the context of machine translation using the Bleu score (Papineni et al. 2002) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-146",
"text": "Thus similar to Chiang (2005) , U-DOP can discover non-syntactic phrases, or simply \"phrases\", which are typically neglected by linguistically syntax-based MT systems."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-147",
"text": "At the same time, U-DOP* can also learn discontiguous constituents that are neglected by phrase-based MT systems (Koehn et al. 2003) ."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-148",
"text": "In our experiments, we used both U-DOP* and DOP* to predict the best trees for the German-English Europarl corpus."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-149",
"text": "Next, we assigned links between every two nodes in the respective trees for each sentence pair."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-150",
"text": "For a 2,000-sentence test set from a different part of the Europarl corpus, we computed the most probable target sentence (using Viterbi n-best)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-151",
"text": "The Bleu score was used to measure translation accuracy, calculated by the NIST script with its default settings."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-152",
"text": "As a baseline we compared our results with the publicly available phrase-based system Pharaoh (Koehn et al. 2003) , using the default feature set."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-153",
"text": "Table 6 shows for each system the Bleu score together with a description of the productive units."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-154",
"text": "'U-DOT' refers to 'Unsupervised DOT' based on U-DOP*, while DOT is based on DOP*."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-157",
"text": "Table 6 (columns: System, Productive Units, Bleu score)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-158",
"text": "Comparing U-DOP* and DOP* in syntaxbased MT on the German-English Europarl corpus against the Pharaoh system."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-159",
"text": "The table shows that the unsupervised U-DOT model outperforms the supervised DOT model by 0.059 Bleu points."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-160",
"text": "Using Zhang's significance tester (Zhang et al. 2004) , it turns out that this difference is statistically significant (p < 0.001)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-161",
"text": "Also the difference between U-DOT and the baseline Pharaoh is statistically significant (p < 0.008)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-162",
"text": "Thus even if supervised parsers like DOP* outperform unsupervised parsers like U-DOP* by more than 10% on hand-parsed data, the same supervised parser is outperformed by the unsupervised parser when tested in an MT application."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-163",
"text": "Evidently, U-DOP's capacity to capture both constituents and phrases pays off in a concrete application and shows the shortcomings of models that only allow for either constituents (such as linguistically syntax-based MT) or phrases (such as phrase-based MT)."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-164",
"text": "In Bod (2007) we also show that U-DOT obtains virtually the same Bleu score as Pharaoh after eliminating subtrees with discontiguous yields."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-166",
"text": "**CONCLUSION: FUTURE OF SUPERVISED PARSING**"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-167",
"text": "In this paper we have shown that the accuracy of unsupervised parsing under U-DOP* continues to grow when enlarging the training set with additional data."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-168",
"text": "However, except for the simple treebank PCFG, U-DOP* scores worse than supervised parsers if evaluated on hand-annotated data."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-169",
"text": "At the same time U-DOP* significantly outperforms the supervised DOP* if evaluated in a practical application like MT."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-170",
"text": "We argued that this can be explained by the fact that U-DOP learns both constituents and (non-syntactic) phrases while supervised parsers learn constituents only. What should we learn from these results?"
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-171",
"text": "We believe that parsing, when separated from a task-based application, is mainly an academic exercise."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-172",
"text": "If we only want to mimic a treebank or implement a linguistically motivated grammar, then supervised, grammar-based parsers are preferred to unsupervised parsers. But if we want to improve a practical application with a syntax-based language model, then an unsupervised parser like U-DOP* might be superior."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-173",
"text": "The problem with most supervised (and semi-supervised) parsers is their rigid notion of constituent, which excludes 'constituents' like the German Ich m\u00f6chte or the French Il y a. Instead, it has become increasingly clear that the notion of constituent is fluid: it may sometimes agree with traditional syntax, but may just as well be in opposition to it."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-174",
"text": "Any sequence of words can be a unit of combination, including non-contiguous word sequences like closest X to Y. A parser which does not allow for this fluidity may be of limited use as a language model."
},
{
"sent_id": "f54235664f013f0fec918222be9198-C001-175",
"text": "Since supervised parsers seem to stick to categorical notions of constituent, we believe that in the field of syntax-based language models the end of supervised parsing has come in sight."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f54235664f013f0fec918222be9198-C001-15"
],
[
"f54235664f013f0fec918222be9198-C001-43"
]
],
"cite_sentences": [
"f54235664f013f0fec918222be9198-C001-15",
"f54235664f013f0fec918222be9198-C001-43"
]
},
"@DIF@": {
"gold_contexts": [
[
"f54235664f013f0fec918222be9198-C001-22"
],
[
"f54235664f013f0fec918222be9198-C001-68",
"f54235664f013f0fec918222be9198-C001-69"
],
[
"f54235664f013f0fec918222be9198-C001-129"
]
],
"cite_sentences": [
"f54235664f013f0fec918222be9198-C001-22",
"f54235664f013f0fec918222be9198-C001-69",
"f54235664f013f0fec918222be9198-C001-129"
]
},
"@USE@": {
"gold_contexts": [
[
"f54235664f013f0fec918222be9198-C001-22"
],
[
"f54235664f013f0fec918222be9198-C001-76"
],
[
"f54235664f013f0fec918222be9198-C001-83"
],
[
"f54235664f013f0fec918222be9198-C001-89"
],
[
"f54235664f013f0fec918222be9198-C001-93"
]
],
"cite_sentences": [
"f54235664f013f0fec918222be9198-C001-22",
"f54235664f013f0fec918222be9198-C001-76",
"f54235664f013f0fec918222be9198-C001-83",
"f54235664f013f0fec918222be9198-C001-89",
"f54235664f013f0fec918222be9198-C001-93"
]
},
"@EXT@": {
"gold_contexts": [
[
"f54235664f013f0fec918222be9198-C001-45"
]
],
"cite_sentences": [
"f54235664f013f0fec918222be9198-C001-45"
]
}
}
},
"ABC_d576de5c19d7ff62a143b0d4d56135_7": {
"x": [
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-2",
"text": "In order to determine argument structure in text, one must understand how individual components of the overall argument are linked."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-3",
"text": "This work presents the first neural network-based approach to link extraction in argument mining."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-4",
"text": "Specifically, we propose a novel architecture that applies Pointer Network sequence-tosequence attention modeling to structural prediction in discourse parsing tasks."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-5",
"text": "We then develop a joint model that extends this architecture to simultaneously address the link extraction task and the classification of argument components."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-6",
"text": "The proposed joint model achieves state-of-the-art results on two separate evaluation corpora, showing far superior performance than the previously proposed corpus-specific and heavily feature-engineered models."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-7",
"text": "Furthermore, our results demonstrate that jointly optimizing for both tasks is crucial for high performance."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-10",
"text": "An important goal in argument mining is to understand the structure in argumentative text (Persing and Ng, 2016; Peldszus and Stede, 2015; Stab and Gurevych, 2016; Nguyen and Litman, 2016) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-11",
"text": "One fundamental assumption when working with argumentative text is the presence of Argument Components (ACs)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-12",
"text": "The types of ACs are generally characterized as a claim or a premise (Govier, 2013), with premises acting as support (or possibly attack) units for claims (though some corpora have further AC types, such as major claim (Stab and Gurevych, 2014b, 2016))."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-13",
"text": "The task of processing argument structure encapsulates four distinct subtasks (our work focuses on subtasks 2 and 3): (1) Given a sequence of tokens that represents an entire argumentative text, determine the token subsequences that constitute non-intersecting ACs; (2) Given an AC, determine the type of AC (claim, premise, etc.); (3) Given a set/list of ACs, determine which ACs have directed links that encapsulate overall argument structure; (4) Given two linked ACs, determine whether the link is a supporting or attacking relation."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-14",
"text": "This can be labeled as a 'micro' approach to argument mining (Stab and Gurevych, 2016) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-15",
"text": "In contrast, there have been a number of efforts to identify argument structure at a higher level (Boltuzic and \u0160najder, 2014; Ghosh et al., 2014; Cabrio and Villata, 2012), as well as slightly re-ordering the pipeline with respect to AC types (Rinott et al., 2015)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-16",
"text": "There are two key assumptions our work makes going forward."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-17",
"text": "First, we assume subtask 1 has been completed, i.e. ACs have already been identified."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-18",
"text": "Second, we follow previous work that assumes a tree structure for the linking of ACs (Palau and Moens, 2009; Cohen, 1987; Peldszus and Stede, 2015; Stab and Gurevych, 2016) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-19",
"text": "Specifically, a given AC can only have a single outgoing link, but can have numerous incoming links."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-20",
"text": "Furthermore, there is a 'head' component that has no outgoing link (the top of the tree)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-21",
"text": "Depending on the corpus (see Section 4), an argument structure can be either a single tree or a forest, consisting of multiple trees."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-22",
"text": "Figure 1 shows an example that we will use throughout the paper to concretely explain how our approach works."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-23",
"text": "First, the left side of the figure presents the raw text of a paragraph in a persuasive essay (Stab and Gurevych, 2016) , with the ACs contained in square brackets."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-24",
"text": "Squiggly vs straight underlining differentiates between claims and premises, respectively."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-25",
"text": "The ACs have been annotated as to how they are linked, and the right side of the figure reflects this structure."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-26",
"text": "Figure 1: An example of argument structure with four ACs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-27",
"text": "The left side shows raw text that has been annotated for the presence of ACs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-28",
"text": "Squiggly or straight underlining means an AC is a claim or premise, respectively."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-29",
"text": "The ACs in the text have also been annotated for links to other ACs, which is shown in the right figure."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-30",
"text": "ACs 3 and 4 are premises that link to another premise, AC2."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-31",
"text": "Finally, AC2 links to a claim, AC1."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-32",
"text": "AC1 therefore acts as the central argumentative component."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-33",
"text": "The argument structure with four ACs forms a tree, where AC2 has two incoming links, and AC1 acts as the head, with no outgoing links."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-34",
"text": "We also specify the type of AC, with the head AC marked as a claim and the remaining ACs marked as premises."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-35",
"text": "Lastly, we note that the order of argument components can be a strong indicator of how components should relate."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-36",
"text": "Linking to the first argument component can provide a competitive baseline heuristic (Peldszus and Stede, 2015; Stab and Gurevych, 2016) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-37",
"text": "Given the above considerations, we propose that sequence-to-sequence attention modeling, in the spirit of a Pointer Network (PN) (Vinyals et al., 2015b) , can be effective for predicting argument structure."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-38",
"text": "To the best of our knowledge, a clean, elegant implementation of a PN-based model has yet to be introduced for discourse parsing tasks."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-39",
"text": "A PN is a sequence-to-sequence model that outputs a distribution over the encoding indices at each decoding timestep."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-40",
"text": "More generally, it is a recurrent model with attention (Bahdanau et al., 2014) , and we claim that as such, it is promising for link extraction because it inherently possesses three important characteristics: (1) it is able to model the sequential nature of ACs, (2) it constrains ACs to have a single outgoing link, thus partly enforcing the tree structure, and (3) the hidden representations learned by the model can be used for jointly predicting multiple subtasks."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-41",
"text": "Furthermore, we believe the sequence-to-sequence aspect of the model provides two distinct benefits: (1) it allows for two separate representations of a single AC (one for the source and one for the target of the link), and (2) the decoder network could learn to predict correct sequences of linked indices, which is a second recurrence over ACs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-42",
"text": "Note that we also test the sequence-to-sequence architecture against a simplified model that only uses hidden states from an encoding network to make predictions (see Section 5)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-43",
"text": "The main technical contribution of our work is a joint model that simultaneously predicts links between ACs and determines their type."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-44",
"text": "Our joint model uses the hidden representation of ACs produced during the encoding step (see Section 3.4)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-45",
"text": "While PNs were originally proposed to allow a variable length decoding sequence, our model differs in that it decodes for the same number of timesteps as there are inputs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-46",
"text": "This is a key insight that allows for a sequence-to-sequence model to be used for structural prediction."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-47",
"text": "Aside from the partial assumption of tree structure in the argumentative text, our models do not make any additional assumptions about the AC types or connectivity, unlike the work of Peldszus (2014) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-48",
"text": "Lastly, in respect to the broad task of parsing, our model is flexible because it can easily handle non-projective, multi-root dependencies."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-49",
"text": "We evaluate our models on the corpora of Stab and Gurevych (2016) and Peldszus (2014), and compare our results with the results of the aforementioned authors."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-50",
"text": "Our results show that (1) joint modeling is imperative for competitive performance on the link extraction task, (2) the presence of the second recurrence improves performance over a non-sequence-to-sequence model, and (3) the joint model can outperform models with heavy feature-engineering and corpus-specific constraints."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-52",
"text": "**RELATED WORK**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-53",
"text": "Palau and Moens (2009) is an early work in argument mining, using a hand-crafted Context-Free Grammar to determine the structure of ACs in a corpus of legal texts."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-54",
"text": "Lawrence et al. (2014) leverage a topic modeling-based AC similarity to uncover tree-structured arguments in philosophical texts."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-55",
"text": "Recent work offers data-driven approaches to the task of predicting links between ACs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-56",
"text": "Stab and Gurevych (2014b) approach the task as a binary classification problem."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-57",
"text": "The authors train an SVM with various semantic and structural features."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-58",
"text": "Peldszus and Stede (2015) have also used classification models for predicting the presence of links."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-59",
"text": "The first neural network-based model for argumentation mining was proposed by Laha and Raykar (2016) , who use two recurrent networks in end-to-end fashion to classify AC types."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-60",
"text": "Various authors have also proposed to jointly model link extraction with other subtasks from the argument mining pipeline, using either an Integer Linear Programming (ILP) framework (Persing and Ng, 2016; Stab and Gurevych, 2016) or directly feeding previous subtask predictions into a tree-based parser."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-61",
"text": "The former joint approaches are evaluated on an annotated corpus of persuasive essays (Stab and Gurevych, 2014a, 2016), and the latter on a corpus of microtexts (Peldszus, 2014)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-62",
"text": "The ILP framework is effective in enforcing a tree structure between ACs when predictions are made from otherwise naive base classifiers."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-63",
"text": "Recurrent neural networks have previously been proposed to model tree/graph structures in a linear manner."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-64",
"text": "Vinyals et al. (2015c) use a sequence-to-sequence model for the task of syntactic parsing."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-65",
"text": "Bowman et al. (2015) experiment on an artificial entailment dataset that is specifically engineered to capture recursive logic (Bowman et al., 2014) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-66",
"text": "Standard recurrent neural networks can take in complete sentence sequences and perform competitively with a recursive neural network."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-67",
"text": "Multitask learning for sequence-to-sequence has also been proposed (Luong et al., 2015) , though none of the models used a PN for prediction."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-68",
"text": "In the field of discourse parsing, the work of is the only work, to our knowledge, that incorporates attention into the network architecture."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-69",
"text": "However, the attention is only used in the process of creating representations of the text itself."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-70",
"text": "Attention is not used to predict the overall discourse structure."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-71",
"text": "In fact, the model still relies on a binary classifier to determine if textual components should have a link."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-72",
"text": "Arguably the most similar approach to ours is in the field of dependency parsing (Cheng et al., 2016) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-73",
"text": "The authors propose a model that performs 'queries' between word representations in order to determine a distribution over potential headwords."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-75",
"text": "**PROPOSED APPROACH**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-76",
"text": "In this section, we describe our approach to using a sequence-to-sequence model with attention for argument mining, specifically, identifying AC types and extracting the links between them."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-77",
"text": "We begin by giving a brief overview of these models."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-79",
"text": "**POINTER NETWORK**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-80",
"text": "A PN is a sequence-to-sequence model with attention (Bahdanau et al., 2014) that was proposed to handle decoding sequences over the encoding inputs, and can be extended to arbitrary sets (Vinyals et al., 2015a) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-81",
"text": "The original motivation for a pointer network was to allow networks to learn solutions to algorithmic problems, such as the traveling salesperson and convex hull problems, where the solution is a sequence over input points."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-82",
"text": "The PN model is trained on input/output sequence pairs (E, D), where E is the source and D is the target (our choice of E,D is meant to represent the encoding, decoding steps of the sequence-to-sequence model)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-83",
"text": "Given model parameters \u0398, we apply the chain rule to determine the probability of a single training example:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-84",
"text": "p(D|E;\u0398) = \u220f_{i=1}^{m(E)} p(D i |D 1 , ..., D i\u22121 , E; \u0398) (1), where the function m signifies that the number of decoding timesteps is a function of each individual training example."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-85",
"text": "We will discuss shortly why we need to modify the original definition of m for our application."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-86",
"text": "By taking the log-likelihood of Equation 1, we arrive at the optimization objective:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-87",
"text": "\u0398* = argmax_\u0398 \u2211_{(E,D)} log p(D|E;\u0398), (2) which is the sum over all training example pairs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-88",
"text": "The PN uses Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for sequential modeling, which produces a hidden layer h at each encoding/decoding timestep."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-89",
"text": "In practice, the PN has two separate LSTMs, one for encoding and one for decoding."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-90",
"text": "Figure 2: Applying a Pointer Network to the example paragraph in Figure 1, with LSTMs unrolled over time."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-91",
"text": "Note that D1 points to itself to denote that it has no outgoing link and is therefore the head of a tree."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-92",
"text": "Thus, we refer to encoding hidden layers as e, and decoding hidden layers as d."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-93",
"text": "The PN uses a form of content-based attention (Bahdanau et al., 2014) to allow the model to produce a distribution over input elements."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-94",
"text": "This can also be thought of as a distribution over input indices, wherein a decoding step 'points' to the input."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-95",
"text": "Formally, given encoding hidden states (e 1 , ..., e n ), the model calculates p(D i |D 1 , ..., D i\u22121 , E) as follows:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-96",
"text": "where matrices W 1 , W 2 and vector v are parameters of the model (along with the LSTM parameters used for encoding and decoding)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-97",
"text": "In Equation 3, prior to taking the dot product with v, the resulting transformation can be thought of as creating a joint hidden representation of inputs i and j. Vector u i in equation 4 is of length n, and index j corresponds to input element j. Therefore, by taking the softmax of u i , we are able to create a distribution over the input."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-99",
"text": "**LINK EXTRACTION AS SEQUENCE MODELING**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-100",
"text": "A given piece of text has a set of ACs, which occur in a specific order in the text: (C 1 , ..., C n )."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-101",
"text": "Therefore, at encoding timestep i, the model is fed a representation of C i ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-102",
"text": "Since the representation is large and sparse (see Section 3.3 for details on how we represent ACs), we add a fully-connected layer before the LSTM input."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-103",
"text": "Given a representation R i for AC C i , the LSTM input A i is calculated as:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-104",
"text": "where W rep , b rep in turn become model parameters, and \u03c3 is the sigmoid function 1 ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-105",
"text": "Similarly, the decoding network applies a fully-connected layer with sigmoid activation to its inputs, see Figure 3 ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-106",
"text": "At encoding step i, the encoding LSTM produces hidden layer e i , which can be thought of as a hidden representation of AC C i ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-107",
"text": "In order to make sequence-to-sequence modeling applicable to the problem of link extraction, we explicitly set the number of decoding timesteps to be equal to the number of input components."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-108",
"text": "Using notation from Equation 1, the decoding sequence length for an encoding sequence E is simply m(E) = |{C 1 , ..., C n }|, which is trivially equal to n. By constructing the decoding sequence in this manner, we can associate decoding timestep i with AC C i ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-109",
"text": "From Equation 4, decoding timestep i will output a distribution over input indices."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-110",
"text": "The result of this distribution will indicate to which AC component C i links."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-111",
"text": "Recall there is a possibility that an AC has no outgoing link, such as if it's the root of the tree."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-112",
"text": "In this case, we state that if AC C i does not have an outgoing link, decoding step D i will output index i. Conversely, if D i outputs index j, such that j is not equal to i, this implies that C i has an outgoing link to C j ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-113",
"text": "For the argument structure in Figure 1 , the corresponding decoding sequence is (1, 1, 2, 2)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-114",
"text": "The topology of this decoding sequence is illustrated in Figure 2 ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-115",
"text": "Observe how C 1 points to itself since it has no outgoing link."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-116",
"text": "Finally, we note that we have a Bidirectional LSTM (Graves and Schmidhuber, 2005) as the encoder, unlike the model proposed by Vinyals et al. (2015b) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-117",
"text": "Thus, e i is the concatenation of forward and backward hidden states \u2212 \u2192 e i and \u2190 \u2212 e n\u2212i+1 , produced by two separate LSTMs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-118",
"text": "The decoder remains a standard forward LSTM."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-120",
"text": "**REPRESENTING ARGUMENT COMPONENTS**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-121",
"text": "At each timestep of the encoder, the network takes in a representation of an AC."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-122",
"text": "Each AC is itself Figure 3: Architecture of the joint model applied to the example in Figure 1 . Note that D1 points to itself to denote that it has not outgoing link and is therefore the head of a tree."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-123",
"text": "a sequence of tokens, similar to the QuestionAnswering dataset from Weston et al. (2015) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-124",
"text": "We follow the work of Stab and Gurevych (2016) and focus on three different types of features to represent our ACs: (1) Bag-of-Words of the AC; (2) Embedding representation based on GloVe embeddings (Pennington et al., 2014) , which uses average, max, and min pooling across the token embeddings; (3) Structural features: Whether or not the AC is the first AC in a paragraph, and whether the AC is in an opening, body, or closing paragraph."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-125",
"text": "See Section 6 for an ablation study of the proposed features."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-127",
"text": "**JOINT NEURAL MODEL**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-128",
"text": "Up to this point, we focused on the task of extracting links between ACs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-129",
"text": "However, recent work has shown that joint models that simultaneously try to complete multiple aspects of the subtask pipeline outperform models that focus on a single subtask (Persing and Ng, 2016; Stab and Gurevych, 2014b; Peldszus and Stede, 2015) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-130",
"text": "Therefore, we will modify the single-task architecture so that it would allow us to perform AC classification (Kwon et al., 2007; Rooney et al., 2012) together with link prediction."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-131",
"text": "Knowledge of an individual subtask's predictions can aid in other subtasks."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-132",
"text": "For example, claims do not have an outgoing link, so knowing the type of AC can aid in the link prediction task."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-133",
"text": "This can be seen as a way of regularizing the hidden representations from the encoding component (Che et al., 2015) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-134",
"text": "At each timestep, predicting AC type is a straightforward classification task: given AC C i , we need to predict whether it is a claim, premise, or possibly major claim."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-135",
"text": "More generally, this is another sequence modeling problem: given input sequence E, we want to predict a sequence of argument types T ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-136",
"text": "For encoding timestep i, the model creates hidden representation e i ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-137",
"text": "This can be thought of as a representation of AC C i ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-138",
"text": "Therefore, our joint model will simply pass this representation through a fully-connected layer as follows:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-139",
"text": "where W cls , b cls become elements of the model parameters, \u0398. The dimensionality of W cls , b cls is determined by the number of classes."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-140",
"text": "Lastly, we use softmax to form a distribution over the possible classes."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-141",
"text": "Consequently, the probability of predicting the component type at timestep i is defined as:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-142",
"text": "Finally, combining this new prediction task with Equation 2, we arrive at the new training objective:"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-143",
"text": "which simply sums the costs of the individual prediction tasks, and the second summation is the cost for the new task of predicting AC type."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-144",
"text": "\u03b1 \u2208 [0, 1] is a hyperparameter that specifies how we weight the two prediction tasks in our cost function."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-145",
"text": "The architecture of the joint model, applied to our ongoing example, is illustrated in Figure 3 ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-147",
"text": "**EXPERIMENTAL DESIGN**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-148",
"text": "As we have mentioned, our work assumes that ACs have already been identified."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-149",
"text": "The order of ACs corresponds directly to the order in which the ACs appear in the text."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-150",
"text": "We test the effectiveness of our proposed model on a dataset of persuasive essays (PEC) (Stab and Gurevych, 2016) , as well as a dataset of microtexts (MTC) (Peldszus, 2014) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-151",
"text": "The feature space for the PEC has roughly 3,000 dimensions, and the MTC feature space has between 2,500 and 3,000 dimensions, depending on the data split."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-152",
"text": "The PEC contains a total of 402 essays, with a frozen set of 80 essays held out for testing."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-153",
"text": "There are three AC types in this corpus: major claim, claim, and premise."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-154",
"text": "In this corpus, individual structures can be either trees or forests."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-155",
"text": "Also, in this corpus, each essay has multiple paragraphs, and argument structure is only uncovered within a given paragraph."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-156",
"text": "The MTC contains 112 short texts."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-157",
"text": "Unlike the PEC, each text in this corpus is itself a complete example, as well as a single tree."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-158",
"text": "Since the dataset is small, the authors have created 10 sets of 5-fold cross-validation, reporting the the average across all splits for final model evaluation."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-159",
"text": "This corpus contains only two types of ACs: claim and premise."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-160",
"text": "Note that link prediction is directed, i.e., predicting a link between the pair"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-161",
"text": "We implement our models in TensorFlow (Abadi et al., 2015) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-162",
"text": "We use the following parameters: hidden input dimension size 512, hidden layer size 256 for the bidirectional LSTMs, hidden layer size 512 for the LSTM decoder, \u03b1 equal to 0.5, and dropout (Srivastava et al., 2014) of 0.9."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-163",
"text": "We believe the need for such high dropout is due to the small amounts of training data (Zarrella and Marsh, 2016) , particularly in the MTC."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-164",
"text": "All models are trained with Adam optimizer (Kingma and Ba, 2014 ) with a batch size of 16."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-165",
"text": "For a given training set, we randomly select 10% to become the validation set."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-166",
"text": "Training occurs for 4,000 epochs."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-167",
"text": "Once training is completed, we select the model with the highest validation accuracy (on the link prediction task) and evaluate it on the held-out test set."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-168",
"text": "At test time, we take a greedy approach and select the index of the probability distribution (whether link or type prediction) with the highest value."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-170",
"text": "**RESULTS**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-171",
"text": "The results of our experiments are presented in Tables 1 and 2."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-172",
"text": "For each corpus, we present f1 scores for the AC type classification experiment, with a macro-averaged score of the individual class f1 scores."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-173",
"text": "We also present the f1 scores for predicting the presence/absence of links between ACs, as well as the associated macro-average between these two values."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-174",
"text": "We implement and compare four types of neural models: 1) The previously described joint model from In both corpora we compare against the following previously proposed models: Base Classifier (Stab and Gurevych, 2016 ) is a feature-rich, taskspecific (AC type or link extraction) SVM classifier."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-175",
"text": "Neither of these classifiers enforce structural or global constraints."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-176",
"text": "Conversely, the ILP Joint Model (Stab and Gurevych, 2016) provides constraints by sharing prediction information between the base classifiers."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-177",
"text": "For example, the model attempts to enforce a tree structure among ACs within a given paragraph, as well as using incoming link predictions to better predict the type class claim."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-178",
"text": "For the MTC only, we also have the following comparative models: Simple (Peldszus and Stede, 2015 ) is a feature-rich logistic regression classifier."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-179",
"text": "Best EG (Peldszus and Stede, 2015) creates an Evidence Graph (EG) from the predictions of a set of base classifiers."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-180",
"text": "The EG models the potential argument structure, and offers a global optimization objective that the base classifiers attempt to optimize by adjusting their individual weights."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-181",
"text": "Lastly, MP+p (Peldszus and Stede, 2015) combines predictions from base classifiers with a Minimum Spanning Tree Parser (MSTParser)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-183",
"text": "**DISCUSSION**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-184",
"text": "First, we point out that the joint model achieves state-of-the-art on 10 of the 13 metrics in Tables 1 and 2 , including the highest results in all metrics on the PEC, as well as link prediction on the MTC."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-185",
"text": "The performance on the MTC is very en- couraging for several reasons."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-186",
"text": "First, the fact that the model can perform so well with only a hundred training examples is rather remarkable."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-187",
"text": "Second, although we motivate the use of an attention model due to the fact that it partially enforces a tree structure, other models we compare against explicitly contain further constraints (for example, only premises can have outgoing links)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-188",
"text": "Moreover, the MP+p model directly enforces the single tree constraint unique to the microtext corpus (the PEC allows forests)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-189",
"text": "Even though the joint model does not have the tree constraint directly encoded, it able to learn the structure effectively from the training examples so that it can outperform the Mp+p model for link prediction."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-190",
"text": "As for the other neural models, the joint model with no seq2seq performs competitively with the ILP joint model on the PEC, but trails the performance of the joint model."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-191",
"text": "We believe this is because the joint model is able to create two different representations for each AC, one each in the encoding/decoding state, which benefits performance in the two tasks."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-192",
"text": "We also believe that the joint model benefits from a second recurrence over the ACs, modeling the tree/forest structure in a linear manner."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-193",
"text": "Conversely, the joint model with no seq2seq must encode information relating to type as well as link prediction in a single hidden representation."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-194",
"text": "On one hand, the joint model no seq2seq outperforms the ILP model on link prediction, yet it is not able to match the ILP joint model's performance on type prediction, primarily due to the poor performance on predicting the major claim class."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-195",
"text": "Another interesting outcome is the importance of the fully-connected layer before the LSTM input."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-196",
"text": "This extra layer seems to be crucial for improving performance on this task."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-197",
"text": "The results dictate that even a simple fully-connected layer with sigmoid activation can provide a useful dimensionality reduction step."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-198",
"text": "Finally, and arguably most importantly, the single-task model, only optimized for link prediction, suffers a large drop in performance, conveying that the dual optimization of the joint model is vital for high performance in the link prediction task."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-199",
"text": "We believe this is because the joint optimization creates more expressive representations of the ACs, which capture the natural relation between AC type and AC linking."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-200",
"text": "Table 3 shows the results of an ablation study for AC feature representation."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-201",
"text": "Regarding link prediction, BOW features are clearly the most important, as their absence results in the highest drop in performance."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-202",
"text": "Conversely, the presence of structural features provides the smallest boost in performance, as the model is still able to record state- Table 4 : Results of binning test data by length of AC sequence."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-203",
"text": "* indicates that this bin does not contain any major claim labels, and this average only applies to claim and premise classes."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-204",
"text": "However, we do not disable the model from predicting this class: the model was able to avoid predicting this class on its own."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-205",
"text": "of-the-art results compared to the ILP Joint Model."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-206",
"text": "This shows that the Joint Model is able to capture structural cues through sequence modeling and semantics."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-207",
"text": "When considering type prediction, both BOW and structural features are important, and it is the embedding features that provide the least benefit."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-208",
"text": "The ablation results also provide an interesting insight into the effectiveness of different pooling strategies for using individual token embeddings to create a multi-word embedding."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-209",
"text": "The popular method of averaging embeddings (which is used by Stab and Gurevych (2016) in their system) is in fact the worst method, although its performance is still competitive with the previous state-of-the-art."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-210",
"text": "Conversely, max pooling results are on par with the joint model results in Table 1 ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-211",
"text": "Table 4 shows results on the PEC test set with the test examples binned by sequence length."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-212",
"text": "First, it is not surprising to see that the model performs best when the sequences are the shortest (for link prediction; type prediction actually sees the worst performance in the middle bin)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-213",
"text": "As the sequence length increases, the accuracy on link prediction drops."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-214",
"text": "This is possibly due to the fact that as the length increases, a given AC has more possibilities as to which other AC it can link to, making the task more difficult."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-215",
"text": "Conversely, there is actually a rise in no link prediction accuracy from the second to third row."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-216",
"text": "This is likely due to the fact that since the model predicts at most one outgoing link, it indirectly predicts no link for the remaining ACs in the sequence."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-217",
"text": "Since the chance probability is low for having a link between a given AC in a long sequence, the no link performance is actually better in longer sequences."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-218",
"text": "The results of the length-based binning could also potentially give insight into the poor performance on the type prediction task in the MTC."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-219",
"text": "Since the arguments in the MTC average 5 ACs, they would be in the second bin (row 2) of Table 4 ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-220",
"text": "The claim and premise f1 scores for this bin are similar to those from the same system's performance on the MTC."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-222",
"text": "**CONCLUSION**"
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-223",
"text": "In this paper we have proposed how to use a joint sequence-to-sequence model with attention (Vinyals et al., 2015b) to both extract links between ACs and classify AC type."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-224",
"text": "We evaluate our models on two corpora: a corpus of persuasive essays (Stab and Gurevych, 2016) , and a corpus of microtexts (Peldszus, 2014) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-225",
"text": "The Joint Model records state-of-the-art results on the persuasive essay corpus, as well as achieving state-of-the-art results for link prediction on the microtext corpus."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-226",
"text": "The results show that jointly modeling the two prediction tasks is critical for high performance."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-227",
"text": "Future work can attempt to learn the AC representations themselves, such as in Kumar et al. (2015) ."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-228",
"text": "Lastly, future work can integrate subtasks 1 and 4 into the model."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-229",
"text": "The representations produced by Equation 3 could potentially be used to predict link type, i.e. supporting or attacking (the fourth subtask in the pipeline)."
},
{
"sent_id": "d576de5c19d7ff62a143b0d4d56135-C001-230",
"text": "In addition, a segmenting technique, such as the one proposed by Weston et al. (2014) , can accomplish subtask 1."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d576de5c19d7ff62a143b0d4d56135-C001-10"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-13",
"d576de5c19d7ff62a143b0d4d56135-C001-14"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-36"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-60"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-176"
]
],
"cite_sentences": [
"d576de5c19d7ff62a143b0d4d56135-C001-10",
"d576de5c19d7ff62a143b0d4d56135-C001-14",
"d576de5c19d7ff62a143b0d4d56135-C001-36",
"d576de5c19d7ff62a143b0d4d56135-C001-60",
"d576de5c19d7ff62a143b0d4d56135-C001-176"
]
},
"@USE@": {
"gold_contexts": [
[
"d576de5c19d7ff62a143b0d4d56135-C001-18"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-22",
"d576de5c19d7ff62a143b0d4d56135-C001-23"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-36",
"d576de5c19d7ff62a143b0d4d56135-C001-37"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-49"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-124"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-150"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-174"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-224"
]
],
"cite_sentences": [
"d576de5c19d7ff62a143b0d4d56135-C001-18",
"d576de5c19d7ff62a143b0d4d56135-C001-23",
"d576de5c19d7ff62a143b0d4d56135-C001-36",
"d576de5c19d7ff62a143b0d4d56135-C001-49",
"d576de5c19d7ff62a143b0d4d56135-C001-124",
"d576de5c19d7ff62a143b0d4d56135-C001-150",
"d576de5c19d7ff62a143b0d4d56135-C001-174",
"d576de5c19d7ff62a143b0d4d56135-C001-224"
]
},
"@DIF@": {
"gold_contexts": [
[
"d576de5c19d7ff62a143b0d4d56135-C001-49",
"d576de5c19d7ff62a143b0d4d56135-C001-50"
]
],
"cite_sentences": [
"d576de5c19d7ff62a143b0d4d56135-C001-49"
]
},
"@MOT@": {
"gold_contexts": [
[
"d576de5c19d7ff62a143b0d4d56135-C001-174",
"d576de5c19d7ff62a143b0d4d56135-C001-175"
],
[
"d576de5c19d7ff62a143b0d4d56135-C001-209"
]
],
"cite_sentences": [
"d576de5c19d7ff62a143b0d4d56135-C001-174",
"d576de5c19d7ff62a143b0d4d56135-C001-209"
]
}
}
},
"ABC_586a0f40a9299ef2753d2b0575eff8_7": {
"x": [
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-26",
"text": "**MODEL**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-2",
"text": "We study the event detection problem using convolutional neural networks (CNNs) that overcome the two fundamental limitations of the traditional feature-based approaches to this task: complicated feature engineering for rich feature sets and error propagation from the preceding stages which generate these features."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-3",
"text": "The experimental results show that the CNNs outperform the best reported feature-based systems in the general setting as well as the domain adaptation setting without resorting to extensive external resources."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-4",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-5",
"text": "**INTRODUCTION**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-6",
"text": "We address the problem of event detection (ED): identifying instances of specified types of events in text."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-7",
"text": "Associated with each event mention is a phrase, the event trigger (most often a single verb or nominalization), which evokes that event."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-8",
"text": "Our task, more precisely stated, involves identifying event triggers and classifying them into specific types."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-9",
"text": "For instance, according to the ACE 2005 annotation guideline 1 , in the sentence \"A police officer was killed in New Jersey today\", an event detection system should be able to recognize the word \"killed\" as a trigger for the event \"Die\"."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-10",
"text": "This task is quite challenging, as the same event might appear in the form of various trigger expressions and an expression might represent different events in different contexts."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-11",
"text": "ED is a crucial component in the overall task of event extraction, which also involves event argument discovery."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-12",
"text": "Recent systems for event extraction have employed either a pipeline architecture with separate classifiers for trigger and argument labeling (Ji and Grishman, 2008; Gupta and Ji, 2009 ; Patwardhan 1 https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/ english-events-guidelines-v5."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-13",
"text": "4.3.pdf and Rilof, 2009; Liao and Grishman, 2011; McClosky et al., 2011; Huang and Riloff, 2012; Li et al., 2013a) or a joint inference architecture that performs the two subtasks at the same time to benefit from their inter-dependencies (Riedel and McCallum, 2011a; Riedel and McCallum, 2011b; Li et al., 2013b; Venugopal et al., 2014) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-14",
"text": "Both approaches have coped with the ED task by elaborately hand-designing a large set of features (feature engineering) and utilizing the existing supervised natural language processing (NLP) toolkits and resources (i.e name tagger, parsers, gazetteers etc) to extract these features to be fed into statistical classifiers."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-15",
"text": "Although this approach has achieved the top performance (Hong et al., 2011; Li et al., 2013b) , it suffers from at least two issues:"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-16",
"text": "(i) The choice of features is a manual process and requires linguistic intuition as well as domain expertise, implying additional studies for new application domains and limiting the capacity to quickly adapt to these new domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-17",
"text": "(ii) The supervised NLP toolkits and resources for feature extraction might involve errors (either due to the imperfect nature or the performance loss of the toolkits on new domains (Blitzer et al., 2006; Daum\u00e9 III, 2007; McClosky et al., 2010) ), probably propagated to the final event detector."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-18",
"text": "This paper presents a convolutional neural network (LeCun et al., 1988; Kalchbrenner et al., 2014) for the ED task that automatically learns features from sentences, and minimizes the dependence on supervised toolkits and resources for features, thus alleviating the error propagation and improving the performance for this task."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-19",
"text": "Due to the emerging interest of the NLP community in deep learning recently, CNNs have been studied extensively and applied effectively in various tasks: semantic parsing (Yih et al., 2014) , search query retrieval (Shen et al., 2014) , semantic matching (Hu et al., 2014) , sentence modeling and classification (Kalchbrenner et al., 2014; Kim, 2014), name tagging and semantic role labeling (Collobert et al., 2011) , relation classification and extraction (Zeng et al., 2014; Nguyen and Grishman, 2015) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-20",
"text": "However, to the best of our knowledge, this is the first work on event detection via CNNs so far."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-21",
"text": "First, we evaluate CNNs for ED in the general setting and show that CNNs, though not requiring complicated feature engineering, can still outperform the state-of-the-art feature-based methods extensively relying on the other supervised modules and manual resources for features."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-22",
"text": "Second, we investigate CNNs in a domain adaptation (DA) setting for ED."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-23",
"text": "We demonstrate that CNNs significantly outperform the traditional featurebased methods with respect to generalization performance across domains due to: (i) their capacity to mitigate the error propagation from the preprocessing modules for features, and (ii) the use of word embeddings to induce a more general representation for trigger candidates."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-24",
"text": "We believe that this is also the first research on domain adaptation using CNNs."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-27",
"text": "We formalize the event detection problem as a multi-class classification problem."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-28",
"text": "Given a sentence, for every token in that sentence, we want to predict if the current token is an event trigger: i.e, does it express some event in the pre-defined event set or not (Li et al., 2013b) ?"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-29",
"text": "The current token along with its context in the sentence constitute an event trigger candidate or an example in multiclass classification terms."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-30",
"text": "In order to prepare for the CNNs, we limit the context to a fixed window size by trimming longer sentences and padding shorter sentences with a special token when necessary."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-31",
"text": "Let 2w + 1 be the fixed window size, and x = [x \u2212w , x \u2212w+1 , . . . , x 0 , . . . , x w\u22121 , x w ] be some trigger candidate where the current token is positioned in the middle of the window (token x 0 )."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-32",
"text": "Before entering the CNNs, each token x i is transformed into a real-valued vector by looking up the following embedding tables to capture different characteristics of the token:"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-33",
"text": "-Word Embedding Table ( initialized by some pre-trained word embeddings): to capture the hidden semantic and syntactic properties of the tokens (Collobert and Weston, 2008; Turian et al., 2010) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-34",
"text": "-Position Embedding Table: to embed the relative distance i of the token x i to the current token x 0 ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-35",
"text": "In practice, we initialize this table randomly."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-36",
"text": "-Entity Type Embedding Table: If we further know the entity mentions and their entity types 2 in the sentence, we can also capture this information for each token by looking up the entity type embedding table (initialized randomly) using the entity type associated with each token."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-37",
"text": "We employ the BIO annotation scheme to assign entity type labels to each token in the trigger candidate using the heads of the entity mentions."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-38",
"text": "For each token x i , the vectors obtained from the three look-ups above are concatenated into a single vector x i to represent the token."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-39",
"text": "As a result, the original event trigger x is transformed into a matrix x = [x \u2212w , x \u2212w+1 , . . . , x 0 , . . . , x w\u22121 , x w ] of size m t \u00d7 (2w + 1) (m t is the dimensionality of the concatenated vectors of the tokens)."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-40",
"text": "The matrix representation x is then passed through a convolution layer, a max pooling layer and a softmax at the end to perform classification (like (Kim, 2014; Kalchbrenner et al., 2014) )."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-41",
"text": "In the convolution layer, we have a set of feature maps (filters) {f 1 , f 2 , . . . , f n } for the convolution operation."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-42",
"text": "Each feature map f i corresponds to some window size k and can be essentially seen as a weight matrix of size m t \u00d7 k. Figure 1 illustrates the proposed CNN."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-43",
"text": "The gradients are computed using backpropagation; regularization is implemented by a dropout (Kim, 2014; Hinton et al., 2012) , and training is done via stochastic gradient descent with shuffled mini-batches and the AdaDelta update rule (Zeiler, 2012; Kim, 2014) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-44",
"text": "During the training, we also optimize the weights of the three embedding tables at the same time to reach an effective state (Kim, 2014) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-46",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-48",
"text": "**DATASET, HYPERPARAMETERS AND RESOURCES**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-49",
"text": "As the benefit of multiple window sizes in the convolution layer has been demonstrated in the previous work on sentence modeling (Kalchbrenner et al., 2014; Kim, 2014) , in the experiments below, we use window sizes in the set {2, 3, 4, 5} to generate feature maps."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-50",
"text": "We utilize 150 feature maps for each window size in this set."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-51",
"text": "The window size for triggers is set to 31 while the dimensionality of the position embeddings and entity type embeddings is 50 3 .We inherit the values for the other parameters from Kim (2014) , i.e, the dropout rate \u03c1 = 0.5, the mini-batch size = 50, the hyperparameter for the l 2 norms = 3."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-52",
"text": "Finally, we employ the pre-trained word embeddings word2vec with 300 dimensions from Mikolov et al. (2013) for initialization."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-53",
"text": "We evaluate the presented CNN over the ACE 2005 corpus."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-54",
"text": "For comparison purposes, we utilize the same test set with 40 newswire articles (672 sentences), the same development set with 30 other documents (836 sentences) and the same training set with the remaning 529 documents (14,849 sentences) as the previous studies on this dataset (Ji and Grishman, 2008; Liao and Grishman, 2010; Li et al., 2013b) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-55",
"text": "The ACE 2005 corpus has 33 event subtypes that, along with one class \"None\" for the non-trigger tokens, constitutes a 34-class classification problem."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-56",
"text": "In order to evaluate the effectiveness of the position embeddings and the entity type embeddings, Table 1 reports the performance of the proposed CNN on the development set when these embeddings are either included or excluded from the systems."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-57",
"text": "With the large margins of performance, it is very clear from the table that the position embeddings are crucial while the entity embeddings are also very useful for CNNs on ED."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-58",
"text": "For the experiments below, we examine the CNNs in two scenarios: excluding the entity type embeddings (CNN1) and including the entity type embeddings (CNN2)."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-59",
"text": "We always use position embeddings in these two scenarios."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-61",
"text": "**PERFORMANCE COMPARISON**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-62",
"text": "The state-of-the-art systems for event detection on the ACE 2005 dataset have followed the traditional feature-based approach with rich hand-designed feature sets, and statistical classifiers such as MaxEnt and perceptron for structured prediction in a joint architecture (Hong et al., 2011; Li et al., 2013b) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-63",
"text": "In this section, we compare the proposed CNNs with these state-of-the-art systems on the blind test set."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-64",
"text": "Table 2 presents the overall performance of the systems with gold-standard entity mention and type information 4 ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-65",
"text": "As we can see from the table, considering the systems that only use sentence level information, CNN1 significantly outperforms the MaxEnt classifier as well as the joint beam search with local features from Li et al. (2013b) (an improvement of 1.6% in F1 score), and performs comparably with the joint beam search approach using both local and global features (Li et al., 2013b) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-66",
"text": "This is remarkable since CNN1 does not require any external features 5 , in contrast to the other featurebased systems that extensively rely on such external features to perform well."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-67",
"text": "More interestingly, when the entity type information is incorporated into CNN1, we obtain CNN2 that still only needs sentence level information but achieves the stateof-the-art performance for this task (an improvement of 1.5% over the best system with only sentence level information (Li et al., 2013b) )."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-68",
"text": "Except for CNN1, all the systems reported in Table 2 employ the gold-standard (perfect) entities mentions and types from manual annotation which might not be available in reality."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-69",
"text": "Table 3 compares the performance of CNN1 and the feature-based systems in a more realistic setting, where entity mentions and types are acquired from an automatic high-performing name tagger and information extraction system (Li et al., 2013b) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-70",
"text": "Note that CNN1 is eligible for this comparison as it does not utilize any external features, thus avoiding usage of the name tagger and the information extraction system to identify entity mentions and types."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-72",
"text": "**DOMAIN ADAPTATION EXPERIMENT**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-73",
"text": "In this section, we aim to further compare the proposed CNNs with the feature-based systems under the domain adaptation setting for event detection."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-74",
"text": "The ultimate goal of domain adaptation research is to develop techniques taking training 5 External features are the features generated from the supervised NLP modules and manual resources such as parsers, name tagger, entity mention extractors (either automatic or manual), gazetteers etc."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-75",
"text": "Methods F Sentence level in Ji and Grishman (2008) 59.7 MaxEnt with local features in Li et al. (2013b) 64.7 Joint beam search with local features in Li et al. (2013b) 63.7"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-76",
"text": "Joint beam search with local and global features in Li et al. (2013b) 65.6 CNN1: CNN without any external features 67.6 data in some source domain and learning models that can work well on target domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-77",
"text": "The target domains are supposed to be so dissimilar from the source domain that the learning techniques would suffer from a significant performance loss when trained on the source domain and applied to the target domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-78",
"text": "To make it clear, we address the unsupervised DA problem in this section, i.e no training data in the target domains (Blitzer et al., 2006; Plank and Moschitti, 2013) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-79",
"text": "The fundamental reason for the performance loss of the featurebased systems on the target domains is twofold: (i) The behavioral changes of features across domains: As domains differ, some features might be informative in the source domain but become less relevant in the target domains and vice versa."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-80",
"text": "(ii) The propagated errors of the pre-processing toolkits for lower-level tasks (POS tagging, name tagging, parsing etc) to extract features: These pre-processing toolkits are also known to degrade when shifted to target domains (Blitzer et al., 2006; Daum\u00e9 III, 2007; McClosky et al., 2010) , introducing noisy features into the systems for higher-level tasks in the target domains and eventually impairing the performance of these higherlevel systems on the target domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-81",
"text": "For ED, we postulate that CNNs are more useful than the feature-based approach for DA for two reasons."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-82",
"text": "First, rather than relying on the symbolic and concrete forms (i.e words, types etc) to construct features as the traditional feature-based systems (Ji and Grishman, 2008; Li et al., 2013b) do, CNNs automatically induce their features from word embeddings, the general distributed representation of words that is shared across domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-83",
"text": "This helps CNNs mitigate the lexical sparsity, learn more general and effective feature representation for trigger candidates, and thus bridge the gap between domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-84",
"text": "Second, as CNNs minimize the reliance on the supervised pre-processing toolkits for features, they can alleviate the error Table 4 : In-domain (first column) and Out-of-domain Performance (columns two to four)."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-85",
"text": "Cells marked with \u2020designate CNN models that significantly outperform (p < 0.05) all the reported feature-based methods on the specified domain."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-86",
"text": "propagation and be more robust to domain shifts."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-88",
"text": "**DATASET**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-89",
"text": "We also do the experiments in this part over the ACE 2005 dataset but focus more on the difference between domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-90",
"text": "The ACE 2005 corpus comes with 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and webblogs (wl)."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-91",
"text": "Following the common practice of domain adaptation research on this dataset (Plank and Moschitti, 2013; Nguyen and Grishman, 2014) , we use news (the union of bn and nw) as the source domain and bc, cts, wl as three different target domains."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-92",
"text": "We take half of bc as the development set and use the remaining data for testing."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-93",
"text": "We note that the distribution of event subtypes and the vocabularies of the source and target domains are quite different (Plank and Moschitti, 2013) ."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-95",
"text": "The main conclusions from the table include: (i) The baseline systems MaxEnt, Joint+Local, Joint+Local+Global achieve high performance on the source domain, but degrade dramatically on the target domains due to the domain shifts."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-96",
"text": "(ii) Comparing CNN1 and the baseline systems, we see that CNN1 performs comparably with the baseline systems on the source domain (in-domain performance) (as expected), substantially outperform the baseline systems on two of the three target domains (i.e, bc and cts), and is only less effective than the joint beam search approach on the wl domain; (iii) Finally and most importantly, we consistently achieve the best adaptation performance across all the target domains with CNN2 by only introducing entity type information into CNN1."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-97",
"text": "In fact, CNN2 significantly outperforms the feature-based systems with p < 0.05 and large margins of about 5.0% on the domains bc and cts, clearly confirming our argument in Section 3.3 and testifying to the benefits of CNNs on DA for ED."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-99",
"text": "**DOMAIN ADAPTATION RESULTS**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-101",
"text": "**CONCLUSION**"
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-102",
"text": "We present a CNN for event detection that automatically learns effective feature representations from pre-trained word embeddings, position embeddings as well as entity type embeddings and reduces the error propagation."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-103",
"text": "We conducted experiments to compare the proposed CNN with the state-of-the-art feature-based systems in both the general setting and the domain adaptation setting."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-104",
"text": "The experimental results demonstrate the effectiveness as well as the robustness across domains of the CNN."
},
{
"sent_id": "586a0f40a9299ef2753d2b0575eff8-C001-105",
"text": "In the future, our plans include: (i) to explore the joint approaches for event extraction with CNNs; (ii) and to investigate other neural network architectures for information extraction."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"586a0f40a9299ef2753d2b0575eff8-C001-12",
"586a0f40a9299ef2753d2b0575eff8-C001-13"
],
[
"586a0f40a9299ef2753d2b0575eff8-C001-27",
"586a0f40a9299ef2753d2b0575eff8-C001-28"
],
[
"586a0f40a9299ef2753d2b0575eff8-C001-62"
]
],
"cite_sentences": [
"586a0f40a9299ef2753d2b0575eff8-C001-13",
"586a0f40a9299ef2753d2b0575eff8-C001-28",
"586a0f40a9299ef2753d2b0575eff8-C001-62"
]
},
"@MOT@": {
"gold_contexts": [
[
"586a0f40a9299ef2753d2b0575eff8-C001-15",
"586a0f40a9299ef2753d2b0575eff8-C001-16",
"586a0f40a9299ef2753d2b0575eff8-C001-17"
]
],
"cite_sentences": [
"586a0f40a9299ef2753d2b0575eff8-C001-15"
]
},
"@USE@": {
"gold_contexts": [
[
"586a0f40a9299ef2753d2b0575eff8-C001-27",
"586a0f40a9299ef2753d2b0575eff8-C001-28"
],
[
"586a0f40a9299ef2753d2b0575eff8-C001-54"
],
[
"586a0f40a9299ef2753d2b0575eff8-C001-69"
]
],
"cite_sentences": [
"586a0f40a9299ef2753d2b0575eff8-C001-28",
"586a0f40a9299ef2753d2b0575eff8-C001-54",
"586a0f40a9299ef2753d2b0575eff8-C001-69"
]
},
"@DIF@": {
"gold_contexts": [
[
"586a0f40a9299ef2753d2b0575eff8-C001-65"
],
[
"586a0f40a9299ef2753d2b0575eff8-C001-67"
],
[
"586a0f40a9299ef2753d2b0575eff8-C001-82"
]
],
"cite_sentences": [
"586a0f40a9299ef2753d2b0575eff8-C001-65",
"586a0f40a9299ef2753d2b0575eff8-C001-67",
"586a0f40a9299ef2753d2b0575eff8-C001-82"
]
}
}
},
"ABC_2b10893f03b4f5eaac0fe06b4d6115_7": {
"x": [
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-135",
"text": "Thus, in the resulting dataset negative examples are overrepresented."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-136",
"text": "As Table 2 shows, the candidate extraction method did not cover all manually annotated VPCs in the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-137",
"text": "Hence, we treated the omitted LVCs as false negatives in our evaluation."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-138",
"text": "As a baseline, we applied a context-free dictionary lookup method."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-2",
"text": "Verb-particle combinations (VPCs) consist of a verbal and a preposition/particle component, which often have some additional meaning compared to the meaning of their parts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-3",
"text": "If a data-driven morphological parser or a syntactic parser is trained on a dataset annotated with extra information for VPCs, they will be able to identify VPCs in raw texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-4",
"text": "In this paper, we examine how syntactic parsers perform on this task and we introduce VPCTagger, a machine learning-based tool that is able to identify English VPCs in context."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-5",
"text": "Our method consists of two steps: it first selects VPC candidates on the basis of syntactic information and then selects genuine VPCs among them by exploiting new features like semantic and contextual ones."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-6",
"text": "Based on our results, we see that VPCTagger outperforms state-of-the-art methods in the VPC detection task."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-9",
"text": "Verb-particle constructions (VPCs) are a subclass of multiword expressions (MWEs) that contain more than one meaningful tokens but the whole unit exhibits syntactic, semantic or pragmatic idiosyncracies (Sag et al., 2002) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-10",
"text": "VPCs consist of a verb and a preposition/particle (like hand in or go out) and they are very characteristic of the English language."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-11",
"text": "The particle modifies the meaning of the verb: it may add aspectual information, may refer to motion or location or may totally change the meaning of the expression."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-12",
"text": "Thus, the meaning of VPCs can be compositional, i.e. it can be computed on the basis of the meaning of the verb and the particle (go out) or it can be idiomatic; i.e. a combination of the given verb and particle results in a(n unexpected) new meaning (do in \"kill\")."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-13",
"text": "Moreover, as their syntactic surface structure is very similar to verb -prepositional phrase combinations, it is not straightforward to determine whether a given verb + preposition/particle combination functions as a VPC or not and contextual information plays a very important role here."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-14",
"text": "For instance, compare the following examples: The hitman did in the president and What he did in the garden was unbelievable."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-15",
"text": "Both sentences contain the sequence did in, but it is only in the first sentence where it functions as a VPC and in the second case, it is a simple verbprepositional phrase combination."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-16",
"text": "For these reasons, VPCs are of great interest for natural language processing applications like machine translation or information extraction, where it is necessary to grab the meaning of the text."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-17",
"text": "The special relation of the verb and particle within a VPC is often distinctively marked at several annotation layers in treebanks."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-18",
"text": "For instance, in the Penn Treebank, the particle is assigned a specific part of speech tag (RP) and it also has a specific syntactic label (PRT) (Marcus et al., 1993) , see also Figure 1 ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-19",
"text": "This entails that if a datadriven morphological parser or a syntactic parser is trained on a dataset annotated with extra information for VPCs, it will be able to assign these kind of tags as well."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-20",
"text": "In other words, the morphological/syntactic parser itself will be able to identify VPCs in texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-21",
"text": "In this paper, we seek to identify VPCs on the basis of syntactic information."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-22",
"text": "We first examine how syntactic parsers perform on Wiki50 , a dataset manually annotated for different types of MWEs, including VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-23",
"text": "We then present our syntax-based tool called VPCTagger to identify VPCs, which consists of two steps: first, we select VPC candidates (i.e. verbpreposition/particle pairs) from the text and then we apply a machine learning-based technique to classify them as genuine VPCs or not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-24",
"text": "This method is based on a rich feature set with new features like semantic or contextual features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-25",
"text": "(Figure 1 example: The hitman did in the president.)"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-27",
"text": "We compare the performance of the parsers with that of our approach and we discuss the reasons for any possible differences."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-29",
"text": "**RELATED WORK**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-30",
"text": "Recently, some studies have attempted to identify VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-31",
"text": "For instance, Baldwin and Villavicencio (2002) detected verb-particle constructions in raw texts with the help of information based on POS-tagging and chunking, and they also made use of frequency and lexical information in their classifier."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-32",
"text": "Kim and Baldwin (2006) built their system on semantic information when deciding whether verb-preposition pairs were verb-particle constructions or not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-33",
"text": "Nagy T. and Vincze (2011) implemented a rule-based system based on morphological features to detect VPCs in raw texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-34",
"text": "The (non-)compositionality of verb-particle combinations has also raised interest among researchers."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-35",
"text": "McCarthy et al. (2003) implemented a method to determine the compositionality of VPCs and Baldwin (2005) presented a dataset in which non-compositional VPCs could be found."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-36",
"text": "Villavicencio (2003) proposed some methods to extend the coverage of available VPC resources."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-37",
"text": "Tu and Roth (2012) distinguished genuine VPCs and verb-preposition combinations in context."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-38",
"text": "They built a crowdsourced corpus of VPC candidates in context, where each candidate was manually classified as a VPC or not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-39",
"text": "However, during corpus building, they applied lexical restrictions and concentrated only on VPCs formed with six verbs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-40",
"text": "Their SVM-based algorithm used syntactic and lexical features to classify VPCs candidates and they concluded that their system achieved good results on idiomatic VPCs, but the classification of more compositional VPCs is more challenging."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-41",
"text": "Since in this paper we focus on syntax-based VPC identification more precisely, we also identify VPCs with syntactic parsers, it seems necessary to mention studies that experimented with parsers for identifying different types of MWEs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-42",
"text": "For instance, constituency parsing models were employed in identifying contiguous MWEs in French and Arabic (Green et al., 2013) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-43",
"text": "Their method relied on a syntactic treebank, an MWE list and a morphological analyzer."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-44",
"text": "Vincze et al. (2013) employed a dependency parser for identifying light verb constructions in Hungarian texts as a \"side effect\" of parsing sentences and report state-of-the-art results for this task."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-45",
"text": "Here, we make use of parsers trained on the Penn Treebank (which contains annotation for VPCs) and we evaluate their performance on the Wiki50 corpus, which was manually annotated for VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-46",
"text": "Thus, we first examine how well these parsers identify VPCs (i.e. assigning VPC-specific syntactic labels) and then we present how VPCTagger can carry out this task."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-47",
"text": "First, we select VPC candidates from raw text and then, we classify them as genuine VPCs or not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-49",
"text": "**VERB-PARTICLE CONSTRUCTIONS IN ENGLISH**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-50",
"text": "As mentioned earlier, verb-particle constructions consist of a verb and a particle."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-51",
"text": "Similar constructions are present in several languages, although there might be different grammatical or orthographic norms for such verbs in those languages."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-52",
"text": "For instance, in German and in Hungarian, the particle usually precedes the verb and they are spelt as one word, e.g. aufmachen (up.make) \"to open\" in German or kinyitni (out.open) \"to open\" in Hungarian."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-53",
"text": "On the other hand, languages like Swedish, Norwegian, Icelandic and Italian follow the same pattern as English; namely, the verb precedes the particle and they are spelt as two words (Masini, 2005) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-54",
"text": "These two typological classes require different approaches if we would like identify VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-55",
"text": "For the first group, morphology-based solutions can be implemented that can identify the internal structure of compound words."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-56",
"text": "For the second group, syntax-based methods can also be successful, which take into account the syntactic relation between the verb and the particle."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-57",
"text": "Many of the VPCs are formed with a motion verb and a particle denoting directions (like go out, come in etc.) and their meaning reflects this: they denote a motion or location."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-58",
"text": "The meaning of VPCs belonging to this group is usually trans-parent and thus they can be easily learnt by second language learners."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-59",
"text": "In other cases, the particle adds some aspectual information to the meaning of the verb: eat up means \"to consume totally\" or burn out means \"to reach a state where someone becomes exhausted\"."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-60",
"text": "These VPCs still have a compositional meaning, but the particle has a nondirectional function here, but rather an aspectual one (cf. Jackendoff (2002))."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-61",
"text": "Yet other VPCs have completely idiomatic meanings like do up \"repair\" or do in \"kill\"."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-62",
"text": "In the latter cases, the meaning of the construction cannot be computed from the meaning of the parts, hence they are problematic for both language learners and NLP applications."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-63",
"text": "Tu and Roth (2012) distinguish between two sets of VPCs in their database: the more compositional and the more idiomatic ones."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-64",
"text": "Differentiating between compositional and idiomatic VPCs has an apt linguistic background as well (see above) and it may be exploited in some NLP applications like machine translation (parts of compositional VPCs may be directly translated while idiomatic VPCs should be treated as one unit)."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-65",
"text": "However, when grouping their data, Tu and Roth just consider frequency data and treat one VPC as one lexical entry."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-66",
"text": "This approach is somewhat problematic as many VPCs in their dataset are highly ambiguous and thus may have more meanings (like get at, which can mean \"criticise\", \"mean\", \"get access\", \"threaten\") and some of them may be compositional, while others are not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-67",
"text": "Hence, clustering all these meanings and classifying them as either compositional or idiomatic may be misleading."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-68",
"text": "Instead, VPC and non-VPC uses of one specific verb-particle combination could be truly distinguished on the basis of frequency data, or, on the other hand, a word sense disambiguation approach may give an account of the compositional or idiomatic uses of the specific unit."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-69",
"text": "In our experiments, we use the Wiki50 corpus, in which VPCs are annotated in raw text, but no semantic classes are further distinguished."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-70",
"text": "Hence, our goal here is not the automatic semantic classification of VPCs because we believe that first the identification of VPCs in context should be solved and then in a further step, genuine VPCs might be classified as compositional or idiomatic, given a manually annotated dataset from which this kind of information may be learnt."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-71",
"text": "This issue will be addressed in a future study."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-73",
"text": "**VPC DETECTION**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-74",
"text": "Our goal is to identify each individual VPC in running texts; i.e. to take individual inputs like How did they get on yesterday? and mark each VPC in the sentence."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-75",
"text": "Our tool called VPCTagger is based on a two-step approach."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-76",
"text": "First, we syntactically parse each sentence, and extract potential VPCs with a syntax-based candidate extraction method."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-77",
"text": "Afterwards, a binary classification can be used to automatically classify potential VPCs as VPCs or not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-78",
"text": "For the automatic classification of candidate VPCs, we implemented a machine learning approach, which is based on a rich feature set with new features like semantic and contextual features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-79",
"text": "Figure 2 outlines the process used to identify each individual VPC in a running text."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-81",
"text": "**CORPORA**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-82",
"text": "To evaluate of our methods, we made use of two corpora."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-83",
"text": "Statistical data on the corpora can be seen in Table 1 ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-84",
"text": "First, we used Wiki50 , in which several types of multiword expressions (including VPCs) and Named Entities were marked."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-85",
"text": "This corpus consists of 50 Wikipedia pages, and contains 466 occurrences of VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-86",
"text": "In order to compare the performance of our system with others, we also used the dataset of Tu and Roth (2012) , which contains 1,348 sentences taken from different parts of the British National Corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-87",
"text": "However, they only focused on VPCs in this dataset, where 65% of the sentences contain a phrasal verb and 35% contain a simplex verbpreposition combination."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-88",
"text": "As Table 1 indicates, the Tu&Roth dataset only focused on 23 different VPCs, but 342 unique VPCs were annotated in the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-90",
"text": "**CANDIDATE EXTRACTION**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-91",
"text": "In this section, we concentrate on the first step of our approach, namely how VPC candidates can be selected from texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-92",
"text": "As we mentioned in Section 1, our hypothesis is that the automatic detection of VPCs can be basically carried out by dependency parsers."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-93",
"text": "Thus, we examined the performance of two parsers on VPC-specific syntactic labels."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-94",
"text": "As we had a full-coverage VPC annotated corpus where each individual occurrence of a VPC was manually marked, we were able to examine the characteristics of VPCs in a running text and evaluate the effectiveness of the parsers on this task."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-95",
"text": "Therefore, here we examine dependency relations among the manually annotated gold standard VPCs, provided by the Stanford parser (Klein and Manning, 2003) and the Bohnet parser (Bohnet, 2010) for the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-96",
"text": "In order to compare the efficiency of the parsers, both were applied using the same dependency represen Therefore, we extended our candidate extraction method, where besides the verb-particle dependency relation, the preposition and adverbial modifier syntactic relations were also investigated among verbs and particles."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-97",
"text": "With this modification, 70.24% and 96.42% of VPCs in the Wiki50 corpus could be identified."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-98",
"text": "In this phase, we found that the Bohnet parser was more successful on the Wiki50 corpus, i.e. it could cover more VPCs, hence we applied the Bohnet parser in our further experiments."
},
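The candidate extraction step above can be sketched as a simple filter over dependency edges: keep verb-particle pairs linked by a VPC-typical relation. This is a minimal illustrative sketch; the parse encoding, POS filter, and function names are assumptions for demonstration, not the paper's actual implementation.

```python
# Keep verb-particle pairs connected by the prt, prep or advmod
# dependency relations (the relations investigated above).
VPC_RELATIONS = {"prt", "prep", "advmod"}
PARTICLE_POS = {"RP", "IN", "RB"}  # illustrative particle-like POS tags

def extract_candidates(tokens, edges):
    """tokens: list of (word, POS); edges: list of (head_idx, label, dep_idx)."""
    candidates = []
    for head, label, dep in edges:
        if (label in VPC_RELATIONS
                and tokens[head][1].startswith("VB")   # head must be a verb
                and tokens[dep][1] in PARTICLE_POS):   # dependent looks like a particle
            candidates.append((tokens[head][0], tokens[dep][0]))
    return candidates

# "How did they get on yesterday?" -- 'get on' should surface as a candidate.
tokens = [("How", "WRB"), ("did", "VBD"), ("they", "PRP"),
          ("get", "VB"), ("on", "RP"), ("yesterday", "NN")]
edges = [(3, "advmod", 0), (3, "aux", 1), (3, "nsubj", 2),
         (3, "prt", 4), (3, "tmod", 5)]
print(extract_candidates(tokens, edges))  # [('get', 'on')]
```

Note that such a filter deliberately over-generates (e.g. prepositional complements also match), which is why the second, classification step is needed.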
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-99",
"text": "Some researchers filtered LVC candidates by selecting only certain verbs that may be part of the construction."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-100",
"text": "One example is Tu and Roth (2012) , where the authors examined a verbparticle combination only if the verbal components were formed with one of the previously given six verbs (i.e. make, take, have, give, do, get)."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-101",
"text": "Since Wiki50 was annotated for all VPC occurrences, we were able to check what percentage of VPCs could be covered if we applied this selection."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-102",
"text": "As Table 3 shows, the six verbs used by Tu and Roth (2012) are responsible for only 50 VPCs on the Wiki50 corpus, so it covers only 11.16% of all gold standard VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-103",
"text": "Table 4 lists the most frequent VPCs and the verbal components on the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-104",
"text": "As can be seen, the top 10 VPCs are responsible for only 17.41% of the VPC occurrences, while the top 10 verbal components are responsible for 41.07% of the VPC occurrences in the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-105",
"text": "Furthermore, 127 different verbal component occurred in Wiki50, but the verbs have and do -which are used by Tu and Roth (2012) -do not appear in the corpus as verbal component of VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-106",
"text": "All this indicates that applying lexical restrictions and focusing on a reduced set of verbs will lead to the exclusion of a considerable number of VPCs occurring in free texts and so, real-world tasks would hardly profit from them."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-108",
"text": "**MACHINE LEARNING BASED CANDIDATE CLASSICATION**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-109",
"text": "In order to perform an automatic classification of the candidate VPCs, a machine learning-based approach was implemented, which will be elaborated upon below."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-110",
"text": "This method is based on a rich feature set with the following categories: orthographic, lexical, syntactic, and semantic."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-111",
"text": "Moreover, as VPCs are highly ambiguous in raw texts, contextual features are also required."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-112",
"text": "\u2022 Orthographic features: Here, we examined whether the candidate consists of two or more tokens."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-113",
"text": "Moreover, if the particle component started with 'a', which prefix, in many cases, etymologically denotes a movement (like across and away), it was also noted and applied as a feature."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-114",
"text": "\u2022 Lexical features: We exploited the fact that the most common verbs occur most frequently in VPCs, so we selected fifteen verbs from the most frequent English verbs 1 ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-115",
"text": "Here, we examined whether the lemmatised verbal component of the candidate was one of these fifteen verbs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-116",
"text": "We also examined whether the particle component of the potential VPC occurred among the common English particles."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-117",
"text": "Here, we apply a manually built particle list based on linguistic considerations."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-118",
"text": "Moreover, we also checked whether a potential VPC is contained in the list of typical English VPCs collected by Baldwin (2008) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-119",
"text": "1 http://en.wikipedia.org/wiki/Most common words in English"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-120",
"text": "\u2022 Syntactic features: the dependency label between the verb and the particle can also be exploited in identifying LVCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-121",
"text": "As we typically found when dependency parsing the corpus, the syntactic relation between the verb and the particle in a VPC is prt, prep or advmod -applying the Stanford parser dependency representation, hence these syntactic relations were defined as features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-122",
"text": "If the candidate's object was a personal pronoun, it was also encoded as another syntactic feature."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-123",
"text": "\u2022 Semantic features: These features were based on the fact that the meaning of VPCs may typically reflect a motion or location like go on or take away."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-124",
"text": "First, we examine that the verbal component is a motion verb like go or turn, or the particle indicates a direction like out or away."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-125",
"text": "Moreover, the semantic type of the prepositional object, object and subject in the sentence can also help to decide whether the candidate is a VPC or not."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-126",
"text": "Consequently, the person, activity, animal, artifact and concept semantic senses were looked for among the upper level hyperonyms of the nominal head of the prepositional object, object and subject in Princeton WordNet 3.1 2 ."
},
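The four feature categories above can be illustrated by encoding a single candidate as a feature dictionary. The small word lists below are tiny stand-ins for the resources the text mentions (the frequent-verb list, the particle list, motion verbs, directional particles); they are illustrative assumptions, not the actual resources used.

```python
FREQUENT_VERBS = {"go", "take", "get", "come", "make"}    # lexical stand-in
PARTICLES = {"up", "out", "on", "off", "in", "away"}      # lexical stand-in
MOTION_VERBS = {"go", "come", "turn", "run"}              # semantic stand-in
DIRECTIONAL = {"out", "in", "up", "down", "away"}         # semantic stand-in

def featurize(verb_lemma, particle, dep_label, n_tokens, obj_is_pronoun):
    """Encode one verb-particle candidate as a feature dictionary."""
    return {
        "multi_token": n_tokens >= 2,                      # orthographic
        "a_prefix_particle": particle.startswith("a"),     # orthographic
        "frequent_verb": verb_lemma in FREQUENT_VERBS,     # lexical
        "known_particle": particle in PARTICLES,           # lexical
        "dep_" + dep_label: True,                          # syntactic
        "pronoun_object": obj_is_pronoun,                  # syntactic/contextual
        "motion_verb": verb_lemma in MOTION_VERBS,         # semantic
        "directional_particle": particle in DIRECTIONAL,   # semantic
    }

feats = featurize("go", "out", "prt", 2, False)
print(feats["motion_verb"], feats["directional_particle"])  # True True
```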
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-127",
"text": "When several different machine learning algorithms were experimented on this feature set, the preliminary results showed that decision trees performed the best on this task."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-128",
"text": "This is probably due to the fact that our feature set consists of a few compact (i.e. high-level) features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-129",
"text": "The J48 classifier of the WEKA package (Hall et al., 2009) was trained with its default settings on the abovementioned feature set, which implements the C4.5 (Quinlan, 1993) decision tree algorithm."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-130",
"text": "Moreover, Support Vector Machines (SVM) (Cortes and Vapnik, 1995) results are also reported to compare the performance of our methods with that of Tu and Roth (2012) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-131",
"text": "As the investigated corpora were not sufficiently large for splitting them into training and test sets of appropriate size, we evaluated our models in a cross validation manner on the Wiki50 corpus and the Tu&Roth dataset."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-132",
"text": "As Tu and Roth (2012) presented only the accuracy scores on the Tu & Roth dataset, we also employed an accuracy score as an evaluation metric on this dataset, where positive and negative examples were also marked."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-133",
"text": "But, in the case of Wiki50 corpus, where only the positive VPCs were manually annotated, the F \u03b2=1 score was employed and interpreted on the positive class as an evaluation metric."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-134",
"text": "Moreover, all potential VPCs were treated as negative that were extracted by the candidate extraction method but were not marked as positive in the gold standard."
},
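The evaluation setup just described, with the F-score computed on the positive class and every extracted-but-unannotated candidate counted as a negative, can be sketched in a few lines. The labels below are made-up illustrations, not the paper's data.

```python
def positive_f1(gold, pred):
    """F1 on the positive (VPC) class, given parallel 0/1 label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [1, 1, 0, 1, 0, 0]   # 1 = annotated VPC, 0 = extracted non-VPC
pred = [1, 0, 0, 1, 1, 0]   # hypothetical classifier output
print(round(positive_f1(gold, pred), 3))  # 0.667
```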
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-139",
"text": "In this case, we applied the same VPC list that was described among the lexical features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-140",
"text": "Then we marked candidates of the syntax-based method as VPC if the candidate VPC was found in the list."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-141",
"text": "We also compared our results with the rule-based results available for Wiki50 and also with the 5-fold cross validation results of Tu and Roth (2012) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-142",
"text": "Table 5 lists the results obtained using the baseline dictionary lookup, rule-based method, dependency parsers and machine learning approaches on the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-143",
"text": "It is revealed that the dictionary lookup method performed worst and achieved an F-score of 35.43."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-144",
"text": "Moreover, this method only achieved a precision score of 49.77%."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-145",
"text": "However, the rule-based method achieved the highest precision score with 91.26%, but the dependency parsers also got high precision scores of about 90% on Wiki50."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-146",
"text": "It is also clear that the machine learning-based approach, the VPCTagger, is the most successful method on Wiki50: it achieved an F-score 10 points higher than those for the rule-based method and dependency parsers and more than 45 points higher than that for the dictionary lookup."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-148",
"text": "**RESULTS**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-149",
"text": "In order to compare the performance of our system with others, we evaluated it on the Tu&Roth dataset (Tu and Roth, 2012) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-150",
"text": "Table 5 : Results obtained in terms of precision, recall and F-score."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-151",
"text": "over, it also lists the results of Tu and Roth (2012) and the VPCTagger evaluated in the 5-fold cross validation manner, as Tu and Roth (2012) applied this evaluation schema."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-152",
"text": "As in the Tu&Roth dataset positive and negative examples were also marked, we were able to use accuracy as evaluation metric besides the F \u03b2=1 scores."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-153",
"text": "It is revealed that the dictionary lookup and the rule-based method achieved an F-score of about 50, but our method seems the most successful on this dataset, as it can yield an accuracy 3.32% higher than that for the Tu&Roth system."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-154",
"text": "Table 6 : 5-fold cross validation results on the Tu&Roth dataset in terms of accuracy and F-score."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-156",
"text": "**DISCUSSION**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-157",
"text": "The applied machine learning-based method extensively outperformed our dictionary lookup and rule-based baseline methods, which underlines the fact that our approach can be suitably applied to VPC detection in raw texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-158",
"text": "It is well demonstrated that VPCs are very ambiguous in raw text, as the dictionary lookup method only achieved a precision score of 49.77% on the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-159",
"text": "This demonstrates that the automatic detection of VPCs is a challenging task and contextual features are essential."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-160",
"text": "In the case of the dictionary lookup, to achieve a higher recall score was mainly limited by the size of the dictionary used."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-161",
"text": "As Table 5 shows, VPCTagger achieved an Fscore 10% higher than those for the dependency parsers, which may refer to the fact that our machine learning-based approach performed well on this task."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-162",
"text": "This method proved to be the most balanced as it got roughly the same recall, precision and F-score results on the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-163",
"text": "In addition, the dependency parsers achieve high precision with lower recall scores."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-164",
"text": "Moreover, the results obtained with our machine learning approach on the Tu&Roth dataset outperformed those reported in Tu and Roth (2012) ."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-165",
"text": "This may be attributed to the inclusion of a rich feature set with new features like semantic and contextual features that were used in our system."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-166",
"text": "As Table 6 indicates, the dictionary lookup and rule-based methods were less effective when applied on the Tu&Roth dataset."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-167",
"text": "Since the corpus was created by collecting sentences that contained phrasal verbs with specific verbs, this dataset contains a lot of negative and ambiguous examples besides annotated VPCs, hence the distribution of VPCs in the Tu&Roth dataset is not comparable to those in Wiki50, where each occurrence of a VPCs were manually annotated in a running text."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-168",
"text": "Moreover, in this dataset, only one positive or negative example was annotated in each sentence, and they examined just the verb-particle pairs formed with the six verbs as a potential VPC."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-169",
"text": "However, the corpus probably contains other VPCs which were not annotated."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-170",
"text": "For example, in the sentence The agency takes on any kind of job -you just name the subject and give us some indication of the kind of thing you want to know, and then we go out and get it for you., the only phrase takes on was listed as a positive example in the Tu&Roth dataset. But two examples, (go out -positive and get it for -negative) were not marked."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-171",
"text": "This is problematic if we would like to evaluate our candidate extractor on this dataset as it would identify all these phrases, even if it is restricted to verbparticle pairs containing one of the six verbs mentioned above, thus yielding false positives already in the candidate extraction phase."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-172",
"text": "In addition, this dataset contains 878 positive VPC occurrences, but only 23 different VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-173",
"text": "Consequently, some positive examples were overrepresented."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-174",
"text": "But the Wiki50 corpus may contain some rare examples and it probably reflects a more realistic distribution as it contains 342 unique VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-175",
"text": "A striking difference between the Tu & Roth database and Wiki50 is that while Tu and Roth (2012) included the verbs do and have in their data, they do not occur at all among the VPCs collected from Wiki50."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-176",
"text": "Moreover, these verbs are just responsible for 25 positive VPCs examples in the Tu & Roth dataset."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-177",
"text": "Although these verbs are very frequent in language use, they do not seem to occur among the most frequent verbal components concerning VPCs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-178",
"text": "A possible reason for this might be that VPCs usually contain a verb referring to movement in its original sense and neither have nor do belong to motion verbs."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-179",
"text": "An ablation analysis was carried out to examine the effectiveness of each individual feature types of the machine learning based candidate classification."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-180",
"text": "Besides the feature classification described in Section 4.3, we also examined the effectiveness of the contextual features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-181",
"text": "In this case, the feature which examined whether the candidates object was a personal pronoun or not and the semantic type of the prepositional object, object and subject were treated as contextual features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-182",
"text": "Table 7 shows the usefulness of each individual feature type on the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-183",
"text": "For each feature type, a J48 classifier was trained with all of the features except that one."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-184",
"text": "Then we compared the performance to that got with all the features."
},
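The leave-one-out ablation scheme just described can be sketched generically: retrain once per feature type with that type removed and report the drop against the full model. Here train_and_eval is a hypothetical stand-in for the actual J48 training and evaluation pipeline, and the dummy scorer's numbers are made up for demonstration, not the paper's results.

```python
FEATURE_TYPES = ["orthographic", "lexical", "syntactic",
                 "semantic", "contextual"]

def ablation(train_and_eval):
    """Map each feature type to the F-score drop when it is left out."""
    full = train_and_eval(FEATURE_TYPES)
    return {ft: round(full - train_and_eval([f for f in FEATURE_TYPES
                                             if f != ft]), 2)
            for ft in FEATURE_TYPES}

# Dummy scorer: pretend each feature type adds a fixed amount of F-score.
contrib = {"orthographic": 3.0, "lexical": 5.0, "syntactic": 1.5,
           "semantic": 2.0, "contextual": 0.5}
score = lambda feats: 70.0 + sum(contrib[f] for f in feats)
print(ablation(score))
```

With a real pipeline plugged in for the scorer, this reproduces the "all features except one" comparison reported in Table 7.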
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-185",
"text": "As the ablation analysis shows, each type of feature contributed to the overall performance."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-186",
"text": "We found that the lexical and orthographic features were the most powerful, the semantic, syntactic features were also useful; while contextual features were less effective, but were still exploited by the model."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-187",
"text": "Table 7 : The usefulness of individual features in terms of precision, recall and F-score using the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-188",
"text": "The most important features in our system are lexical ones, namely, the lists of the most frequent English verbs and particles."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-189",
"text": "It is probably due to the fact that the set of verbs used in VPCs is rather limited, furthermore, particles form a closed word class that is, they can be fully listed, hence the par-ticle component of a VPC will necessarily come from a well-defined set of words."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-190",
"text": "Besides the ablation analysis, we also investigated the decision tree model produced by our experiments."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-191",
"text": "The model profited most from the syntactic and lexical features, i.e. the dependency label provided by the parsers between the verb and the particle also played an important role in the classification process."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-192",
"text": "We carried out a manual error analysis in order to find the most typical errors our system made."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-193",
"text": "Most errors could be traced back to POS-tagging or parsing errors, where the particle was classified as a preposition."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-194",
"text": "VPCs that include an adverb (as labeled by the POS tagger and the parser) were also somewhat more difficult to identify, like come across or go back."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-195",
"text": "Preposition stranding (in e.g. relative clauses) also resulted in false positives like in planets he had an adventure on."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-196",
"text": "Other types of multiword expressions were also responsible for errors."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-197",
"text": "For instance, the system classified come out as a VPC within the idiom come out of the closet but the gold standard annotation in Wiki50 just labeled the phrase as an idiom and no internal structure for it was marked."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-198",
"text": "A similar error could be found for light verb constructions, for example, run for office was marked as a VPC in the data, but run for was classified as a VPC, yielding a false positive case."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-199",
"text": "Multiword prepositions like up to also led to problems: in he taught up to 1986, taught up was erroneously labeled as VPC."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-200",
"text": "Finally, in some cases, annotation errors in the gold standard data were the source of mislabeled candidates."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-202",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-203",
"text": "In this paper, we focused on the automatic detection of verb-particle combinations in raw texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-204",
"text": "Our hypothesis was that parsers trained on texts annotated with extra information for VPCs can identify VPCs in texts."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-205",
"text": "We introduced our machine learning-based tool called VPCTagger, which allowed us to automatically detect VPCs in context."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-206",
"text": "We solved the problem in a two-step approach."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-207",
"text": "In the first step, we extracted potential VPCs from a running text with a syntaxbased candidate extraction method and we applied a machine learning-based approach that made use of a rich feature set to classify extracted syntactic phrases in the second step."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-208",
"text": "In order to achieve a greater efficiency, we defined several new features like semantic and contextual, but according to our ablation analysis we found that each type of features contributed to the overall performance."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-209",
"text": "Moreover, we also examined how syntactic parsers performed in the VPC detection task on the Wiki50 corpus."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-210",
"text": "Furthermore, we compared our methods with others when we evaluated our approach on the Tu&Roth dataset."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-211",
"text": "Our method yielded better results than those got using the dependency parsers on the Wiki50 corpus and the method reported in (Tu and Roth, 2012) on the Tu&Roth dataset."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-212",
"text": "Here, we also showed how dependency parsers performed on identifying VPCs, and our results indicate that although the dependency label provided by the parsers is an essential feature in determining whether a specific VPC candidate is a genuine VPC or not, the results can be further improved by extending the system with additional features like lexical and semantic features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-213",
"text": "Thus, one possible application of the VPCTagger may be to help dependency parsers: based on the output of VPCTagger, syntactic labels provided by the parsers can be overwritten."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-214",
"text": "With backtracking, the accuracy of syntactic parsers may increase, which can be useful for a number of higher-level NLP applications that exploit syntactic information."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-215",
"text": "In the future, we would like to improve our system by defining more complex contextual features."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-216",
"text": "We also plan to examine how the VPCTagger improve the performance of higher level NLP applications like machine translation systems, and we would also like to investigate the systematic differences among the performances of the parsers and VPCTagger, in order to improve the accuracy of parsing."
},
{
"sent_id": "2b10893f03b4f5eaac0fe06b4d6115-C001-217",
"text": "In addition, we would like to compare different automatic detection methods of multiword expressions, as different types of MWEs are manually annotated in the Wiki50 corpus."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-86"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-130"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-141"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-149"
]
],
"cite_sentences": [
"2b10893f03b4f5eaac0fe06b4d6115-C001-86",
"2b10893f03b4f5eaac0fe06b4d6115-C001-130",
"2b10893f03b4f5eaac0fe06b4d6115-C001-141",
"2b10893f03b4f5eaac0fe06b4d6115-C001-149"
]
},
"@BACK@": {
"gold_contexts": [
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-100"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-102"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-105"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-175"
]
],
"cite_sentences": [
"2b10893f03b4f5eaac0fe06b4d6115-C001-100",
"2b10893f03b4f5eaac0fe06b4d6115-C001-102",
"2b10893f03b4f5eaac0fe06b4d6115-C001-105",
"2b10893f03b4f5eaac0fe06b4d6115-C001-175"
]
},
"@SIM@": {
"gold_contexts": [
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-132"
]
],
"cite_sentences": [
"2b10893f03b4f5eaac0fe06b4d6115-C001-132"
]
},
"@DIF@": {
"gold_contexts": [
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-164"
],
[
"2b10893f03b4f5eaac0fe06b4d6115-C001-211"
]
],
"cite_sentences": [
"2b10893f03b4f5eaac0fe06b4d6115-C001-164",
"2b10893f03b4f5eaac0fe06b4d6115-C001-211"
]
}
}
},
"ABC_c4a9b122e8f1b9e98197743c94fea2_7": {
"x": [
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-80",
"text": "6 Note that |Ys| = O(2 |As| ) in general."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-2",
"text": "We present fast, accurate, direct nonprojective dependency parsers with thirdorder features."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-3",
"text": "Our approach uses AD 3 , an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-4",
"text": "Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-ofthe-art accuracies for the largest datasets (English, Czech, and German)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-7",
"text": "Dependency parsing has become a prominent approach to syntax in the last few years, with increasingly fast and accurate models being devised (K\u00fcbler et al., 2009; Huang and Sagae, 2010; Zhang and Nivre, 2011; Rush and Petrov, 2012) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-8",
"text": "In projective parsing, the arcs in the dependency tree are constrained to be nested, and the problem of finding the best tree can be addressed with dynamic programming."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-9",
"text": "This results in cubic-time decoders for arc-factored and sibling second-order models (Eisner, 1996; , and quartic-time for grandparent models (Carreras, 2007) and third-order models ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-10",
"text": "Recently, Rush and Petrov (2012) trained third-order parsers with vine pruning cascades, achieving runtimes only a small factor slower than first-order systems."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-11",
"text": "Third-order features have also been included in transition systems (Zhang and Nivre, 2011) and graph-based parsers with cube-pruning (Zhang and McDonald, 2012) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-12",
"text": "Unfortunately, non-projective dependency parsers (appropriate for languages with a more flexible word order, such as Czech, Dutch, and German) lag behind these recent advances."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-13",
"text": "The main obstacle is that non-projective parsing is NP-hard beyond arc-factored models (McDonald and Satta, 2007) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-14",
"text": "Approximate parsers have therefore been introduced, based on belief propagation (Smith and Eisner, 2008) , dual decomposition , or multi-commodity flows (Martins et al., 2009 (Martins et al., , 2011 ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-15",
"text": "These are all instances of turbo parsers, as shown by Martins et al. (2010) : the underlying approximations come from the fact that they run global inference in factor graphs ignoring loop effects."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-16",
"text": "While this line of research has led to accuracy gains, none of these parsers use third-order contexts, and their speeds are well behind those of projective parsers."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-17",
"text": "This paper bridges the gap above by presenting the following contributions:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-18",
"text": "\u2022 We apply the third-order feature models of to non-projective parsing."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-19",
"text": "\u2022 This extension is non-trivial since exact dynamic programming is not applicable."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-20",
"text": "Instead, we adapt AD 3 , the dual decomposition algorithm proposed by Martins et al. (2011) , to handle third-order features, by introducing specialized head automata."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-21",
"text": "\u2022 We make our parser substantially faster than the many-components approach of Martins et al. (2011) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-22",
"text": "While AD 3 requires solving quadratic subproblems as an intermediate step, recent results (Martins et al., 2012) show that they can be addressed with the same oracles used in the subgradient method ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-23",
"text": "This enables AD 3 to exploit combinatorial subproblems like the the head automata above."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-24",
"text": "Along with this paper, we provide a free distribution of our parsers, including training code."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-25",
"text": "1"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-27",
"text": "**DEPENDENCY PARSING WITH AD**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-29",
"text": "**3**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-30",
"text": "Dual decomposition is a class of optimization techniques that tackle the dual of combinatorial Figure 1 : Parts considered in this paper."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-31",
"text": "Firstorder models factor over arcs (Eisner, 1996; McDonald et al., 2005) , and second-order models include also consecutive siblings and grandparents (Carreras, 2007) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-32",
"text": "Our parsers add also arbitrary siblings (not necessarily consecutive) and head bigrams, as in Martins et al. (2011) , in addition to third-order features for grand-and tri-siblings ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-33",
"text": "problems in a modular and extensible manner (Komodakis et al., 2007; ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-34",
"text": "In this paper, we employ alternating directions dual decomposition (AD 3 ; Martins et al., 2011) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-35",
"text": "Like the subgradient algorithm of , AD 3 splits the original problem into local subproblems, and seeks an agreement on the overlapping variables."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-36",
"text": "The difference is that the AD 3 subproblems have an additional quadratic term to accelerate consensus."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-37",
"text": "Recent analysis (Martins et al., 2012 ) has shown that: (i) AD 3 converges at a faster rate, 2 and (ii) the quadratic subproblems can be solved using the same combinatorial machinery that is used in the subgradient algorithm."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-38",
"text": "This opens the door for larger subproblems (such as the combination of trees and head automata in instead of a many-components approach (Martins et al., 2011) , while still enjoying faster convergence."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-39",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-40",
"text": "**OUR SETUP**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-41",
"text": "Given a sentence with L words, to which we prepend a root symbol $, let A := { h, m | h \u2208 {0, . ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-42",
"text": ". , L}, m \u2208 {1, . . . , L}, h = m} be the set of possible dependency arcs."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-43",
"text": "We parameterize a dependency tree via an indicator vector u := u a a\u2208A , where u a is 1 if the arc a is in the tree, and 0 otherwise, and we denote by Y \u2286 R |A| the set of such vectors that are indicators of well-2 Concretely, AD 3 needs O(1/ ) iterations to converge to a -accurate solution, while subgradient needs O(1/ 2 )."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-44",
"text": "formed trees."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-45",
"text": "Let {A s } S s=1 be a cover of A, where each A s \u2286 A. We assume that the score of a parse tree u \u2208 Y decomposes as f (u) := S s=1 f s (z s ), where each z s := z s,a a\u2208As is a \"partial view\" of u, and each local score function f s comes from a feature-based linear model."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-46",
"text": "Past work in dependency parsing considered either (i) a few \"large\" components, such as trees and head automata (Smith and Eisner, 2008; , or (ii) many \"small\" components, coming from a multi-commodity flow formulation (Martins et al., 2009 (Martins et al., , 2011 )."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-47",
"text": "Let Y s \u2286 R |As| denote the set of feasible realizations of z s , i.e., those that are partial views of an actual parse tree."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-48",
"text": "A tuple of views z 1 , . . . , z S \u2208 S s=1 Y s is said to be globally consistent if z s,a = z s ,a holds for every a, s and s such that a \u2208 A s \u2229A s ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-49",
"text": "We assume each parse u \u2208 Y corresponds uniquely to a globally consistent tuple of views, and vice-versa."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-50",
"text": "Following Martins et al. (2011) , the problem of obtaining the best-scored tree can be written as follows:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-51",
"text": "where the equality constraint ensures that the partial views \"glue\" together to form a coherent parse tree."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-52",
"text": "3"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-54",
"text": "**DUAL DECOMPOSITION AND AD 3**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-55",
"text": "Dual decomposition methods dualize out the equality constraint in Eq. 1 by introducing Lagrange multipliers \u03bb s,a ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-106",
"text": "Head bigrams can be captured with a simple sequence model:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-56",
"text": "In doing so, they solve a relaxation where the combinatorial sets Y s are replaced by their convex hulls Z s := conv(Y s )."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-57",
"text": "4 All that is necessary is the following assumption: Assumption 1 (Local-Max Oracle)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-58",
"text": "Every s \u2208 {1, . . . , S} has an oracle that solves efficiently any instance of the following subproblem:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-59",
"text": "Typically, Assumption 1 is met whenever the maximization of f s over Y s is tractable, since the objective in Eq. 2 just adds a linear function to f s ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-60",
"text": "3 Note that any tuple z1, . . . , zS \u2208 S s=1 Ys satisfying the equality constraints will be globally consistent; this fact, due the assumptions above, will imply u \u2208 Y."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-61",
"text": "4 Let \u2206 |Ys| := {\u03b1 \u2208 R |Ys| | \u03b1 \u2265 0, y s \u2208Ys \u03b1y s = 1} be the probability simplex."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-62",
"text": "The convex hull of Ys is the set conv(Ys) := { y s \u2208Ys \u03b1y s y s | \u03b1 \u2208 \u2206 |Ys| }."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-63",
"text": "Its members represent marginal probabilities over the arcs in As."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-64",
"text": "The AD 3 algorithm (Martins et al., 2011) alternates among the following iterative updates:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-65",
"text": "\u2022 z-updates, which decouple over s = 1, . . . , S, and solve a penalized version of Eq. 2:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-66",
"text": "Above, \u03c1 is a constant and the quadratic term penalizes deviations from the current global solution (stored in u (t) )."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-67",
"text": "5 We will see (Prop. 2) that this problem can be solved iteratively using only the Local-Max Oracle (Eq. 2)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-68",
"text": "\u2022 u-updates, a simple averaging operation:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-69",
"text": "\u2022 \u03bb-updates, where the Lagrange multipliers are adjusted to penalize disagreements:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-70",
"text": "In sum, the only difference between AD 3 and the subgradient method is in the z-updates, which in AD 3 require solving a quadratic problem."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-71",
"text": "While closed-form solutions have been developed for some specialized components (Martins et al., 2011) , this problem is in general more difficult than the one arising in the subgradient algorithm."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-72",
"text": "However, the following result, proved in Martins et al. (2012) , allows to expand the scope of AD 3 to any problem which satisfies Assumption 1. Proposition 2."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-73",
"text": "The problem in Eq. 3 admits a solution z * s which is spanned by a sparse basis W \u2286 Y s with cardinality at most |W| \u2264 O(|A s |)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-74",
"text": "In other words, there is a distribution \u03b1 with support in W such that z * s = y s \u2208W \u03b1 y s y s ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-75",
"text": "6 Prop."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-76",
"text": "2 has motivated an active set algorithm (Martins et al., 2012) that maintains an estimate of W by iteratively adding and removing elements computed through the oracle in Eq. 2."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-77",
"text": "7 Typically, very few iterations are necessary and great speedups are achieved by warm-starting W with the active set computed in the previous AD 3 iteration."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-78",
"text": "This has a huge impact in practice and is crucial to obtain the fast runtimes in \u00a74 (see Fig. 2 )."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-79",
"text": "5 In our experiments ( \u00a74), we set \u03c1 = 0.05."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-81",
"text": "What Prop. 2 tells us is that the solution of Eq. 3 can be represented as a distribution over Ys with a very sparse support."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-82",
"text": "7 The algorithm is a specialization of Nocedal and Wright (1999) , \u00a716.4, which effectively exploits the sparse representation of z * s ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-83",
"text": "For details, see Martins et al. (2012) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-85",
"text": "**SOLVING THE SUBPROBLEMS**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-86",
"text": "We next describe the actual components used in our third-order parsers."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-87",
"text": "Tree component."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-88",
"text": "We use an arc-factored score function (McDonald et al., 2005) :"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-89",
"text": ", where \u03c0(m) is the parent of the mth word according to the parse tree z, and \u03c3 ARC (h, m) is the score of an individual arc."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-90",
"text": "The parse tree that maximizes this function can be found in time O(L 3 ) via the Chu-Liu-Edmonds' algorithm (Chu and Liu, 1965; Edmonds, 1967) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-91",
"text": "8 Grand-sibling head automata."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-92",
"text": "Let A in h and A out h denote respectively the sets of incoming and outgoing candidate arcs for the hth word, where the latter subdivides into arcs pointing to the right, A out h,\u2192 , and to the left, A out h,\u2190 ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-93",
"text": "Define the sets A GSIB h,\u2192 = A in h \u222aA out h,\u2192 and A GSIB h,\u2190 = A in h \u222aA out h,\u2190 ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-94",
"text": "We describe right-side grand-sibling head automata; their left-side counterparts are analogous."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-95",
"text": "For each head word h in the parse tree z, define g := \u03c0(h), and let m 0 , m 1 , . . . , m p+1 be the sequence of right modifiers of h, with m 0 = START and m p+1 = END."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-96",
"text": "Then, we have the following grand-sibling component:"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-97",
"text": "where we use the shorthand z| B to denote the subvector of z indexed by the arcs in B \u2286 A. Note that this score function absorbs grandparent and consecutive sibling scores, in addition to the grand-sibling scores."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-98",
"text": "9 For each h, f GSIB h,\u2192 can be 8 In fact, there is an asymptotically faster O(L 2 ) algorithm (Tarjan, 1977) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-99",
"text": "Moreover, if the set of possible arcs is reduced to a subset B \u2286 A (via pruning), then the fastest known algorithm (Gabow et al., 1986)"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-100",
"text": "9 used an identical automaton for their second-order model, but leaving out the grand-sibling scores."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-101",
"text": "No pruning Tri-sibling head automata."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-102",
"text": "In addition, we define left and right-side tri-sibling head automata that remember the previous two modifiers of a head word."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-103",
"text": "This corresponds to the following component function (for the right-side case):"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-104",
"text": "Again, each of these functions can be maximized"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-105",
"text": "Sequential head bigram model."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-107",
"text": "Each score \u03c3 HB (m, h, h ) is obtained via features that look at the heads of consecutive words (as in Martins et al. (2011) )."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-108",
"text": "This function can be maximized in time O(L 3 ) with the Viterbi algorithm."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-109",
"text": "Arbitrary siblings."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-110",
"text": "We handle arbitrary siblings as in Martins et al. (2011) , defining O(L 3 ) component functions of the form f ASIB h,m,s (z h,m , z h,s ) = \u03c3 ASIB (h, m, s)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-111",
"text": "In this case, the quadratic problem in Eq. 3 can be solved directly in constant time."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-112",
"text": "Tab."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-113",
"text": "1 details the time complexities of each subproblem."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-114",
"text": "Without pruning, each iteration of AD 3 has O(L 4 ) runtime."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-115",
"text": "With a simple strategy that limits the number of candidate heads per word to a constant K, this drops to cubic time."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-116",
"text": "10 Further speed-ups are possible with more pruning: by limiting the number of possible modifiers to a constant J, the runtime would reduce to O(L log L)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-117",
"text": "10 In our experiments, we employed this strategy with K = 10, by pruning with a first-order probabilistic model."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-118",
"text": "Following , for each word m, we also pruned away incoming arcs h, m with posterior probability less than 0.0001 times the probability of the most likely head."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-120",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-121",
"text": "We first evaluated our non-projective parser in a projective English dataset, to see how its speed and accuracy compares with recent projective parsers, which can take advantage of dynamic programming."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-122",
"text": "To this end, we converted the Penn Treebank to dependencies through (i) the head rules of Yamada and Matsumoto (2003) (PTB-YM) and (ii) basic dependencies from the Stanford parser 2.0.5 (PTB-S)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-123",
"text": "11 We trained by running 10 epochs of cost-augmented MIRA (Crammer et al., 2006) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-124",
"text": "To ensure valid parse trees at test time, we rounded fractional solutions as in Martins et al. (2009 )-yet, solutions were integral \u2248 95% of the time."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-125",
"text": "Tab. 2 shows the results in the dev-set (top block) and in the test-set (two bottom blocks)."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-126",
"text": "In the dev-set, we see consistent gains when more expressive features are added, the best accuracies being achieved with the full third-order model; this comes at the cost of a 6-fold drop in runtime compared with a first-order model."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-127",
"text": "By looking at the two bottom blocks, we observe that our parser has slightly better accuracies than recent projective parsers, with comparable speed levels (with the exception of the highly optimized vine cascade approach of Rush and Petrov, 2012 Martins et al. (2010 Martins et al. ( , 2011 , , Rush and Petrov (2012) , Zhang and McDonald (2012) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-128",
"text": "The last two are shown separately in the rightmost columns."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-129",
"text": "In our second experiment (Tab. 3), we used 14 datasets, most of which are non-projective, from the CoNLL 2006 and 2008 shared tasks (Buchholz and Marsi, 2006; Surdeanu et al., 2008) ."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-130",
"text": "Our third-order model achieved the best reported scores for English, Czech, German, and Dutchwhich includes the three largest datasets and the ones with the most non-projective dependenciesand is on par with the state of the art for the remaining languages."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-131",
"text": "To our knowledge, the speeds are the highest reported among higherorder non-projective parsers, and only about 3-4 times slower than the vine parser of Rush and Petrov (2012) , which has lower accuracies."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-133",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-134",
"text": "We presented new third-order non-projective parsers which are both fast and accurate."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-135",
"text": "We decoded with AD 3 , an accelerated dual decomposition algorithm which we adapted to handle large components, including specialized head automata for the third-order features, and a sequence model for head bigrams."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-136",
"text": "Results are above the state of the art for large datasets and non-projective languages."
},
{
"sent_id": "c4a9b122e8f1b9e98197743c94fea2-C001-137",
"text": "In the hope that other researchers may find our implementation useful or are willing to contribute with further improvements, we made our parsers publicly available as open source software."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"c4a9b122e8f1b9e98197743c94fea2-C001-14"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-37",
"c4a9b122e8f1b9e98197743c94fea2-C001-38"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-46"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-64",
"c4a9b122e8f1b9e98197743c94fea2-C001-65"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-71"
]
],
"cite_sentences": [
"c4a9b122e8f1b9e98197743c94fea2-C001-14",
"c4a9b122e8f1b9e98197743c94fea2-C001-38",
"c4a9b122e8f1b9e98197743c94fea2-C001-46",
"c4a9b122e8f1b9e98197743c94fea2-C001-64",
"c4a9b122e8f1b9e98197743c94fea2-C001-71"
]
},
"@USE@": {
"gold_contexts": [
[
"c4a9b122e8f1b9e98197743c94fea2-C001-20"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-32"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-34"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-50",
"c4a9b122e8f1b9e98197743c94fea2-C001-51"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-107"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-110"
]
],
"cite_sentences": [
"c4a9b122e8f1b9e98197743c94fea2-C001-20",
"c4a9b122e8f1b9e98197743c94fea2-C001-32",
"c4a9b122e8f1b9e98197743c94fea2-C001-34",
"c4a9b122e8f1b9e98197743c94fea2-C001-50",
"c4a9b122e8f1b9e98197743c94fea2-C001-107",
"c4a9b122e8f1b9e98197743c94fea2-C001-110"
]
},
"@DIF@": {
"gold_contexts": [
[
"c4a9b122e8f1b9e98197743c94fea2-C001-21"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-37",
"c4a9b122e8f1b9e98197743c94fea2-C001-38"
]
],
"cite_sentences": [
"c4a9b122e8f1b9e98197743c94fea2-C001-21",
"c4a9b122e8f1b9e98197743c94fea2-C001-38"
]
},
"@MOT@": {
"gold_contexts": [
[
"c4a9b122e8f1b9e98197743c94fea2-C001-37",
"c4a9b122e8f1b9e98197743c94fea2-C001-38"
],
[
"c4a9b122e8f1b9e98197743c94fea2-C001-71"
]
],
"cite_sentences": [
"c4a9b122e8f1b9e98197743c94fea2-C001-38",
"c4a9b122e8f1b9e98197743c94fea2-C001-71"
]
},
"@EXT@": {
"gold_contexts": [
[
"c4a9b122e8f1b9e98197743c94fea2-C001-71",
"c4a9b122e8f1b9e98197743c94fea2-C001-72"
]
],
"cite_sentences": [
"c4a9b122e8f1b9e98197743c94fea2-C001-71"
]
},
"@SIM@": {
"gold_contexts": [
[
"c4a9b122e8f1b9e98197743c94fea2-C001-127"
]
],
"cite_sentences": [
"c4a9b122e8f1b9e98197743c94fea2-C001-127"
]
}
}
},
"ABC_07b062d569749924fa6ee1b2223411_7": {
"x": [
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-72",
"text": "**PERCEPTRON TRAINING**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-73",
"text": "The parsing problem is to find a mapping from a set of sentences x \u2208 X to a set of parses y \u2208 Y ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-2",
"text": "This paper investigates perceptron training for a wide-coverage CCG parser and compares the perceptron with a log-linear model."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-3",
"text": "The CCG parser uses a phrase-structure parsing model and dynamic programming in the form of the Viterbi algorithm to find the highest scoring derivation."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-4",
"text": "The difficulty in using the perceptron for a phrase-structure parsing model is the need for an efficient decoder."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-5",
"text": "We exploit the lexicalized nature of CCG by using a finite-state supertagger to do much of the parsing work, resulting in a highly efficient decoder."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-6",
"text": "The perceptron performs as well as the log-linear model; it trains in a few hours on a single machine; and it requires only a few hundred MB of RAM for practical training compared to 20 GB for the log-linear model."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-7",
"text": "We also investigate the order in which the training examples are presented to the online perceptron learner, and find that order does not significantly affect the results."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-10",
"text": "A recent development in data-driven parsing is the use of discriminative training methods (Riezler et al., 2002; Taskar et al., 2004; Collins and Roark, 2004; Turian and Melamed, 2006) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-11",
"text": "One popular approach is to use a log-linear parsing model and maximise the conditional likelihood function (Johnson et al., 1999; Riezler et al., 2002; Clark and Curran, 2004b; Malouf and van Noord, 2004; Miyao and Tsujii, 2005) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-12",
"text": "Maximising the likelihood involves calculating feature expectations, which is computationally expensive."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-13",
"text": "Dynamic programming (DP) in the form of the inside-outside algorithm can be used to calculate the expectations, if the features are sufficiently local (Miyao and Tsujii, 2002) ; however, the memory requirements can be prohibitive, especially for automatically extracted, wide-coverage grammars."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-14",
"text": "In Clark and Curran (2004b) we use cluster computing resources to solve this problem."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-15",
"text": "Parsing research has also begun to adopt discriminative methods from the Machine Learning literature, such as the perceptron (Freund and Schapire, 1999; Collins and Roark, 2004) and the largemargin methods underlying Support Vector Machines (Taskar et al., 2004; McDonald, 2006) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-16",
"text": "Parser training involves decoding in an iterative process, updating the model parameters so that the decoder performs better on the training data, according to some training criterion."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-17",
"text": "Hence, for efficient training, these methods require an efficient decoder; in fact, for methods like the perceptron, the update procedure is so trivial that the training algorithm essentially is decoding."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-18",
"text": "This paper describes a decoder for a lexicalizedgrammar parser which is efficient enough for practical discriminative training."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-19",
"text": "We use a lexicalized phrase-structure parser, the CCG parser of Clark and Curran (2004b) , together with a DP-based decoder."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-20",
"text": "The key idea is to exploit the properties of lexicalized grammars by using a finite-state supertagger prior to parsing (Bangalore and Joshi, 1999; Clark and Curran, 2004a) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-21",
"text": "The decoder still uses the CKY algorithm, so the worst case complexity of the parsing is unchanged; however, by allowing the supertagger to do much of the parsing work, the efficiency of the decoder is greatly increased in practice."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-22",
"text": "We chose the perceptron for the training algorithm because it has shown good performance on other NLP tasks; in particular, Collins (2002) reported good performance for a perceptron tagger compared to a Maximum Entropy tagger."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-23",
"text": "Like Collins (2002) , the decoder is the same for both the perceptron and the log-linear parsing models; the only change is the method for setting the weights."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-24",
"text": "The perceptron model performs as well as the loglinear model, but is considerably easier to train."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-25",
"text": "Another contribution of this paper is to advance wide-coverage CCG parsing."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-26",
"text": "Previous discriminative models for CCG (Clark and Curran, 2004b) required cluster computing resources to train."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-27",
"text": "In this paper we reduce the memory requirements from 20 GB of RAM to only a few hundred MB, but without greatly increasing the training time or reducing parsing accuracy."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-28",
"text": "This provides state-of-the-art CCG parsing with a practical development environment."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-29",
"text": "More generally, this work provides a practical environment for experimenting with discriminative models for phrase-structure parsing; because the training time for the CCG parser is relatively short (a few hours), experiments such as comparing alternative feature sets can be performed."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-30",
"text": "As an example, we investigate the order in which the training examples are presented to the perceptron learner."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-31",
"text": "Since the perceptron training is an online algorithm -updating the weights one training sentence at a time -the order in which the data is processed affects the resulting model."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-32",
"text": "We consider random ordering; presenting the shortest sentences first; and presenting the longest sentences first; and find that the order does not significantly affect the final results."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-33",
"text": "We also use the random orderings to investigate model averaging."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-34",
"text": "We produced 10 different models, by randomly permuting the data, and averaged the weights."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-35",
"text": "Again the averaging was found to have no impact on the results, showing that the perceptron learner -at least for this parsing task -is robust to the order of the training examples."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-36",
"text": "The contributions of this paper are as follows."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-37",
"text": "First, we compare perceptron and log-linear parsing models for a wide-coverage phrase-structure parser, the first work we are aware of to do so."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-38",
"text": "Second, we provide a practical framework for developing discriminative models for CCG, reducing the memory requirements from over 20 GB to a few hundred MB. And third, given the significantly shorter training time compared to other discriminative parsing models (Taskar et al., 2004) , we provide a practical framework for investigating discriminative training methods more generally."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-39",
"text": "2 The CCG Parser Clark and Curran (2004b) describes the CCG parser."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-40",
"text": "The grammar used by the parser is extracted from CCGbank, a CCG version of the Penn Treebank (Hockenmaier, 2003) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-41",
"text": "The grammar consists of 425 lexical categories, expressing subcategorisation information, plus a small number of combinatory rules which combine the categories (Steedman, 2000) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-42",
"text": "A Maximum Entropy supertagger first assigns lexical categories to the words in a sentence, which are then combined by the parser using the combinatory rules and the CKY algorithm."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-43",
"text": "A log-linear model scores the alternative parses."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-44",
"text": "We use the normalform model, which assigns probabilities to single derivations based on the normal-form derivations in CCGbank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-45",
"text": "The features in the model are defined over local parts of the derivation and include wordword dependencies."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-46",
"text": "A packed chart representation allows efficient decoding, with the Viterbi algorithm finding the most probable derivation."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-47",
"text": "The supertagger is a key part of the system."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-48",
"text": "It uses a log-linear model to define a distribution over the lexical category set for each word and the previous two categories (Ratnaparkhi, 1996) and the forward backward algorithm efficiently sums over all histories to give a distibution for each word."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-49",
"text": "These distributions are then used to assign a set of lexical categories to each word (Curran et al., 2006) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-50",
"text": "Supertagging was first defined for LTAG (Bangalore and Joshi, 1999) , and was designed to increase parsing speed for lexicalized grammars by allowing a finite-state tagger to do some of the parsing work."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-51",
"text": "Since the elementary syntactic units in a lexicalized grammar -in LTAG's case elementary trees and in CCG's case lexical categories -contain a significant amount of grammatical information, combining them together is easier than the parsing typically performed by phrase-structure parsers."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-52",
"text": "Hence Bangalore and Joshi (1999) refer to supertagging as almost parsing."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-53",
"text": "Supertagging has been especially successful for CCG: Clark and Curran (2004a) demonstrates the considerable increases in speed that can be obtained through use of a supertagger."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-54",
"text": "The supertagger interacts with the parser in an adaptive fashion."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-55",
"text": "Initially the supertagger assigns a small number of categories, on average, to each word in the sentence, and the parser attempts to create a spanning analysis."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-56",
"text": "If this is not possible, the supertagger assigns more categories, and this process continues until a spanning analysis is found."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-57",
"text": "The number of categories assigned to each word is determined by a parameter \u03b2 in the supertagger: all categories are assigned whose forward-backward probabilities are within \u03b2 of the highest probability category (Curran et al., 2006) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-58",
"text": "Clark and Curran (2004a) also shows how the supertagger can reduce the size of the packed charts to allow discriminative log-linear training."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-59",
"text": "However, even with the use of a supertagger, the packed charts for the complete CCGbank require over 20 GB of RAM."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-60",
"text": "Reading the training instances into memory one at a time and keeping a record of the relevant feature counts would be too slow for practical development, since the log-linear model requires hundreds of iterations to converge."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-61",
"text": "Hence the packed charts need to be stored in memory."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-62",
"text": "In Clark and Curran (2004b) we use a cluster of 45 machines, together with a parallel implementation of the BFGS training algorithm, to solve this problem."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-63",
"text": "The need for cluster computing resources presents a barrier to the development of further CCG parsing models."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-64",
"text": "Hockenmaier and Steedman (2002) describe a generative model for CCG, which only requires a non-iterative counting process for training, but it is generally acknowledged that discriminative models provide greater flexibility and typically higher performance."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-65",
"text": "In this paper we propose the perceptron algorithm as a solution."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-66",
"text": "The perceptron is an online learning algorithm, and so the parameters are updated one training instance at a time."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-67",
"text": "However, the key difference compared with the loglinear training is that the perceptron converges in many fewer iterations, and so it is practical to read the training instances into memory one at a time."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-68",
"text": "The difficulty in using the perceptron for training phrase-structure parsing models is the need for an efficient decoder (since perceptron training essentially is decoding)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-69",
"text": "Here we exploit the lexicalized nature of CCG by using the supertagger to restrict the size of the charts over which Viterbi decoding is performed, resulting in an extremely effcient decoder."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-70",
"text": "In fact, the decoding is so fast that we can estimate a state-of-the-art discriminative parsing model in only a few hours on a single machine."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-74",
"text": "We assume that the mapping F is represented through a feature vector \u03a6(x, y) \u2208 R d and a parameter vector \u03b1 \u2208 R d in the following way (Collins, 2002) :"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-75",
"text": "where GEN(x) denotes the set of possible parses for sentence x and \u03a6(x, y) \u00b7 \u03b1 = i \u03b1 i \u03a6 i (x, y) is the inner product."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-76",
"text": "The learning task is to set the parameter values (the feature weights) using the training set as evidence, where the training set consists of examples (x i , y i ) for 1 \u2264 i \u2264 N ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-77",
"text": "The decoder is an algorithm which finds the argmax in (1)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-78",
"text": "In this paper, Y is the set of possible CCG derivations and GEN(x) enumerates the set of derivations for sentence x. We use the same feature representation \u03a6(x, y) as in Clark and Curran (2004b) , to allow comparison with the log-linear model."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-79",
"text": "The features are defined in terms of local subtrees in the derivation, consisting of a parent category plus one or two children."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-80",
"text": "Some features are lexicalized, encoding word-word dependencies."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-81",
"text": "Features are integervalued, counting the number of times some configuration occurs in a derivation."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-82",
"text": "GEN(x) is defined by the CCG grammar, plus the supertagger, since the supertagger determines how many lexical categories are assigned to each word in x (through the \u03b2 parameter)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-83",
"text": "Rather than try to recreate the adaptive supertagging described in Section 2 for training, we simply fix the the value of \u03b2 so that GEN(x) is the set of derivations licenced by the grammar for sentence x, given that value."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-84",
"text": "\u03b2 is now a parameter of the training process which we determine experimentally using development data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-85",
"text": "The \u03b2 parameter can be thought of as determining the set of incorrect derivations which the training algorithm uses to \"discriminate against\", with a smaller value of \u03b2 resulting in more derivations."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-87",
"text": "**FEATURE FORESTS**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-88",
"text": "The same decoder is used for both training and testing: the Viterbi algorithm."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-89",
"text": "However, the packed representation of GEN(x) in each case is different."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-90",
"text": "When running the parser, a lot of grammatical information is stored in order to produce linguistically meaningful output."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-91",
"text": "For training, all that is required is a packed representation of the features on each derivation in GEN(x) for each sentence in the training data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-92",
"text": "The feature forests described in Miyao and Tsujii (2002) provide such a representation."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-93",
"text": "Clark and Curran (2004b) describe how a set of CCG derivations can be represented as a feature forest."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-94",
"text": "The feature forests are created by first building packed charts for the training sentences, and then extracting the feature information."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-95",
"text": "Packed charts group together equivalent chart entries."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-96",
"text": "Entries are equivalent when they interact in the same manner with both the generation of subsequent parse structure and the numerical parse selection."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-97",
"text": "In practice, this means that equivalent entries have the same span, and form the same structures and generate the same features in any further parsing of the sentence."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-98",
"text": "Back pointers to the daughters indicate how an individual entry was created, so that any derivation can be recovered from the chart."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-99",
"text": "A feature forest is essentially a packed chart with only the feature information retained (see Miyao and Tsujii (2002) and Clark and Curran (2004b) for the details)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-100",
"text": "Dynamic programming algorithms can be used with the feature forests for efficient estimation."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-101",
"text": "For the log-linear parsing model in Clark and Curran (2004b) , the inside-outside algorithm is used to calculate feature expectations, which are then used by the BFGS algorithm to optimise the likelihood function."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-102",
"text": "For the perceptron, the Viterbi algorithm finds the features corresponding to the highest scoring derivation, which are then used in a simple additive update process."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-104",
"text": "**THE PERCEPTRON ALGORITHM**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-105",
"text": "The training algorithm initializes the parameter vector as all zeros, and updates the vector by decoding the examples."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-106",
"text": "Each feature forest is decoded with the current parameter vector."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-107",
"text": "If the output is incorInputs: training examples (x i , y i ) Initialisation: set \u03b1 = 0 Algorithm: Figure 1 : The perceptron training algorithm rect, the parameter vector is updated by adding the feature vector of the correct derivation and subtracting the feature vector of the decoder output."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-108",
"text": "Training typically involves multiple passes over the data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-109",
"text": "Figure 1 gives the algorithm, where N is the number of training sentences and T is the number of iterations over the data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-110",
"text": "For all the experiments in this paper, we used the averaged version of the perceptron."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-111",
"text": "Collins (2002) introduced the averaged perceptron, as a way of reducing overfitting, and it has been shown to perform better than the non-averaged version on a number of tasks."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-112",
"text": "The averaged parameters are defined as follows:"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-113",
"text": "s /N T where \u03b1 t,i s is the value of the sth feature weight after the tth sentence has been processed in the ith iteration."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-114",
"text": "A naive implementation of the averaged perceptron updates the accumulated weight for each feature after each example."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-115",
"text": "However, the number of features whose values change for each example is a small proportion of the total."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-116",
"text": "Hence we use the algorithm described in Daume III (2006) which avoids unnecessary calculations by only updating the accumulated weight for a feature f s when \u03b1 s changes."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-118",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-119",
"text": "The feature forests were created as follows."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-143",
"text": "Table 1 compares the training for the perceptron and log-linear models."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-120",
"text": "First, the value of the \u03b2 parameter for the supertagger was fixed (for the first set of experiments at 0.004)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-121",
"text": "The supertagger was then run over the sentences in Sections 2-21 of CCGbank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-122",
"text": "We made sure that every word was assigned the correct lexical category among its set (we did not do this for testing)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-123",
"text": "Then the parser was run on the supertagged sentences, using the CKY algorithm and the CCG combinatory rules."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-124",
"text": "We applied the same normal-form restrictions used in Clark and Curran (2004b) : categories can only combine if they have been seen to combine in Sections 2-21 of CCGbank, and only if they do not violate the Eisner (1996a) normal-form constraints."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-125",
"text": "This part of the process requires a few hundred MB of RAM to run the parser, and takes a few hours for Sections 2-21 of CCGbank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-126",
"text": "Any further training times or memory requirements reported do not include the resources needed to create the forests."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-127",
"text": "The feature forests are extracted from the packed chart representation used in the parser."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-128",
"text": "We only use a feature forest for training if it contains the correct derivation (according to CCGbank)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-129",
"text": "Some forests do not have the correct derivation, even though we ensure the correct lexical categories are present, because the grammar used by the parser is missing some low-frequency rules in CCGbank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-130",
"text": "The total number of forests used for the experiments was 35,370 (89% of Sections 2-21) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-131",
"text": "Only features which occur at least twice in the training data were used, of which there are 477,848."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-132",
"text": "The complete set of forests used to obtain the final perceptron results in Section 4.1 require 21 GB of disk space."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-133",
"text": "The perceptron is an online algorithm, updating the weights after each forest is processed."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-134",
"text": "Each forest is read into memory one at a time, decoding is performed, and the weight values are updated."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-135",
"text": "Each forest is discarded from memory after it has been used."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-136",
"text": "Constantly reading forests off disk is expensive, but since the perceptron converges in so few iterations the training times are reasonable."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-137",
"text": "In contrast, log-linear training takes hundreds of iterations to converge, and so it would be impractical to keep reading the forests off disk."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-138",
"text": "Also, since loglinear training uses a batch algorithm, it is more convenient to keep the forests in memory at all times."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-139",
"text": "In Clark and Curran (2004b) we use a cluster of 45 machines, together with a parallel implementation of BFGS, to solve this problem, but need up to 20 GB of RAM."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-140",
"text": "The feature forest representation, and our implementation of it, is so compact that the perceptron training requires only 20 MB of RAM."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-141",
"text": "Since the supertagger has already removed much of the practical parsing complexity, decoding one of the forests is extremely quick, and much of the training time is taken with continually reading the forests off disk."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-142",
"text": "However, the training time for the perceptron is still only around 5 hours for 10 iterations."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-144",
"text": "The perceptron was run for 10 iterations and the log-linear training was run to convergence."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-145",
"text": "The training time for 10 iterations of the perceptron is longer than the log-linear training, although the results in Section 4.1 show that the perceptron typically converges in around 4 iterations."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-146",
"text": "The striking result in the table is the significantly smaller memory requirement for the perceptron."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-147",
"text": "Table 2 gives the first set of results for the averaged perceptron model."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-148",
"text": "These were obtained using Section 00 of CCGbank as development data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-149",
"text": "Goldstandard POS tags from CCGbank were used for all the experiments."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-150",
"text": "The parser provides an analysis for 99.37% of the sentences in Section 00."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-151",
"text": "The F-scores are based only on the sentences for which there is an analysis."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-152",
"text": "Following Clark and Curran (2004b) , accuracy is measured using F-score over the goldstandard predicate-argument dependencies in CCGbank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-153",
"text": "The table shows that the accuracy increases initially with the number of iterations, but converges quickly after only 4 iterations."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-154",
"text": "The accuracy after only one iteration is also surprisingly high."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-155",
"text": "Table 3 compares the accuracy of the perceptron and log-linear models on the development data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-156",
"text": "LP is labelled precision, LR is labelled recall, and CAT is the lexical category accuracy."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-157",
"text": "The same feature forests were used for training the perceptron and log-linear models, and the same parser and decoding algorithm were used for testing, so the results for the two models are directly comparable."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-158",
"text": "The only difference in each case was the weights file used."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-159",
"text": "Table 3 : Comparison of the perceptron and loglinear models on the development data forest creation (with the number of training iterations again optimised on the development data)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-160",
"text": "A smaller \u03b2 value results in larger forests, giving more incorrect derivations for the training algorithm to \"discriminate against\"."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-161",
"text": "Increasing the size of the forests is no problem for the perceptron, since the memory requirements are so modest, but this would cause problems for the log-linear training which is already highly memory intensive."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-162",
"text": "The table shows that increasing the number of incorrect derivations gives a small improvement in performance for the perceptron."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-163",
"text": "Table 4 gives the accuracies for the two models on the test data, Section 23 of CCGbank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-164",
"text": "Here the coverage of the parser is 99.63%, and again the accuracies are computed only for the sentences with an analysis."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-165",
"text": "The figures for the averaged perceptron were obtained using 6 iterations, with \u03b2 = 0.002."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-166",
"text": "The perceptron slightly outperforms the log-linear model (although we have not carried out significance tests)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-167",
"text": "We justify the use of different \u03b2 values for the two models by arguing that the perceptron is much more flexible in terms of the size of the training forests it can handle."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-169",
"text": "**RESULTS**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-170",
"text": "Note that the important result here is that the perceptron model performs at least as well as the log-linear model."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-171",
"text": "Since the perceptron is considerably easier to train, this is a useful finding."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-172",
"text": "Also, since the log-linear parsing model is a Conditional Random Field (CRF), the results suggest that the perceptron should be compared with a CRF for other tasks for which the CRF is considered to give state-of-the-art results."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-174",
"text": "**ORDER OF TRAINING EXAMPLES**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-175",
"text": "As an example of the flexibility of our discriminative training framework, we investigated the order in which the training examples are presented to the online perceptron learner."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-176",
"text": "These experiments were particularly easy to carry out in our framework, since the 21 GB file containing the complete set of training forests can be sampled from directly."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-177",
"text": "We stored the position on disk of each of the forests, and selected the forests one by one, according to some order."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-178",
"text": "The first set of experiments investigated ordering the training examples by sentence length."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-179",
"text": "Buttery (2006) found that a psychologically motivated Categorial Grammar learning system learned faster when the simplest linguistic examples were presented first."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-180",
"text": "Table 5 shows the results both when the shortest sentences are presented first and when the longest sentences are presented first."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-181",
"text": "Training on the longest sentences first gives the better of the two length-based orderings, but is no better than the standard ordering."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-182",
"text": "For the random ordering experiments, forests were randomly sampled from the complete 21 GB training file on disk, without replacement."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-183",
"text": "The new forests file was then used for the averaged-perceptron training, and this process was repeated 9 times."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-184",
"text": "The number of iterations for each training run was optimised in terms of the accuracy of the resulting model on the development data."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-185",
"text": "There was little variation among the models, with the best model scoring 86.84% F-score on the development data and the worst scoring 86.63%."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-186",
"text": "Table 6 shows that the performance of this best model on the test data is only slightly better than the model trained using the CCGbank ordering."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-187",
"text": "Table 6 shows that the averaged model again performs only marginally better than the original model, and not as well as the best-performing \"random\" model, which is perhaps not surprising given the small variation among the performances of the component models."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-188",
"text": "In summary, the perceptron learner appears highly robust to the order of the training examples, at least for this parsing task."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-189",
"text": "Taskar et al. (2004) investigate discriminative training methods for a phrase-structure parser, and also use dynamic programming for the decoder."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-190",
"text": "The key difference between our work and theirs is that they are only able to train on sentences of 15 words or less, because of the expense of the decoding."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-192",
"text": "**COMPARISON WITH OTHER WORK**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-193",
"text": "There is work on discriminative models for dependency parsing (McDonald, 2006) ; since there are efficient decoding algorithms available (Eisner, 1996b) , complete resources such as the Penn Treebank can be used for estimation, leading to accurate parsers."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-194",
"text": "There is also work on discriminative models for parse reranking (Collins and Koo, 2005) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-195",
"text": "The main drawback with this approach is that the correct parse may get lost in the first phase."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-196",
"text": "The existing work most similar to ours is Collins and Roark (2004) ."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-197",
"text": "They use a beam-search decoder as part of a phrase-structure parser to allow practical estimation."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-198",
"text": "The main difference is that we are able to store the complete forests for training, and can guarantee that the forest contains the correct derivation (assuming the grammar is able to generate it given the correct lexical categories)."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-199",
"text": "The downside of our approach is the restriction on the locality of the features, to allow dynamic programming."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-200",
"text": "One possible direction for future work is to compare the search-based approach of Collins and Roark with our DP-based approach."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-201",
"text": "In the tagging domain, Collins (2002) compared log-linear and perceptron training for HMM-style tagging based on dynamic programming."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-202",
"text": "Our work could be seen as extending that of Collins since we compare log-linear and perceptron training for a DP-based wide-coverage parser."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-203",
"text": "----------------------------------"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-204",
"text": "**CONCLUSION**"
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-205",
"text": "Investigation of discriminative training methods is one of the most promising avenues for breaking the current bottleneck in parsing performance."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-206",
"text": "The drawback of these methods is the need for an efficient decoder."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-207",
"text": "In this paper we have demonstrated how the lexicalized nature of CCG can be used to develop a very efficient decoder, which leads to a practical development environment for discriminative training."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-208",
"text": "We have also provided the first comparison of a perceptron and log-linear model for a wide-coverage phrase-structure parser."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-209",
"text": "An advantage of the perceptron over the log-linear model is that it is considerably easier to train, requiring only 1/1000th of the memory and converging in only 4 iterations."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-210",
"text": "Given that the global log-linear model used here (CRF) is thought to provide state-of-the-art performance for many NLP tasks, it is perhaps surprising that the perceptron performs as well."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-211",
"text": "The evaluation in this paper was based solely on CCGbank, but we have shown in Clark and Curran (2007) that the CCG parser gives state-of-the-art performance, outperforming the RASP parser (Briscoe et al., 2006) by over 5% on DepBank."
},
{
"sent_id": "07b062d569749924fa6ee1b2223411-C001-212",
"text": "This suggests the need for more comparisons of CRFs and discriminative methods such as the perceptron for other NLP tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"07b062d569749924fa6ee1b2223411-C001-11"
],
[
"07b062d569749924fa6ee1b2223411-C001-14"
],
[
"07b062d569749924fa6ee1b2223411-C001-39"
],
[
"07b062d569749924fa6ee1b2223411-C001-62"
],
[
"07b062d569749924fa6ee1b2223411-C001-99"
],
[
"07b062d569749924fa6ee1b2223411-C001-101"
]
],
"cite_sentences": [
"07b062d569749924fa6ee1b2223411-C001-11",
"07b062d569749924fa6ee1b2223411-C001-14",
"07b062d569749924fa6ee1b2223411-C001-39",
"07b062d569749924fa6ee1b2223411-C001-62",
"07b062d569749924fa6ee1b2223411-C001-99",
"07b062d569749924fa6ee1b2223411-C001-101"
]
},
"@USE@": {
"gold_contexts": [
[
"07b062d569749924fa6ee1b2223411-C001-19"
],
[
"07b062d569749924fa6ee1b2223411-C001-78"
],
[
"07b062d569749924fa6ee1b2223411-C001-124"
],
[
"07b062d569749924fa6ee1b2223411-C001-152"
]
],
"cite_sentences": [
"07b062d569749924fa6ee1b2223411-C001-19",
"07b062d569749924fa6ee1b2223411-C001-78",
"07b062d569749924fa6ee1b2223411-C001-124",
"07b062d569749924fa6ee1b2223411-C001-152"
]
},
"@MOT@": {
"gold_contexts": [
[
"07b062d569749924fa6ee1b2223411-C001-26",
"07b062d569749924fa6ee1b2223411-C001-27"
],
[
"07b062d569749924fa6ee1b2223411-C001-62",
"07b062d569749924fa6ee1b2223411-C001-63"
]
],
"cite_sentences": [
"07b062d569749924fa6ee1b2223411-C001-26",
"07b062d569749924fa6ee1b2223411-C001-62"
]
},
"@DIF@": {
"gold_contexts": [
[
"07b062d569749924fa6ee1b2223411-C001-139",
"07b062d569749924fa6ee1b2223411-C001-140"
]
],
"cite_sentences": [
"07b062d569749924fa6ee1b2223411-C001-139"
]
}
}
},
"ABC_c870d761c6fcd24de73f5bf98a9fd3_7": {
"x": [
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-81",
"text": "**MODELS**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-104",
"text": "**DATA**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-36",
"text": "**LATENT VARIABLE CONTEXT MODELS**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-57",
"text": "This measure takes values between 0 and 1."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-172",
"text": "**CONCLUSION**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-34",
"text": "**MODELS**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-2",
"text": "This paper investigates novel methods for incorporating syntactic information in probabilistic latent variable models of lexical choice and contextual similarity."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-3",
"text": "The resulting models capture the effects of context on the interpretation of a word and in particular its effect on the appropriateness of replacing that word with a potentially related one."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-4",
"text": "Evaluating our techniques on two datasets, we report performance above the prior state of the art for estimating sentence similarity and ranking lexical substitutes."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-7",
"text": "Distributional models of lexical semantics, which assume that aspects of a word's meaning can be related to the contexts in which that word is typically used, have a long history in Natural Language Processing (Sp\u00e4rck Jones, 1964; Harper, 1965) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-8",
"text": "Such models still constitute one of the most popular approaches to lexical semantics, with many proven applications."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-9",
"text": "Much work in distributional semantics treats words as non-contextualised units; the models that are constructed can answer questions such as \"how similar are the words body and corpse?\" but do not capture the way the syntactic context in which a word appears can affect its interpretation."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-10",
"text": "Recent developments (Mitchell and Lapata, 2008; Erk and Pad\u00f3, 2008; Thater et al., 2010; Grefenstette et al., 2011) have aimed to address compositionality of meaning in terms of distributional semantics, leading to new kinds of questions such as \"how similar are the usages of the words body and corpse in the phrase the body/corpse deliberated the motion. . . ?\" and \"how similar are the phrases the body deliberated the motion and the corpse rotted?\"."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-11",
"text": "In this paper we focus on answering questions of the former type and investigate models that describe the effect of syntactic context on the meaning of a single word."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-12",
"text": "The work described in this paper uses probabilistic latent variable models to describe patterns of syntactic interaction, building on the selectional preference models of \u00d3 S\u00e9aghdha (2010) and Ritter et al. (2010) and the lexical substitution models of Dinu and Lapata (2010) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-58",
"text": "In this paper we train LDA models of P (w|c) and P (c|w)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-13",
"text": "We propose novel methods for incorporating information about syntactic context in models of lexical choice, yielding a probabilistic analogue to dependency-based models of contextual similarity."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-14",
"text": "Our models attain state-of-the-art performance on two evaluation datasets: a set of sentence similarity judgements collected by Mitchell and Lapata (2008) and the dataset of the English Lexical Substitution Task (McCarthy and Navigli, 2009) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-15",
"text": "In view of the well-established effectiveness of dependency-based distributional semantics and of probabilistic frameworks for semantic inference, we expect that our approach will prove to be of value in a wide range of application settings."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-17",
"text": "**RELATED WORK**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-18",
"text": "The literature on distributional semantics is vast; in this section we focus on outlining the research that is most directly related to capturing effects of context and compositionality."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-19",
"text": "Mitchell and Lapata (2008) follow Kintsch (2001) in observing that most distributional approaches to meaning at the phrase or sentence level assume that the contribution of syntactic structure can be ignored and the meaning of a phrase is simply the commutative sum of the meanings of its constituent words."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-20",
"text": "As Mitchell and Lapata argue, this assumption clearly leads to an impoverished model of semantics."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-21",
"text": "Mitchell and Lapata investigate a number of simple methods for combining distributional word vectors, concluding that pointwise multiplication best corresponds to the effects of syntactic interaction."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-22",
"text": "Erk and Pad\u00f3 (2008) introduce the concept of a structured vector space in which each word is associated with a set of selectional preference vectors corresponding to different syntactic dependencies."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-23",
"text": "Thater et al. (2010) develop this geometric approach further using a space of second-order distributional vectors that represent the words typically co-occurring with the contexts in which a word typically appears."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-24",
"text": "The primary concern of these authors is to model the effect of context on word meaning; the work we present in this paper uses similar intuitions in a probabilistic modelling framework."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-25",
"text": "A parallel strand of research seeks to represent the meaning of larger compositional structures using matrix and tensor algebra (Smolensky, 1990; Rudolph and Giesbrecht, 2010; Baroni and Zamparelli, 2010; Grefenstette et al., 2011) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-26",
"text": "This nascent approach holds the promise of providing a much richer notion of context than is currently exploited in semantic applications."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-27",
"text": "Probabilistic latent variable frameworks for generalising about contextual behaviour (in the form of verb-noun selectional preferences) were proposed by Pereira et al. (1993) and Rooth et al. (1999) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-28",
"text": "Latent variable models are also conceptually similar to non-probabilistic dimensionality reduction techniques such as Latent Semantic Analysis (Landauer and Dumais, 1997)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-29",
"text": "More recently, \u00d3 S\u00e9aghdha (2010) and Ritter et al. (2010) reformulated Rooth et al.'s approach in a Bayesian framework using models related to Latent Dirichlet Allocation (Blei et al., 2003) , demonstrating that this \"topic modelling\" architecture is a very good fit for capturing selectional preferences."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-30",
"text": "Reisinger and Mooney (2010) investigate nonparametric Bayesian models for teasing apart the context distributions of polysemous words."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-31",
"text": "As described in Section 3 below, Dinu and Lapata (2010) propose an LDA-based model for lexical substitution; the techniques presented in this paper can be viewed as a generalisation of theirs."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-32",
"text": "Topic models have also been applied to other classes of semantic task, for example word sense disambiguation (Li et al., 2010) , word sense induction (Brody and Lapata, 2009 ) and modelling human judgements of semantic association (Griffiths et al., 2007) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-37",
"text": "In this paper we consider generative models of lexical choice that assign a probability to a particular word appearing in a given linguistic context."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-38",
"text": "In particular, we follow recent work (Dinu and Lapata, 2010; \u00d3 S\u00e9aghdha, 2010; Ritter et al., 2010) in assuming a latent variable model that associates contexts with distributions over a shared set of variables and associates each variable with a distribution over the vocabulary of word types:"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-39",
"text": "The set of latent variables Z is typically much smaller than the vocabulary size; this induces a (soft) clustering of the vocabulary."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-40",
"text": "Latent Dirichlet Allocation (Blei et al., 2003 ) is a powerful method for learning such models from a text corpus in an unsupervised way; LDA was originally applied to document modelling, but it has recently been shown to be very effective at inducing models for a variety of semantic tasks (see Section 2)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-41",
"text": "Given the latent variable framework in (1) we can develop a generative model of paraphrasing a word o with another word n in a particular context c:"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-42",
"text": "In words, the probability P (n|o, c) is the probability that n would be generated given the latent variable distribution associated with seeing o in context c; this latter distribution P (z|o, c) can be derived using Bayes' rule and the assumption P (o|z, c) = P (o|z)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-43",
"text": "Given a set of contexts C in which an instance o appears (e.g., it may be both the subject of a verb and modified by an adjective), (2) and (3) become:"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-44",
"text": "Equation (6) can be viewed as defining a \"product of experts\" model (Hinton, 2002) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-45",
"text": "Dinu and Lapata (2010) also use a similar formulation to (5), except that P (z|o, C) is factorised over P (z|o, C) rather than just P (z|C):"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-46",
"text": "In Section 5 below, we find that using (5) rather than (7) gives better results."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-47",
"text": "The model described above (henceforth C \u2192 T ) models the dependence of a target word on its context."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-48",
"text": "An alternative perspective is to model the dependence of a set of contexts on a target word, i.e., we induce a model"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-49",
"text": "Making certain assumptions, a formula for P (n|o, c) can be derived from (8):"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-50",
"text": "The assumption of a uniform prior P (n|o) on the choice of a paraphrase n for o is clearly not appropriate from a language modelling perspective (one could imagine an alternative P (n) based on corpus frequency), but in the context of measuring semantic similarity it serves well."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-51",
"text": "The T \u2192 C model for a set of contexts C is:"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-52",
"text": "With appropriate priors chosen for the distributions over words and latent variables, P (n|o, C) is a fully generative model of lexical substitution."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-53",
"text": "A non-generative alternative is one that estimates the similarity of the latent variable distributions associated with seeing n and o in context C. The principle that similarity between topic distributions corresponds to semantic similarity is well-known in document modelling and was proposed in the context of lexical substitution by Dinu and Lapata (2010) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-54",
"text": "In terms of the equations presented above, we could compare the distributions P (z|o, C) with P (z|n, C) using equations (5) or (16)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-55",
"text": "However, Thater et al. (2010) and Dinu and Lapata (2010) both observe that contextualising both o and n can degrade performance; in view of this we actually compare P (z|o, C) with P (z|n) and make the further simplifying assumption that P (z|n) \u221d P (n|z)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-56",
"text": "The similarity measure we adopt is the Bhattacharyya coefficient, which is a natural measure of similarity between probability distributions and is closely related to the Hellinger distance used in previous work on topic modelling (Blei and Lafferty, 2007) :"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-59",
"text": "In the former case, the analogy to document modelling is that each context type plays the role of a \"document\" consisting of all the words observed in that context in a corpus; for P (c|w) the roles are reversed."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-60",
"text": "The models are trained by Gibbs sampling using the efficient procedure of Yao et al. (2009) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-61",
"text": "The empirical estimates for distributions over words and latent variables are derived from the assignment of topics over the training corpus in a single sampling state."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-62",
"text": "For example, to model P (w|c) we calculate:"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-63",
"text": "where f_zw is the number of words of type w assigned topic z, f_zc is the number of times z is associated with context c, f_z\u00b7 and f_\u00b7c are the marginal topic and context counts respectively, N is the number of word types, and \u03b1 and \u03b2 parameterise the Dirichlet prior distributions over P (z|c) and P (w|z)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-64",
"text": "Following the recommendations of Wallach et al. (2009) we use asymmetric \u03b1 and symmetric \u03b2; rather than using fixed values for these hyperparameters we estimate them from data in the course of LDA training using an EM-like method."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-65",
"text": "We use standard settings for the number of training iterations (1000), the length of the burn-in period before hyperparameter estimation begins (200 iterations) and the frequency of hyperparameter estimation (50 iterations)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-67",
"text": "**CONTEXT TYPES**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-68",
"text": "We have not yet defined what the contexts c look like."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-69",
"text": "In vector space models of semantics it is common to distinguish between window-based and dependency-based models (Pad\u00f3 and Lapata, 2007) ; one can make the same distinction for probabilistic context models."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-70",
"text": "A broad generalisation is that window-based models capture semantic association (e.g. referee is associated with football), while dependency models capture a finer-grained notion of similarity (referee is similar to umpire but not to football)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-71",
"text": "Dinu and Lapata (2010) propose a window-based model of lexical substitution; the set of contexts in which a word appears is the set of surrounding words within a prespecified \"window size\"."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-72",
"text": "In this paper we also investigate dependencybased context sets derived from syntactic structure."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-73",
"text": "We use the estimation methods provided by the MALLET toolkit, available from http://mallet.cs.umass.edu/."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-74",
"text": "Mitchell and Lapata (2008) collected human judgements of semantic similarity for pairs of short sentences, where the sentences in a pair share the same subject but different verbs."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-75",
"text": "For example, the sales slumped and the sales declined should be judged as very similar while the shoulders slumped and the shoulders declined should be judged as less similar."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-76",
"text": "The resulting dataset (henceforth ML08) consists of 120 such pairs using 15 verbs, balanced across high and low expected similarity."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-77",
"text": "60 subjects rated the data using a scale of 1-7; Mitchell and Lapata calculate average inter-annotator correlation to be 0.40 (using Spearman's \u03c1)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-78",
"text": "Both Mitchell and Lapata and Erk and Pad\u00f3 (2008) split the data into a development portion and a test portion, the development portion consisting of the judgements of six annotators; in order to compare our results with previous research we use the same data split."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-79",
"text": "To evaluate performance, the predictions made by a model are compared to the judgements of each annotator in turn (using \u03c1) and the resulting per-annotator \u03c1 values are averaged."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-82",
"text": "All models were trained on the written section of the British National Corpus (around 90 million words), parsed with RASP (Briscoe et al., 2006) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-83",
"text": "The BNC was also used by Mitchell and Lapata (2008) and Erk and Pad\u00f3 (2008) ; as the ML08 dataset was compiled using words appearing more than 50 times in the BNC, there are no coverage problems caused by data sparsity."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-84",
"text": "We trained LDA models for the grammatical relations v:ncsubj:n and n:ncsubj and used these to create predictors of type C \u2192 T and T \u2192 C, respectively. Table 1 reports performance (average \u03c1) on the ML08 test set."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-85",
"text": "For each predictor, we trained five runs with 100 topics for 1000 iterations and averaged the predictions produced from their final states."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-86",
"text": "We investigate both the generative paraphrasing model (PARA) and the method of comparing topic distributions (SIM)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-87",
"text": "For both PARA and SIM we present results using each predictor type on its own as well as a combination of both types (T \u2194 C); for PARA the contributions of the types are multiplied and for SIM they are averaged."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-88",
"text": "3 One potential complication is that the PARA model is trained to predict P (n|c, o), which might not be comparable across different combinations of subject c and verb o. Using P (n|c, o) as a proxy for the desired joint distribution P (n, c, o) is tantamount to assuming a uniform distribution P (c, o), which can be defended on the basis that the choice of subject noun and reference verb is not directly relevant to the task."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-89",
"text": "As shown by the results below, this assumption seems to work reasonably well in practice."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-90",
"text": "As well as reporting correlations for straightforward averages of each set of five runs, we also investigate whether the development data can be used to select an optimal subset of runs."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-91",
"text": "This is done by simply evaluating every possible subset of 1-5 runs on the development data and picking the best-scoring subset."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-92",
"text": "Table 1 presents the results of the PARA and SIM predictors on the ML08 dataset."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-93",
"text": "The best results previously reported for this dataset were given by Erk and Pad\u00f3 (2008) , who measured average \u03c1 values of 0.24 for a vector multiplication method and 0.27 for their structured vector space (SVS) syntactic disambiguation method."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-94",
"text": "Even without using the development set to select models, performance is well above the previous state of the art for all predictors except PARA C\u2192T ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-95",
"text": "Model selection on the development data brings average \u03c1 up to 0.41, which is comparable to the human \"ceiling\" of 0.40 measured by Mitchell and Lapata."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-96",
"text": "In all cases the T \u2192 C predictors outperform C \u2192 T : models that associate target words with distributions over context clusters are superior to those that associate contexts with distributions over target words."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-97",
"text": "Figure 1 plots the beneficial effect of averaging over multiple runs; as the number of runs n is increased, the average performance over all combinations of n predictors chosen from the set of five T \u2192 C and five C \u2192 T runs is observed to increase monotonically."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-98",
"text": "Figure 1 also shows that the model selection procedure is very effective at selecting the optimal combination of models; development set performance is a reliable indicator of test set performance."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-100",
"text": "**RESULTS**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-102",
"text": "**EXPERIMENT 2: LEXICAL SUBSTITUTION**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-105",
"text": "The English Lexical Substitution task, run as part of the SemEval-1 competition, required participants to propose good substitutes for a set of target words in various sentential contexts (McCarthy and Navigli, 2009)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-106",
"text": "Table 2 shows two example sentences and the substitutes appearing in the gold standard, ranked by the number of human annotators who proposed each substitute."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-107",
"text": "The dataset contains a total of 2,010 annotated sentences with 205 distinct target words across four parts of speech (noun, verb, adjective, adverb)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-108",
"text": "In line with previous work on contextual disambiguation, we focus here on the subtask of ranking attested substitutes rather than proposing them from an unrestricted vocabulary."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-109",
"text": "To this end, a candidate set is constructed for each target word from all the substitutes proposed for that word in all sentences in the dataset."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-110",
"text": "The data contains a number of multiword paraphrases such as rush at; as our models (like most 1051 Realizing immediately that strangers have come, attack (5), rush at (1) the animals charge them and the horses began to fight."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-111",
"text": "Commission is the amount charged to execute a trade. levy (2), impose (1), take (1), demand (1) Table 2 : Examples for the verb charge from the English Lexical Substitution Task current models of distributional semantics) do not represent multiword expressions, we remove such paraphrases and discard the 17 sentences which have only multiword substitutes in the gold standard."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-112",
"text": "4 There are also 7 sentences for which the gold standard contains no substitutes."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-113",
"text": "This leaves a total of 1986 sentences."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-114",
"text": "These sentences were lemmatised and parsed with RASP."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-115",
"text": "Previous authors have partitioned the dataset in various ways."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-116",
"text": "Erk and Pad\u00f3 (2008) use only a subset of the data where the target is a noun headed by a verb or a verb heading a noun."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-117",
"text": "Thater et al. (2010) discard sentences which their parser cannot parse and paraphrases absent from their training corpus and then optimise the parameters of their model through four-fold cross-validation."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-118",
"text": "Here we aim for complete coverage on the dataset and do not perform any parameter tuning."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-119",
"text": "We use two measures to evaluate performance: Generalised Averaged Precision (Kishida, 2005 ) and Kendall's \u03c4 b rank correlation coefficient, which were used for this task by Thater et al. (2010) and Dinu and Lapata (2010) , respectively."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-120",
"text": "Generalised Averaged Precision (GAP) is a precision-like measure for evaluating ranked predictions against a gold standard."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-121",
"text": "\u03c4 b is a variant of Kendall's \u03c4 that is appropriate for data containing tied ranks."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-122",
"text": "We do not use the \"precision out of ten\" 1052 Table 3 : Dependency graph preprocessing measure that was used in the original Lexical Substitution Task; this measure assigns credit for the proportion of the first 10 proposed paraphrases that are present in the gold standard and in the context of ranking attested substitutes it is unclear how to obtain non-trivial results for target words with 10 or fewer possible substitutes."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-123",
"text": "We calculate statistical significance of performance differences using stratified shuffling (Yeh, 2000) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-126",
"text": "**MODELS**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-127",
"text": "We apply the models developed in Section 3.1 to the Lexical Substitution Task dataset using dependencyand window-based context information."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-128",
"text": "Here we only use the SIM predictor type."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-129",
"text": "PARA did not give satisfactory results; in particular, it tended to rank common words highly in most contexts."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-130",
"text": "6 As before we compiled training data by extracting target-context cooccurrences from a text corpus."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-131",
"text": "In addition to the parsed BNC described above we used a corpus of Wikipedia text consisting of over 45 million sentences (almost 1 billion words) parsed using the fast Combinatory Categorial Grammar (CCG) parser described by Clark et al. (2009) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-132",
"text": "The depen- 5 We use the software package available at http://www."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-133",
"text": "nlpado.de/\u02dcsebastian/sigf.html."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-134",
"text": "6 Favouring more general words may indeed make sense in some paraphrasing tasks (Nulty and Costello, 2010)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-135",
"text": "dency representation produced by this parser is interoperable with the RASP dependency format."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-136",
"text": "In order to focus our models on semantically discriminative information and make inference more tractable we ignored all parts of speech other than nouns, verbs, adjectives, prepositions and adverbs."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-137",
"text": "Stopwords and words of fewer than three characters were removed."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-138",
"text": "We also removed the very frequent but semantically weak lemmas be and have."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-139",
"text": "We compare two classes of context models: models learned from window-based contexts and models learned from syntactic dependency contexts."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-140",
"text": "For the syntactic models we extracted all dependencies and inverse dependencies between lemmas of the aforementioned POS types; in order to maximise the extraction yield, the dependency graph for each sentence was preprocessed using the transformations shown in Table 3 ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-141",
"text": "For the window-based context model we follow Dinu and Lapata (2010) in treating each word within five words of a target as a member of its context set."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-142",
"text": "It proved necessary to subsample the corpora in order to make LDA training tractable, especially for the window-based model where the training set of context-target counts is extremely dense (each instance of a word in the corpus contributes up to 10 context instances)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-143",
"text": "For the window-based data, we divided each context-target count by a factor of 5 and a factor of 70 for the BNC and Wikipedia corpora respectively, rounding fractional counts to the closest integer."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-144",
"text": "The choice of 70 for scaling Wikipedia counts is adopted from Dinu and Lapata (2010) , who used the same factor for the comparably sized English Gigaword corpus."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-145",
"text": "As the dependency data is an order of magnitude smaller we downsampled the Wikipedia counts by 5 and left the BNC counts untouched."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-146",
"text": "Finally, we created a larger corpus by combining the counts from the BNC and Wikipedia datasets."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-147",
"text": "Type and token counts for the BNC and combined corpora are given in Table 4 ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-148",
"text": "We trained three LDA predictors for each corpus: a window-based predictor (W5), a Context \u2192 Target predictor (C \u2192 T ) and a Target \u2192 Context predictor (T \u2192 C)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-149",
"text": "For W5 the sets of types and contexts should be symmetrical (in practice there is some discrepancy due to preprocessing artefacts)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-150",
"text": "For C \u2192 T , individual models were trained for each of the four target parts of speech; in each case the set 1053 Table 5 : Results on the English Lexical Substitution Task dataset; boldface denotes best performance at full coverage for each corpus of types is the vocabulary for that part of speech and the set of contexts is the set of dependencies taking those types as dependents."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-151",
"text": "For T \u2192 C we again train four models; the sets of types and contexts are reversed."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-152",
"text": "For the both corpora we trained models with Z = {600, 800, 1000, 1200} topics; for each setting of Z we ran five estimation runs."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-153",
"text": "Each individual prediction of similarity between P (z|C, o) and P (z|n) is made by averaging over the predictions of all runs and over all settings of Z. Choosing a single setting of Z does not degrade performance significantly; however, averaging over settings is a convenient way to avoid having to pick a specific value."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-154",
"text": "We also investigate combinations of predictor types, once again produced by averaging: we combine C \u2192 T with C \u2194 T (T \u2194 C) and combine each of these three models with W5."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-155",
"text": "Table 5 presents the results attained by our models on the Lexical Substitution Task data."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-156",
"text": "The dependency-based models have imperfect coverage (86% of the data); they can make no prediction when no syntactic context is provided for a target, perhaps as a result of parsing error."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-157",
"text": "The window-based models have perfect coverage, but score noticeably lower."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-158",
"text": "By combining dependency-and windowbased models we can reach high performance with perfect coverage."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-159",
"text": "All combinations outperform the corresponding W5 results to a statistically significant degree (p < 0.01)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-160",
"text": "Performance at full coverage is already very good (GAP= 48.6, \u03c4 b = 0.21) on the BNC corpus, but the best results are attained by W5 + T \u2194 C trained on the combined corpus (GAP= 49.5, \u03c4 b = 0.23)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-161",
"text": "The results for the W5 model trained on BNC data is comparable to that trained on the combined corpus; however the syntactic models show a clear benefit from the less sparse dependency data in the combined training corpus."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-163",
"text": "**RESULTS**"
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-164",
"text": "As remarked in Section 3.1, Dinu and Lapata (2010) use a slightly different formulation of P (z|C, o)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-165",
"text": "Using the window-based context model our formulation (5) outperforms (7) for both training corpora; the Dinu and Lapata (2010) Table 6 : Performance by part of speech Table 6 gives a breakdown of performance by target part of speech for the BNC+Wikipedia-trained W5 and W5 + T \u2194 C models, as well as figures provided by previous researchers."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-166",
"text": "7 W5 + T \u2194 C outperforms W5 on all parts of speech using both evaluation metrics."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-167",
"text": "As remarked above, previous researchers have used the corpus in slightly different ways; we believe that the results of Dinu and Lapata (2010) are fully comparable, while those of Thater et al. (2010) were attained on a slightly smaller dataset with parameters set through cross-validation."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-168",
"text": "The results for W5 + T \u2194 C outperform all of Dinu and Lapata's per-POS and overall results except for a slightly superior score on adverbs attained by their NMF model (\u03c4 b = 0.26 compared to 0.24)."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-169",
"text": "Turning to Thater et al., we report higher scores for every POS with the exception of the verbs where their Model 1 achieves 45.9 GAP compared to 45.1; the overall average for W5 + T \u2194 C is substantially higher at 49.5 compared to 44.6."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-170",
"text": "On balance, we suggest that our models do have an advantage over the current state of the art for lexical substitution."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-173",
"text": "In this paper we have proposed novel methods for modelling the effect of context on lexical meaning, demonstrating that information about syntactic context and textual proximity can fruitfully be integrated to produce state-of-the-art models of lexical choice."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-174",
"text": "We have demonstrated the effectiveness of our techniques on two datasets but they are potentially applicable to a range of applications where semantic disambiguation is required."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-175",
"text": "In future work, 7 The overall average GAP for Thater et al. (2010) does not appear in their paper but can be calculated from the score and number of instances listed for each POS."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-176",
"text": "we intend to adapt our approach for word sense disambiguation as well as related domain-specific tasks such as gene name normalisation (Morgan et al., 2008) ."
},
{
"sent_id": "c870d761c6fcd24de73f5bf98a9fd3-C001-177",
"text": "A further, more speculative direction for future research is to investigate more richly structured models of context, for example capturing correlations between words in a text within a framework similar to the Correlated Topic Model of Blei and Lafferty (2007) or more explicitly modelling polysemy effects as in Reisinger and Mooney (2010) ."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-12"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-38"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-119"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-141"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-144"
]
],
"cite_sentences": [
"c870d761c6fcd24de73f5bf98a9fd3-C001-12",
"c870d761c6fcd24de73f5bf98a9fd3-C001-38",
"c870d761c6fcd24de73f5bf98a9fd3-C001-119",
"c870d761c6fcd24de73f5bf98a9fd3-C001-141",
"c870d761c6fcd24de73f5bf98a9fd3-C001-144"
]
},
"@EXT@": {
"gold_contexts": [
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-12",
"c870d761c6fcd24de73f5bf98a9fd3-C001-13"
]
],
"cite_sentences": [
"c870d761c6fcd24de73f5bf98a9fd3-C001-12"
]
},
"@BACK@": {
"gold_contexts": [
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-31"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-53"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-55"
]
],
"cite_sentences": [
"c870d761c6fcd24de73f5bf98a9fd3-C001-31",
"c870d761c6fcd24de73f5bf98a9fd3-C001-53",
"c870d761c6fcd24de73f5bf98a9fd3-C001-55"
]
},
"@MOT@": {
"gold_contexts": [
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-55"
]
],
"cite_sentences": [
"c870d761c6fcd24de73f5bf98a9fd3-C001-55"
]
},
"@DIF@": {
"gold_contexts": [
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-164"
],
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-165"
]
],
"cite_sentences": [
"c870d761c6fcd24de73f5bf98a9fd3-C001-164",
"c870d761c6fcd24de73f5bf98a9fd3-C001-165"
]
},
"@SIM@": {
"gold_contexts": [
[
"c870d761c6fcd24de73f5bf98a9fd3-C001-167"
]
],
"cite_sentences": [
"c870d761c6fcd24de73f5bf98a9fd3-C001-167"
]
}
}
},
"ABC_8dd8c0e61010d97d0ddae6d81a9067_7": {
"x": [
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-109",
"text": "**MULTI-WORD UNITS**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-154",
"text": "We measure statistical significance using two different tests."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-34",
"text": "A selective unpacking algorithm allows the extraction of an n-best list of realisations where realisation ranking is based on a maximum entropy model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-35",
"text": "This unpacking algorithm is used in (Velldal and Oepen, 2005) to rank realisations with features defined over HPSG derivation trees."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-2",
"text": "We present a simple history-based model for sentence generation from LFG f-structures, which improves on the accuracy of previous models by breaking down PCFG independence assumptions so that more f-structure conditioning context is used in the prediction of grammar rule expansions."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-3",
"text": "In addition, we present work on experiments with named entities and other multi-word units, showing a statistically significant improvement of generation accuracy."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-4",
"text": "Tested on section 23 of the Penn Wall Street Journal Treebank, the techniques described in this paper improve BLEU scores from 66.52 to 68.82, and coverage from 98.18% to 99.96%."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-7",
"text": "Sentence generation, or surface realisation, is the task of generating meaningful, grammatically correct and fluent text from some abstract semantic or syntactic representation of the sentence."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-8",
"text": "It is an important and growing field of natural language processing with applications in areas such as transferbased machine translation (Riezler and Maxwell, 2006) and sentence condensation (Riezler et al., 2003) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-9",
"text": "While recent work on generation in restricted domains, such as (Belz, 2007) , has shown promising results there remains much room for improvement particularly for broad coverage and robust generators, like those of Nakanishi et al. (2005) and Cahill and van Genabith (2006) , which do not rely on handcrafted grammars and thus can easily be ported to new languages."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-10",
"text": "This paper is concerned with sentence generation from Lexical-Functional Grammar (LFG) fstructures (Kaplan, 1995) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-11",
"text": "We present improvements in previous LFG-based generation models firstly by breaking down PCFG independence assumptions so that more f-structure conditioning context is included when predicting grammar rule expansions."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-12",
"text": "This history-based approach has worked well in parsing (Collins, 1999; Charniak, 2000) and we show that it also improves PCFG-based generation."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-36",
"text": "They achieved the best results when combining the tree-based model with an n-gram language model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-37",
"text": "Nakanishi et al. (2005) describe a treebankextracted HPSG-based chart generator."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-38",
"text": "Importing techniques developed for HPSG parsing, they apply a log linear model to a packed representation of all alternative derivation trees for a given input."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-39",
"text": "They found that a model which included syntactic information outperformed a bigram model as well as a combination of bigram and syntax model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-40",
"text": "The probability model described in this paper also incorporates syntactic information, however, unlike the discriminative HPSG models just described, it is a generative history-and PCFG-based model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-41",
"text": "While Belz (2007) and Humphreys et al. (2001) mention the use of contextual features for the rules in their generation models, they do not provide details nor do they provide a formal probability model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-13",
"text": "We also present work on utilising named entities and other multi-word units to improve generation results for both accuracy and coverage."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-14",
"text": "There has been a limited amount of exploration into the use of multi-word units in probabilistic parsing, for example in (Kaplan and King, 2003) (LFG parsing) and (Nivre and Nilsson, 2004 ) (dependency parsing)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-15",
"text": "We are not aware of any similar work on generation."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-16",
"text": "In the LFG-based generation algorithm presented by Cahill and van Genabith (2006) complex named entities (i.e. those consisting of more than one word token) and other multi-word units can be fragmented in the surface realization."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-17",
"text": "We show that the identification of such units may be used as a simple measure to constrain the generation model's output."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-18",
"text": "We take the generator of (Cahill and van Genabith, 2006) as our baseline generator."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-19",
"text": "When tested on f-structures for all sentences from Section 23 of the Penn Wall Street Journal (WSJ) treebank (Mar-cus et al., 1993) , the techniques described in this paper improve BLEU score from 66.52 to 68.82."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-20",
"text": "In addition, coverage is increased from 98.18% to almost 100% (99.96%)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-21",
"text": "The remainder of the paper is structured as follows: in Section 2 we review related work on statistical sentence generation."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-22",
"text": "Section 3 describes the baseline generation model and in Section 4 we show how the new history-based model improves over the baseline."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-23",
"text": "In Section 5 we describe the source of the multi-word units (MWU) used in our experiments and the various techniques we employ to make use of these MWUs in the generation process."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-24",
"text": "Section 6 gives experimental details and results."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-26",
"text": "**RELATED WORK ON STATISTICAL GENERATION**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-27",
"text": "In (statistical) generators, sentences are generated from an abstract linguistic encoding via the application of grammar rules."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-28",
"text": "These rules can be handcrafted grammar rules, such as those of (LangkildeGeary, 2002; Carroll and Oepen, 2005) , created semi-automatically (Belz, 2007) or, alternatively, extracted fully automatically from treebanks (Bangalore and Rambow, 2000; Nakanishi et al., 2005; Cahill and van Genabith, 2006) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-29",
"text": "Insofar as it is a broad coverage generator, which has been trained and tested on sections of the WSJ corpus, our generator is closer to the generators of (Bangalore and Rambow, 2000; Langkilde-Geary, 2002; Nakanishi et al., 2005) than to those designed for more restricted domains such as weather forecast (Belz, 2007) and air travel domains (Ratnaparkhi, 2000) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-30",
"text": "Another feature which characterises statistical generators is the probability model used to select the most probable sentence from among the space of all possible sentences licensed by the grammar."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-31",
"text": "One generation technique is to first generate all possible sentences, storing them in a word lattice (Langkilde and Knight, 1998) or, alternatively, a generation forest, a packed represention of alternate trees proposed by the generator (Langkilde, 2000) , and then select the most probable sequence of words via an n-gram language model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-32",
"text": "Increasingly syntax-based information is being incorporated directly into the generation model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-42",
"text": "To the best of our knowledge this is the first paper providing a probabilistic generative, history-based generation model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-33",
"text": "For example, Carroll and Oepen (2005) describe a sentence realisation process which uses a hand-crafted HPSG grammar to generate a generation forest."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-43",
"text": "Cahill and van Genabith (2006) present a probabilistic surface generation model for LFG (Kaplan, 1995) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-44",
"text": "LFG is a constraint-based theory of grammar, which analyses strings in terms of c(onstituency)-structure and f(unctional)-structure ( Figure 1 )."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-45",
"text": "C-structure is defined in terms of CFGs, and f-structures are recursive attribute-value matrices which represent abstract syntactic functions (such as SUBJect, OBJect, OBLique, COMPlement (sentential), ADJ(N)unct), agreement, control, longdistance dependencies and some semantic information (e.g. tense, aspect)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-47",
"text": "**SURFACE REALISATION FROM F-STRUCTURES**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-48",
"text": "C-structures and f-structures are related in a projection architecture in terms of a piecewise correspondence \u03c6."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-49",
"text": "1 The correspondence is indicated in"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-50",
"text": "Figure 1: C-and f-structures with \u03c6 links for the sentence Susan contacted her."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-51",
"text": "terms of the curvy arrows pointing from c-structure nodes to f-structure components in Figure 1 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-52",
"text": "Given a c-structure node n i , the corresponding f-structure component f j is \u03c6(n i )."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-53",
"text": "F-structures and the cstructure/f-structure correspondence are described in terms of functional annotations on c-structure nodes (CFG grammar rules)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-54",
"text": "An equation of the form (\u2191F) = \u2193 states that the f-structure associated with the mother of the current c-structure node (\u2191) has an attribute (grammatical function) (F), whose value is the f-structure of the current node (\u2193)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-55",
"text": "The up-arrows and down-arrows are shorthand for \u03c6(M(n i )) = \u03c6(n i ) where n i is the c-structure node annotated with the equation."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-57",
"text": "T ree best := argmax Tree P (Tree|F-Str) (1)"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-58",
"text": "The generation model of (Cahill and van Genabith, 2006) maximises the probability of a tree given an f-structure (Eqn. 1), and the string generated is the yield of the highest probability tree."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-59",
"text": "The generation process is guided by purely local information in the input f-structure: f-structure annotated CFG rules (LHS \u2192 RHS) are conditioned on their LHSs and on the set of features/attributes Feats = {a i |\u2203v j \u03c6(LHS)a i = v j } 3 \u03c6-linked to the LHS (Eqn."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-60",
"text": "2)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-61",
"text": "Table 1 shows a generation grammar rule and conditioning features extracted from the example in Figure 1 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-62",
"text": "The probability of a tree is decomposed into the product of the probabilities of the f-structure annotated rules (conditioned on the LHS and local Feats) contributing to the tree."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-63",
"text": "Conditional probabilities are estimated using maximum likelihood estimation."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-65",
"text": "(2006) note that conditioning f-structure annotated generation rules on local features (Eqn. 2) can sometimes cause the model to make inappropriate choices."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-66",
"text": "Consider the following scenario where in addition to the c-/f-structure in Figure 1 , the training set contains the c-/f-structure displayed in Figure 2 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-68",
"text": "**CAHILL AND VAN GENABITH**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-69",
"text": "From Figures 1 and 2, the model learns (among others) the generation rules and conditional probabilities displayed in Tables 2 and 3 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-71",
"text": "**F-STRUCT FEATS**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-72",
"text": "Grammar Rules Prob Given the input f-structure (for She accepted) in Figure 3 , (and assuming suitable generation rules for intransitive VPs and accepted) the model would produce the inappropriate highest probability tree of Figure 4 with an incorrect case for the pronoun in subject position."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-73",
"text": "Figure 2: C-and f-structures with \u03c6 links for the sentence She hired her."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-74",
"text": "To solve the problem, Cahill and van Genabith (2006) apply an automatic generation grammar transformation to their training data: they automatically label CFG nodes with additional case information and the model now learns the new improved generation rules of Tables 4 and 5 . Note how the additional case labelling subverts the problematic independence assumptions of the probability model and communicates the fact that a subject NP has to be realised as nominative case from the S \u2192 NP-nom VP production, via the intermediate NP-nom \u2192 PRP-nom, down to the lexical production PRP-nom \u2192 she."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-75",
"text": "The labelling guarantees that, given the example f-structure in Figure 3 , the model generates the correct string she accepted."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-77",
"text": "**F-STRUCT FEATS GRAMMAR RULES PROB**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-78",
"text": "{PRED=PRO,NUM=SG PER=3, GEN=FEM} PRP(\u2191=\u2193) \u2192 she 0.33 {PRED=PRO,NUM=SG PER=3, GEN=FEM} PRP(\u2191=\u2193) \u2192 her 0.66\uf8ee \uf8ef \uf8ef \uf8ef \uf8ef \uf8f0 SUBJ \uf8ee \uf8f0 PRED pro NUM sg PERS 3 GEND fem \uf8f9 \uf8fb PRED accept TENSE past \uf8f9 \uf8fa \uf8fa \uf8fa \uf8fa \uf8fb"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-80",
"text": "**F-STRUCT FEATS**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-82",
"text": "Figure 4: Inappropriate output: her accepted."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-84",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-85",
"text": "**A HISTORY-BASED GENERATION MODEL**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-86",
"text": "The automatic generation grammar transform presented in (Cahill and van Genabith, 2006) provides a solution to coarse-grained and (in fact) inappropriate independence assumptions in the basic generation model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-87",
"text": "However, there is a sense in which the proposed cure improves on the symptoms, but not the cause of the problem: it weakens independence assumptions by multiplying and hence increasing the specificity of conditioning CFG category labels."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-88",
"text": "There is another option available to us, and that is the option we will explore in this paper: instead of applying a generation grammar transform, we will improve the f-structure-based conditioning of the generation rule probabilities."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-89",
"text": "In the original model, rules are conditioned on purely local f-structure context: the set of features/attributes \u03c6-linked to the LHS of a grammar rule."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-90",
"text": "As a direct consequence of this, the conditioning (and hence the model) cannot not distinguish between NP, PRP and NNP rules appropriate to e.g. subject (SUBJ) or object contexts (OBJ) in a given input f-structure."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-91",
"text": "However, the required information can easily be incorporated into the generation model by uniformly conditioning generation rules on their parent (mother) grammatical function, in addition to the local \u03c6-linked feature set."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-92",
"text": "This additional conditioning has the effect of making the choice of generation rules sensitive to the history of the generation process, and, we argue, provides a simpler, more uniform, general, intuitive and natural probabilistic generation model obviating the need for CFG-grammar transforms in the original proposal of (Cahill and van Genabith, 2006) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-93",
"text": "In the new model, each generation rule is now conditioned on the LHS rule CFG category, the set of features \u03c6-linked to LHS and the parent grammatical function of the f-structure \u03c6-linked to LHS."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-94",
"text": "In a given c-/f-structure pair, for a CFG node n, the parent grammatical function of the f-structure \u03c6-linked to n is that grammatical function GF, which, if we take the f-structure \u03c6-linked to the mother M(n), and apply it to GF, returns the f-structure \u03c6-linked to n:"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-95",
"text": "The basic idea is best explained by way of an example."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-96",
"text": "Consider again Figure 1 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-97",
"text": "The mother grammatical function of the f-structure f 2 associated with node NP(\u2191SUBJ=\u2193) and its daughter NNP(\u2191=\u2193) (via the \u2191=\u2193 functional annotation) is SUBJ, as (\u03c6(M(n 2 ))SUBJ) = \u03c6(n 2 ), or equivalently (f 1 SUBJ) = f 2 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-98",
"text": "Given Figures 1 and 2 as training set, the improved model learns the generation rules (the mother grammatical function of the outermost f-structure is assumed to be a dummy TOP grammatical function) of Tables 6 and 7."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-99",
"text": "F-Struct Feats Grammar Rules Note, that for our example the effect of the uniform additional conditioning on mother grammatical function has the same effect as the generation grammar transform of (Cahill and van Genabith, 2006 ), but without the need for the gram- mar transform."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-100",
"text": "Given the input f-structure in Figure 3 , the model will generate the correct string she accepted."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-101",
"text": "In addition, uniform conditioning on mother grammatical function is more general than the case-phenomena specific generation grammar transform of (Cahill and van Genabith, 2006) , in that it applies to each and every sub-part of a recursive input f-structure driving generation, making available relevant generation history (context) to guide local generation decisions."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-102",
"text": "The new history-based probabilistic generation model is defined as:"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-103",
"text": "Note that the new conditioning feature, the fstructure mother grammatical function, GF, is available from structure previously generated in the cstructure tree."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-104",
"text": "As such, it is part of the history of the tree, i.e. it has already been generated in the topdown derivation of the tree."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-105",
"text": "In this way, the generation model resembles history-based models for parsing (Black et al., 1992; Collins, 1999; Charniak, 2000) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-106",
"text": "Unlike, say, the parent annotation for parsing of (Johnson, 1998 ) the parent GF feature for a particular node expansion is not merely extracted from the parent node in the c-structure tree, but is sometimes extracted from an ancestor node further up the c-structure tree via intervening \u2191=\u2193 functional annotations."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-107",
"text": "Section 6 provides evaluation results for the new model on section 23 of the Penn treebank."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-110",
"text": "In another effort to improve generator accuracy over the baseline model we explored the use of multiword units in generation."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-111",
"text": "We expect that the identification of MWUs may be useful in imposing wordorder constraints and reducing the complexity of the generation task."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-112",
"text": "Take, for example, the following Figure 5 : Three different f-structure formats."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-113",
"text": "From left to right: the original f-structure format; the MWU chunk format; the MWU mark-up format."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-114",
"text": "two sentences which show the gold version of a sentence followed by the version of the sentence produced by the generator: The gold version of the sentence contains a multiword unit, New York, which appears fragmented in the generator output."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-115",
"text": "If multi-word units were either treated as one token throughout the generation process, or, alternatively, if a constraint were imposed on the generator such that multi-word units were always generated in the correct order, then this should help improve generation accuracy."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-116",
"text": "In Section 5.1 we describe the various techniques that were used to incorporate multi-word units into the generation process and in 5.2 we detail the different types and sources of multi-word unit used in the experiments."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-117",
"text": "Section 6 provides evaluation results on test and development sets from the WSJ treebank."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-119",
"text": "**INCORPORATING MWUS INTO THE GENERATION PROCESS**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-120",
"text": "We carried out three types of experiment which, in different ways, enabled the generation process to respect the restrictions on word-order provided by multi-word units."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-121",
"text": "For the first experiments (type 1), the WSJ treebank training and test data were altered so that multi-word units are concatenated into single words (for example, New York becomes New York)."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-122",
"text": "As in (Cahill and van Genabith, 2006) fstructures are generated from the (now altered) treebank and from this data, along with the treebank trees, the PCFG-based grammar, which is used for training the generation model, is extracted."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-123",
"text": "Similarly, the f-structures for the test and development sets are created from Penn Treebank trees which have been modified so that multi-word units form single units."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-124",
"text": "The leftmost and middle f-structures in Figure 5 show an example of an original f-structure format and a named-entity chunked format, respectively."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-125",
"text": "Strings output by the generator are then postprocessed so that the concatenated word sequences are converted back into single words."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-126",
"text": "In the second experiment (type 2) only the test data was altered with no concatenation of MWUs carried out on the training data."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-127",
"text": "In the final experiments (type 3), instead of concatenating named entities, a constraint is introduced to the generation algorithm which penalises the generation of sequences of words which violate the internal word order of named entities."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-128",
"text": "The input is marked-up in such a way that, although named entities are no longer chunked together to form single words, the algorithm can read which items are part of named entities."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-129",
"text": "See the rightmost f-structure in Figure 5 for an example of an f-structure markedup in this way."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-130",
"text": "The tag NE1 1, for example, indicates that the sub-f-structure is part of a named identity with id number 1 and that the item corresponds to the first word of the named entity."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-131",
"text": "The baseline generation algorithm, following Kay (1996) 's work on chart generation, already contains the hard constraint that when combining two chart edges they must cover disjoint sets of words."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-132",
"text": "We added an additional constraint which prevents edges from being combined if this would result in the generation of a string which contained a named entity which was either incomplete or where the words in the named entity were generated in the wrong order."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-134",
"text": "**TYPES OF MWUS USED IN EXPERIMENTS**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-135",
"text": "We carry out experiments with multi-word units from three different sources."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-136",
"text": "First, we use the output of the maximum entropy-based named entity recognition system of (Chieu and Ng, 2003) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-137",
"text": "This system identifies four types of named entity: person, organisation, location, and miscellaneous."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-138",
"text": "Additionally we use a dictionary of candidate multi-word expressions based on a list from the Stanford Multiword Expression Project 4 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-139",
"text": "Finally, we also carry out experiments with multi-word units extracted from the BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-140",
"text": "This supplements the Penn WSJ treebank's one million words of syntax-annotated Wall Street Journal text with additional annotations of 23 named entity types, including nominal-type named entities such as person, organisation, location, etc. as well as numeric types such as date, time, quantity and money."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-141",
"text": "Since the BBN corpus data is very comprehensive and is handannotated we take this be be a gold standard, representing an upper bound for any gains that might be made by identifying complex named entities in our experiments."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-142",
"text": "5 Table 8 gives examples of the various types of MWUs identified by the three sources."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-143",
"text": "For our purposes we are not concerned with the distinctions between different types of named entities; we are merely exploiting the fact that they may be treated as atomic units in the generation model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-144",
"text": "In all cases we disregard multi-word units that cross the original syntactic bracketing of the WSJ treebank."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-145",
"text": "An overview of the various types of multi-word units used in our experiments is presented in Table 9 ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-147",
"text": "**EXPERIMENTAL EVALUATION**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-148",
"text": "All experiments were carried out on the WSJ treebank with sections 02-21 for training, section 24 for development and section 23 for final test results."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-149",
"text": "The LFG annotation algorithm of (Cahill et al., 2004) was used to produce the f-structures for development, test and training sets."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-150",
"text": "4 mwe.stanford.edu 5 Although it is possible there are other types of MWUs that may be more suitable to the task than the named entities identified by BBN, so further gains might be possible."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-151",
"text": "Table 9 : Average number of MWUs per sentence and average MWU length in the WSJ treebank grouped by MWU source."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-152",
"text": "Table 10 shows the final results for section 23."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-153",
"text": "For each test we present BLEU score results as well as String Edit Distance and coverage."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-155",
"text": "First we use a bootstrap resampling method, popular for machine translation evaluations, to measure the significance of improvements in BLEU scores, with a resampling rate of 1000."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-156",
"text": "6 We also calculated the significance of an increase in String Edit Distance by carrying out a paired t-test on the mean difference of the String Edit Distance scores."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-157",
"text": "In Table 10 , means significant at level 0.005."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-158",
"text": "> means significant at level 0.05."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-159",
"text": "In Table 10 , Baseline gives the results of the generation algorithm of (Cahill and van Genabith, 2006) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-160",
"text": "HB Model refers to the improved model with the increased history context, as described in Section 4."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-161",
"text": "The results, where for example the BLEU score rises from 66.52 to 67.24, show that even increasing the conditioning context by a limited amount increases the accuracy of the system significantly for both BLEU and String Edit Distance."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-162",
"text": "In addition, coverage goes up from 98.18% to 99.88%."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-163",
"text": "+MWU Best Automatic displays our best results using automatically identified named entities."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-164",
"text": "These were achieved using experiment type 2, described in Section 5, with the MWUs produced by (Chieu and Ng, 2003) ."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-165",
"text": "Results displayed in Table 10 up to this point are cumulative."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-166",
"text": "The final row in Table 10, MWU BBN, shows the best results with BBN MWUs: the history-based model with BBN multiword units incorporated in a type 1 experiment."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-167",
"text": "We now discuss the various MWU experiments in more detail."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-168",
"text": "See Table 11 for a breakdown of the MWU experiment results on the development set, WSJ section 24."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-169",
"text": "Our baseline for these experiments is the history-based generator presented in Section 4."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-170",
"text": "For each experiment type described in Section 5.1 we ran three experiments, varying the source of MWUs."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-171",
"text": "First, MWUs came from the automatic NE recogniser of (Chieu and Ng, 2003) , then we added the MWUs from the Stanford list and finally we ran tests with MWUs extracted from the BBN corpus."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-172",
"text": "Our first set of experiments (type 1), where both training data and development set data were MWUchunked, produced the worst results for the automatically chunked MWUs."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-173",
"text": "BLEU score accuracy actually decreased for the automatically chunked MWU experiments."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-174",
"text": "In an error analysis of type 1 experiments with (Chieu and Ng, 2003) concatenated MWUs, we inspected those sentences where accuracy had decreased from the baseline."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-175",
"text": "We found that for over half (51.5%) of these sentences, the input f-structures contained no multi-word units at all."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-176",
"text": "The problem for these sentences therefore lay with the probabilistic grammar extracted from the MWUchunked training data."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-177",
"text": "When the source of MWU for the type 1 experiments was the BBN, however, accuracy improved significantly over the baseline and the result is the highest accuracy achieved over all experiment types."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-178",
"text": "One possible reason for the low accuracy scores in the type 1 experiments with the (Chieu and Ng, 2003) MWU chunked data could be noisy MWUs which negatively affect the grammar."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-179",
"text": "For example, the named entity recogniser of (Chieu and Ng, 2003) achieves an accuracy of 88.3% on section 23 of the Penn Treebank."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-180",
"text": "In order to avoid changing the grammar through concatenation of MWU components (as in experiment type 1) and thus risking side-effects which cause some heretofore likely constructions become less likely and vice versa, we ran the next set of experiments (type 2) which leave the original grammar intact and alter the input f-structures only."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-181",
"text": "These experiments were more successful overall and we achieved an improvement over the baseline for both BLEU and String Edit Distance scores with all MWU types."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-182",
"text": "As can be seen from Table 11 the best score for automatically chunked MWUs are with the (Chieu and Ng, 2003) MWUs."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-183",
"text": "Accuracy decreases marginally when we added the Stanford MWUs."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-184",
"text": "In our final set of experiments (type 3) although the accuracy for all three types of MWUs improves over the baseline, accuracy is a little below the type 2 experiments."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-185",
"text": "It is difficult to compare sentence generators since the information contained in the input varies greatly between systems, systems are evaluated on different test sets and coverage also varies considerably."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-186",
"text": "In order to compare our system with those of (Nakanishi et al., 2005) and (Langkilde-Geary, 2002) we report our best results with automatically acquired MWUs for sentences of \u2264 20 words in length on section 23: our system gets coverage of 100% and a BLEU score of 71.39."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-187",
"text": "For the same test set Nakanishi et al. (2005)"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-188",
"text": "----------------------------------"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-189",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-190",
"text": "We have presented techniques which improve the accuracy of an already state-of-art surface generation model."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-191",
"text": "We found that a history-based model that increases conditioning context in PCFG style rules by simply including the grammatical function of the f-structure parent, improves generator accuracy."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-192",
"text": "In the future we will experiment with increasing conditioning context further and using more sophisticated smoothing techniques to avoid sparse data problems when conditioning is increased."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-193",
"text": "We have also demonstrated that automatically acquired multi-word units can bring about moderate, but significant, improvements in generator accuracy."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-194",
"text": "For automatically acquired MWUs, we found that this could best be achieved by concatenating input items when generating the f-structure input to the generator, while training the input generation grammar on the original (i.e. non-MWU concatenated) sections of the treebank."
},
{
"sent_id": "8dd8c0e61010d97d0ddae6d81a9067-C001-195",
"text": "Relying on the BBN corpus as a source of multi-word units, we gave an upper bound to the potential usefulness of multi-word units in generation and showed that automatically acquired multi-word units, encouragingly, give results not far below the upper bound."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-9"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-16"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-28"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-58"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-74"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-86"
]
],
"cite_sentences": [
"8dd8c0e61010d97d0ddae6d81a9067-C001-9",
"8dd8c0e61010d97d0ddae6d81a9067-C001-16",
"8dd8c0e61010d97d0ddae6d81a9067-C001-28",
"8dd8c0e61010d97d0ddae6d81a9067-C001-58",
"8dd8c0e61010d97d0ddae6d81a9067-C001-74",
"8dd8c0e61010d97d0ddae6d81a9067-C001-86"
]
},
"@MOT@": {
"gold_contexts": [
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-16",
"8dd8c0e61010d97d0ddae6d81a9067-C001-17"
]
],
"cite_sentences": [
"8dd8c0e61010d97d0ddae6d81a9067-C001-16"
]
},
"@USE@": {
"gold_contexts": [
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-18"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-122"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-159"
]
],
"cite_sentences": [
"8dd8c0e61010d97d0ddae6d81a9067-C001-18",
"8dd8c0e61010d97d0ddae6d81a9067-C001-122",
"8dd8c0e61010d97d0ddae6d81a9067-C001-159"
]
},
"@DIF@": {
"gold_contexts": [
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-92"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-99"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-101"
]
],
"cite_sentences": [
"8dd8c0e61010d97d0ddae6d81a9067-C001-92",
"8dd8c0e61010d97d0ddae6d81a9067-C001-99",
"8dd8c0e61010d97d0ddae6d81a9067-C001-101"
]
},
"@SIM@": {
"gold_contexts": [
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-99"
],
[
"8dd8c0e61010d97d0ddae6d81a9067-C001-122"
]
],
"cite_sentences": [
"8dd8c0e61010d97d0ddae6d81a9067-C001-99",
"8dd8c0e61010d97d0ddae6d81a9067-C001-122"
]
}
}
},
"ABC_a7d6441ad365994edb41209e6405e0_8": {
"x": [
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-2",
"text": "Bidirectional long short-term memory (bi-LSTM) networks have recently proven successful for various NLP sequence modeling tasks, but little is known about their reliance to input representations, target languages, data set size, and label noise."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-3",
"text": "We address these issues and evaluate bi-LSTMs with word, character, and unicode byte embeddings for POS tagging."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-4",
"text": "We compare bi-LSTMs to traditional POS taggers across languages and data sizes."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-5",
"text": "We also present a novel bi-LSTM model, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-6",
"text": "The model obtains state-of-the-art performance across 22 languages, and works especially well for morphologically complex languages."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-7",
"text": "Our analysis suggests that biLSTMs are less sensitive to training data size and label corruptions (at small noise levels) than previously assumed."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-10",
"text": "Recently, bidirectional long short-term memory networks (bi-LSTM) (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997) have been used for language modelling (Ling et al., 2015) , POS tagging (Ling et al., 2015; Wang et al., 2015) , transition-based dependency parsing (Ballesteros et al., 2015; Kiperwasser and Goldberg, 2016) , fine-grained sentiment analysis (Liu et al., 2015) , syntactic chunking (Huang et al., 2015) , and semantic role labeling (Zhou and Xu, 2015) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-11",
"text": "LSTMs are recurrent neural networks (RNNs) in which layers are designed to prevent vanishing gradients."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-12",
"text": "Bidirectional LSTMs make a backward and forward pass through the sequence before passing on to the next layer."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-13",
"text": "For further details, see (Goldberg, 2015; Cho, 2015) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-14",
"text": "We consider using bi-LSTMs for POS tagging."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-15",
"text": "Previous work on using deep learning-based methods for POS tagging has focused either on a single language (Collobert et al., 2011; Wang et al., 2015) or a small set of languages (Ling et al., 2015; Santos and Zadrozny, 2014 )."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-16",
"text": "Instead we evaluate our models across 22 languages."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-17",
"text": "In addition, we compare performance with representations at different levels of granularity (words, characters, and bytes)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-18",
"text": "These levels of representation were previously introduced in different efforts (Chrupa\u0142a, 2013; Zhang et al., 2015; Ling et al., 2015; Santos and Zadrozny, 2014; Gillick et al., 2016; Kim et al., 2015) , but a comparative evaluation was missing."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-19",
"text": "Moreover, deep networks are often said to require large volumes of training data."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-20",
"text": "We investigate to what extent bi-LSTMs are more sensitive to the amount of training data and label noise than standard POS taggers."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-21",
"text": "Finally, we introduce a novel model, a bi-LSTM trained with auxiliary loss."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-22",
"text": "The model jointly predicts the POS and the log frequency of the next word."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-23",
"text": "The intuition behind this model is that the auxiliary loss, being predictive of word frequency, helps to differentiate the representations of rare and common words."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-24",
"text": "We indeed observe performance gains on rare and out-of-vocabulary words."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-25",
"text": "These performance gains transfer into general improvements for morphologically rich languages."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-26",
"text": "Contributions In this paper, we a) evaluate the effectiveness of different representations in biLSTMs, b) compare these models across a large set of languages and under varying conditions (data size, label noise) and c) propose a novel bi-LSTM model with auxiliary loss (LOGFREQ)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-28",
"text": "**TAGGING WITH BI-LSTMS**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-29",
"text": "Recurrent neural networks (RNNs) (Elman, 1990) allow the computation of fixed-size vector representations for word sequences of arbitrary length."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-30",
"text": "An RNN is a function that reads in n vectors x 1 , ..., x n and produces an output vector h n , that depends on the entire sequence x 1 , ..., x n ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-31",
"text": "The vector h n is then fed as an input to some classifier, or higher-level RNNs in stacked/hierarchical models."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-32",
"text": "The entire network is trained jointly such that the hidden representation captures the important information from the sequence for the prediction task."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-33",
"text": "A bidirectional recurrent neural network (bi-RNN) (Graves and Schmidhuber, 2005) is an extension of an RNN that reads the input sequence twice, from left to right and right to left, and the encodings are concatenated."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-34",
"text": "The literature uses the term bi-RNN to refer to two related architectures, which we refer to here as \"context bi-RNN\" and \"sequence bi-RNN\"."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-35",
"text": "In a sequence bi-RNN (bi-RNN seq ), the input is a sequence of vectors x 1:n and the output is a concatenation (\u2022) of a forward (f ) and reverse (r) RNN each reading the sequence in a different directions:"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-36",
"text": "In a context bi-RNN (bi-RNN ctx ), we get an additional input i indicating a sequence position, and the resulting vectors v i result from concatenating the RNN encodings up to i:"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-37",
"text": "Thus, the state vector v i in this bi-RNN encodes information at position i and its entire sequential context."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-38",
"text": "Another view of the context bi-RNN is of taking a sequence x 1:n and returning the corresponding sequence of state vectors v 1:n ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-81",
"text": "The only system we are aware of that evaluates on UD is Gillick et al. (2016) (last column)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-39",
"text": "LSTMs (Hochreiter and Schmidhuber, 1997 ) are a variant of RNNs that replace the cells of RNNs with LSTM cells that were designed to prevent vanishing gradients."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-40",
"text": "Bidirectional LSTMs are the bi-RNN counterpart based on LSTMs."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-41",
"text": "Our basic bi-LSTM tagging model is a context bi-LSTM taking as input word embeddings w. We incorporate subtoken information using an hierarchical bi-LSTM architecture (Ling et al., 2015; Ballesteros et al., 2015 )."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-42",
"text": "We compute subtokenlevel (either characters c or unicode byte b) embeddings of words using a sequence bi-LSTM at the lower level."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-43",
"text": "This representation is then concatenated with the (learned) word embeddings vector w which forms the input to the context bi-LSTM at the next layer."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-44",
"text": "This model, illustrated in Figure 1 (lower part in left figure), is inspired by Ballesteros et al. (2015) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-45",
"text": "We also test models in which we only keep sub-token information, e.g., either both byte and character embeddings (Figure 1 , right) or a single (sub-)token representation alone."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-46",
"text": "In our novel model, cf."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-47",
"text": "Figure 1 left, we train the bi-LSTM tagger to predict both the tags of the sequence, as well as a label that represents the log frequency of the next token as estimated from the training data."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-48",
"text": "Our combined cross-entropy loss is now: L(\u0177 t , y t ) + L(\u0177 a , y a ), where t stands for a POS tag and a is the log frequency label, i.e., a = int(log(f req train (w))."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-49",
"text": "Combining this log frequency objective with the tagging task can be seen as an instance of multi-task learning in which the labels are predicted jointly."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-50",
"text": "The idea behind this model is to make the representation predictive for frequency, which encourages the model to not share representations between common and rare words, thus benefiting the handling of rare tokens."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-51",
"text": "epochs, default learning rate (0.1), 128 dimensions for word embeddings, 100 for character and byte embeddings, 100 hidden states and Gaussian noise with \u03c3=0.2."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-52",
"text": "As training is stochastic in nature, we use a fixed seed throughout."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-53",
"text": "Embeddings are not initialized with pre-trained embeddings, except when reported otherwise."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-54",
"text": "In that case we use offthe-shelf polyglot embeddings (Al-Rfou et al., 2013) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-55",
"text": "2 No further unlabeled data is considered in this paper."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-56",
"text": "The code is released at: https: //github.com/bplank/bilstm-aux Taggers We want to compare POS taggers under varying conditions."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-57",
"text": "We hence use three different types of taggers: our implementation of a bi-LSTM; TNT (Brants, 2000) -a second order HMM with suffix trie handling for OOVs."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-58",
"text": "We use TNT as it was among the best performing taggers evaluated in Horsmann et al. (2015) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-59",
"text": "3 We complement the NN-based and HMM-based tagger with a CRF tagger, using a freely available implementation (Plank et al., 2014 ) based on crfsuite."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-61",
"text": "**DATASETS**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-62",
"text": "For the multilingual experiments, we use the data from the Universal Dependencies project v1.2 (Nivre et al., 2015) (17 POS) with the canonical data splits."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-63",
"text": "For languages with token segmentation ambiguity we use the provided gold segmentation."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-64",
"text": "If there is more than one treebank per language, we use the treebank that has the canonical language name (e.g., Finnish instead of Finnish-FTB)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-65",
"text": "We consider all languages that have at least 60k tokens and are distributed with word forms, resulting in 22 languages."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-66",
"text": "We also report accuracies on WSJ (45 POS) using the standard splits (Collins, 2002; Manning, 2011) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-67",
"text": "The overview of languages is provided in Table 1 ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-69",
"text": "**RESULTS**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-70",
"text": "Our results are given in Table 2 ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-71",
"text": "First of all, notice that TNT performs remarkably well across the 22 languages, closely followed by CRF."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-72",
"text": "The bi-LSTM tagger ( w) without lower-level bi-LSTM for subtokens falls short, outperforms the traditional taggers only on 3 languages."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-73",
"text": "The bi-LSTM 2 https://sites.google.com/site/rmyeid/ projects/polyglot 3 They found TreeTagger was closely followed by HunPos, a re-implementation of TnT, and Stanford and ClearNLP were lower ranked."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-74",
"text": "In an initial investigation, we compared Tnt, HunPos and TreeTagger and found Tnt to be consistently better than Treetagger, Hunpos followed closely but crashed on some languages (e.g., Arabic)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-75",
"text": "model clearly benefits from character representations."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-76",
"text": "The model using characters alone ( c) works remarkably well, it improves over TNT on 9 languages (incl."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-77",
"text": "Slavic and Nordic languages)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-78",
"text": "The combined word+character representation model is the best representation, outperforming the baseline on all except one language (Indonesian), providing strong results already without pre-trained embeddings."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-79",
"text": "This model ( w + c) reaches the biggest improvement (more than +2% accuracy) on Hebrew and Slovene."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-80",
"text": "Initializing the word embeddings (+POLYGLOT) with off-the-shelf languagespecific embeddings further improves accuracy."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-82",
"text": "However, note that these results are not strictly comparable as they use the earlier UD v1.1 version."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-83",
"text": "The overall best system is the multi-task bi-LSTM FREQBIN (it uses w + c and POLYGLOT initialization for w)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-84",
"text": "While on macro average it is on par with bi-LSTM w + c, it obtains the best results on 12/22 languages, and it is successful in predicting POS for OOV tokens (cf."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-85",
"text": "Table 2 OOV ACC columns), especially for languages like Arabic, Farsi, Hebrew, Finnish."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-86",
"text": "We examined simple RNNs and confirm the finding of Ling et al. (2015) that they performed worse than their LSTM counterparts."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-87",
"text": "Finally, the bi-LSTM tagger is competitive on WSJ, cf."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-88",
"text": "Table 3."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-89",
"text": "Rare words In order to evaluate the effect of modeling sub-token information, we examine accuracy rates at different frequency rates."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-90",
"text": "Figure 2 shows absolute improvements in accuracy of bi-LSTM w + c over mean log frequency, for different language families."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-91",
"text": "We see that especially for Slavic and non-Indoeuropean languages, having high morphologic complexity, most of the improvement is obtained in the Zipfian tail."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-92",
"text": "Rare tokens benefit from the sub-token representations."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-93",
"text": "Data set size Prior work mostly used large data sets when applying neural network based approaches (Zhang et al., 2015) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-94",
"text": "We evaluate how brittle such models are with respect to their more traditional counterparts by training bi-LSTM ( w + c without Polyglot embeddings) for increas-"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-96",
"text": "**WSJ ACCURACY**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-97",
"text": "Convnet (Santos and Zadrozny, 2014) 97.32 Convnet reimplementation (Ling et al., 2015) 96.80 Bi-RNN (Ling et al., 2015) 95.93 Bi-LSTM (Ling et al., 2015) 97.36"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-98",
"text": "Our bi-LSTM w+ c 97.22 ing amounts of training instances (number of sentences)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-99",
"text": "The learning curves in Figure 3 show similar trends across language families."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-100",
"text": "4 TNT is better with little data, bi-LSTM is better with more data, and bi-LSTM always wins over CRF."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-101",
"text": "The bi-LSTM model performs already surprisingly well after only 500 training sentences."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-102",
"text": "For non-Indoeuropean languages it is on par and above the other taggers with even less data (100 sentences)."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-124",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-103",
"text": "This shows that the bi-LSTMs often needs more data than the generative markovian model, but this is definitely less than what we expected."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-104",
"text": "Label Noise We investigated the susceptibility of the models to noise, by artificially corrupting training labels."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-105",
"text": "Our initial results show that at low noise rates, bi-LSTMs and TNT are affected similarly, their accuracies drop to a similar degree."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-106",
"text": "Only at higher noise levels (more than 30% corrupted labels), bi-LSTMs are less robust, showing higher drops in accuracy compared to TNT."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-107",
"text": "This is the case for all investigated language families."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-109",
"text": "**RELATED WORK**"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-110",
"text": "Character embeddings were first introduced by Sutskever et al. (2011) for language modeling."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-111",
"text": "Early applications include text classification (Chrupa\u0142a, 2013; Zhang et al., 2015) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-112",
"text": "Recently, these representations were successfully applied to a range of structured prediction tasks."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-113",
"text": "For POS tagging, Santos and Zadrozny (2014) were the first to propose character-based models."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-114",
"text": "They use a convolutional neural network (CNN; or convnet) and evaluated their model on English (PTB) and Portuguese, showing that the model achieves state-of-the-art performance close to taggers using carefully designed feature templates."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-115",
"text": "Ling et al. (2015) extend this line and compare a novel bi-LSTM model, learning word representations through character embeddings."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-116",
"text": "They evaluate their model on a language modeling and POS tagging setup, and show that bi-LSTMs outperform the CNN approach of Santos and Zadrozny (2014) ."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-117",
"text": "Similarly, Labeau et al. (2015) evaluate character embeddings for German."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-118",
"text": "Bi-LSTMs for POS tagging are also reported in Wang et al. (2015) , however, they only explore word embeddings, orthographic information and evaluate on WSJ only."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-119",
"text": "A related study is Cheng et al. (2015) who propose a multi-task RNN for named entity recognition by jointly predicting the next token and current token's name label."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-120",
"text": "Our model is simpler, it uses a very coarse set of labels rather then integrating an entire language modeling task which is computationally more expensive."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-121",
"text": "An interesting recent study is Gillick et al. (2016) , they build a single byte-to-span model for multiple languages based on a sequence-to-sequence RNN (Sutskever et al., 2014) achieving impressive results."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-122",
"text": "We would like to extend this work in their direction."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-125",
"text": "We evaluated token and subtoken-level representations for neural network-based part-of-speech tagging across 22 languages and proposed a novel multi-task bi-LSTM with auxiliary loss."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-126",
"text": "The auxiliary loss is effective at improving the accuracy of rare words."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-127",
"text": "Subtoken representations are necessary to obtain a state-of-the-art POS tagger, and character embeddings are particularly helpful for nonIndoeuropean and Slavic languages."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-128",
"text": "Combining them with word embeddings in a hierarchical network provides the best representation."
},
{
"sent_id": "a7d6441ad365994edb41209e6405e0-C001-129",
"text": "The bi-LSTM tagger is as effective as the CRF and HMM taggers with already as little as 500 training sentences, but is less robust to label noise (at higher noise rates)."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a7d6441ad365994edb41209e6405e0-C001-10"
],
[
"a7d6441ad365994edb41209e6405e0-C001-15"
],
[
"a7d6441ad365994edb41209e6405e0-C001-18"
]
],
"cite_sentences": [
"a7d6441ad365994edb41209e6405e0-C001-10",
"a7d6441ad365994edb41209e6405e0-C001-15",
"a7d6441ad365994edb41209e6405e0-C001-18"
]
},
"@MOT@": {
"gold_contexts": [
[
"a7d6441ad365994edb41209e6405e0-C001-18"
]
],
"cite_sentences": [
"a7d6441ad365994edb41209e6405e0-C001-18"
]
},
"@USE@": {
"gold_contexts": [
[
"a7d6441ad365994edb41209e6405e0-C001-41"
]
],
"cite_sentences": [
"a7d6441ad365994edb41209e6405e0-C001-41"
]
},
"@SIM@": {
"gold_contexts": [
[
"a7d6441ad365994edb41209e6405e0-C001-86"
]
],
"cite_sentences": [
"a7d6441ad365994edb41209e6405e0-C001-86"
]
},
"@DIF@": {
"gold_contexts": [
[
"a7d6441ad365994edb41209e6405e0-C001-118"
]
],
"cite_sentences": [
"a7d6441ad365994edb41209e6405e0-C001-118"
]
}
}
},
"ABC_1fd85a350d9ec7ac12151cfe4412e4_8": {
"x": [
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-29",
"text": "It uses multi-step attention mechanism."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-2",
"text": "Natural language generation (NLG) is an important component in spoken dialog systems (SDSs)."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-3",
"text": "A model for NLG involves sequence to sequence learning."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-4",
"text": "State-of-the-art NLG models are built using recurrent neural network (RNN) based sequence to sequence models (Du\u0161ek and Jurcicek, 2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-5",
"text": "Convolutional sequence to sequence based models have been used in the domain of machine translation but their application as natural language generators in dialogue systems is still unexplored."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-6",
"text": "In this work, we propose a novel approach to NLG using convolutional neural network (CNN) based sequence to sequence learning."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-7",
"text": "CNN-based approach allows to build a hierarchical model which encapsulates dependencies between words via shorter path unlike RNNs."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-8",
"text": "In contrast to recurrent models, convolutional approach allows for efficient utilization of computational resources by parallelizing computations over all elements, and eases the learning process by applying constant number of nonlinearities."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-9",
"text": "We also propose to use CNN-based reranker for obtaining responses having semantic correspondence with input dialogue acts."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-10",
"text": "The proposed model is capable of entrainment."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-11",
"text": "Studies using a standard dataset shows the effectiveness of the proposed CNN-based approach to NLG."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-12",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-13",
"text": "**INTRODUCTION**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-14",
"text": "In task-specific spoken dialogue systems (SDS), the function of natural language generation (NLG) components is to generate natural language response from a dialogue act (DA) (Young et al., 2009) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-15",
"text": "DA is a meaning representation specifying actions along with various attributes and their values."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-16",
"text": "NLG plays a very important role in realizing the overall quality of the SDS."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-17",
"text": "Entrainment to users way of speaking is essential for generating more natural and high quality natural language responses."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-18",
"text": "Most of the approaches for incorporating entrainment are rule-based models."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-19",
"text": "Recent advances have been in the direction of developing a fully trainable context aware NLG model (Du\u0161ek and Jurcicek, 2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-20",
"text": "However, all these approaches are based on recurrent sequence to sequence architecture."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-21",
"text": "Convolutional neural networks are largely unexplored in the domain of NLG for SDS inspite of having several advantages (Waibel et al., 1989; LeCun and Bengio, 1995) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-22",
"text": "Recurrent networks depend on the computations of previous time step and thus inhibits parallelization within a sequence."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-23",
"text": "Convolutional networks on the other hand, allows parallelization within a sequence resulting in efficient use of GPUs and other computational resources (Gehring et al., 2017) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-24",
"text": "Multi-block (multilayer) convolutional networks enable controlling the upper bound on the effective context size and form a hierarchical structure."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-25",
"text": "In contrast to the sequential structure of RNNs, hierarchical structure provides shorter paths for modeling longrange dependencies."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-26",
"text": "Recurrent networks apply variable number of nonlinearities to the inputs, whereas convolutional networks apply fixed number of nonlinearities which simplifies the learning (Gehring et al., 2017) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-27",
"text": "In this paper, we present a novel approach of using convolutional sequence to sequence model (ConvSeq2Seq) for the task of NLG."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-28",
"text": "ConvSeq2Seq generator is an encoder decoder model where convolutional neural networks (CNNs) are used to build both encoder and decoder states."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-30",
"text": "In the decoding phase, beam search is implemented and nbest natural language responses are chosen."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-31",
"text": "The n-best beam search responses from ConvSeq2Seq generator may have some missing and/or irrelevant information."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-32",
"text": "To address this, we propose to rank the n-best outputs from ConvSeq2Seq generator using convolutional reranker (CNN reranker)."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-33",
"text": "CNN reranker implements one dimensional convolution on beam search responses and generates binary vectors."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-34",
"text": "These binary vectors are used to penalize the responses having missing and/or irrelevant information."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-35",
"text": "We evaluate our model on the Alex Context natural language generation (NLG) dataset of Du\u0161ek and Jurcicek (2016a) and demonstrate that our model outperforms the RNNbased model of Du\u0161ek and Jurcicek (2016a) (TGen model) in automatic metrics."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-36",
"text": "Training time of proposed model is observed to be significantly lower than TGen model."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-37",
"text": "The main contributions of this work are (i) ConvSeq2Seq generator for NLG and (ii) CNN-based reranker for ranking n-best beam search responses for obtaining semantically appropriate responses with respect to input DA."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-38",
"text": "The rest of this paper is organized as follows."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-39",
"text": "Section 2 gives a brief review of different approaches to NLG."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-40",
"text": "In Section 3, proposed convolutional natural language generator (ConvSeq2Seq) is described along with CNN reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-41",
"text": "The experimental studies are presented in Section 4 and conclusions are given in Section 5."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-43",
"text": "**RELATED WORK**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-44",
"text": "Natural language generation (NLG) task is divided into two phases: sentence planning and surface realization."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-45",
"text": "Sentence planning generates intermediate structure such as dependency trees or templates modeling the input semantic symbols."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-46",
"text": "Surface realization phase converts the intermediate structure into the final natural language response."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-47",
"text": "Conventional approaches to NLG are rule based approaches (Stent et al., 2004; Walker et al., 2002) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-48",
"text": "Most recent NLG approaches include sequence to sequence RNN models (Wen et al., 2015a,b; Du\u0161ek and Jurcicek, 2016b,a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-49",
"text": "Sequence to sequence learning is to map the input sequence to a fixed sized vector using one RNN, and then to map the vector to the target sequence with another RNN."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-50",
"text": "In (Wen et al., 2015a) , a sequence to sequence RNN model is used with some decay factor to avoid vanishing gradient problem."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-51",
"text": "The n-best outputs generated by the model are ranked using a CNN-based reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-52",
"text": "The model also uses a backward sequence to sequence RNN reranker to further improve the performance."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-53",
"text": "Model proposed by Wen et al. (2015b) is a statistical language generator based on a semantically controlled long-short term memory (LSTM) structure."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-54",
"text": "The LSTM generator can learn from unaligned data by jointly optimizing sentence planning and surface realization using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-55",
"text": "Model proposed by Du\u0161ek and Jurcicek (2016b) serves as a sequence to sequence generation model for SDS which doesn't take into account the context."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-56",
"text": "The model uses single layer sequence to sequence RNN encoder decoder architecture along with attention mechanism to generate n-best output utterances."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-57",
"text": "It then uses RNN reranker to rank the n-best outputs of generator to get the utterance which best describes the input DA."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-58",
"text": "The model can also be used to generate deep syntax trees which can be converted to output utterance using a surface realization mechanism."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-59",
"text": "This model is context unaware because it takes into account only the input DA and no preceding user utterance(s)."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-60",
"text": "This leads to generation of very rigid responses and also inhibits flexible interactions."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-61",
"text": "Context awareness adapts/entrains to the user's way of speaking and thereby generates responses of high quality and naturalness."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-62",
"text": "The semantic meaning which is required to be given in response to a query is very well modelled if context awareness is taken into account."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-63",
"text": "This leads to generation of more informative response."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-64",
"text": "Model proposed by Du\u0161ek and Jurcicek (2016a) serves as a baseline sequence to sequence generation model (TGen model) for SDS which takes into account the context."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-65",
"text": "The model takes into account the preceding user utterance while generating natural language output."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-66",
"text": "The model implemented three modifications to the model proposed by Du\u0161ek and Jurcicek (2016b) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-67",
"text": "The first modification was prepending context to the input DAs."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-68",
"text": "The second modification was implementing a separate encoder for user utterances/contexts."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-69",
"text": "The third modification was implementing a N-gram match reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-70",
"text": "This reranker is based on n-gram precision scores and promotes responses having phrase overlaps with user utterances (Du\u0161ek and Jurcicek, 2016a )."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-71",
"text": "In the next section, we present the proposed CNN-based sequence to sequence generator for NLG."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-73",
"text": "**PROPOSED APPROACH**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-74",
"text": "The pipeline of the proposed approach for NLG is shown in Figure 1 ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-75",
"text": "Input DA with prepended context is first given to convolutional sequence to sequence generator (ConvSeq2Seq) to get nbest natural language responses or hypotheses (n is beam size)."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-76",
"text": "These n-best hypotheses and binary vector representation of input DA are given as input to CNN reranker to get the misfit penalties of the hypotheses."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-77",
"text": "The n-best hypotheses and context user utterance are given as input to the N-gram match reranker to get bigram precision scores of hypotheses."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-78",
"text": "Final rank of each hypothesis i where 1 \u2264 i \u2264 n is calculated as follows:"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-79",
"text": "Here, we get log probabilities from ConvSeq2Seq generator, bigram precision scores from N-gram match reranker and misfit penalties from CNN reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-80",
"text": "Here, \u03c9 and W are constants."
},
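The extracted text describes this hypothesis-ranking step but omits the equation itself. A minimal Python sketch, assuming the simple weighted linear combination suggested by the surrounding description (generator log-probability, plus ω-weighted bigram precision, minus W-weighted misfit penalty); the constant values here are illustrative, not the paper's:

```python
def rank_score(log_prob, bigram_precision, misfit_penalty, omega=1.0, W=100.0):
    """Combine generator log-probability, N-gram match reward and
    CNN-reranker misfit penalty into one score; higher is better."""
    return log_prob + omega * bigram_precision - W * misfit_penalty

# Three hypothetical beam hypotheses.
hyps = [
    {"log_prob": -1.2, "prec": 0.9, "penalty": 0},  # fluent and semantically correct
    {"log_prob": -0.8, "prec": 0.7, "penalty": 2},  # most likely, but misses 2 slots
    {"log_prob": -2.5, "prec": 0.4, "penalty": 0},
]
best = max(hyps, key=lambda h: rank_score(h["log_prob"], h["prec"], h["penalty"]))
```

With a large W, the semantically deficient hypothesis is eliminated even though the generator assigned it the highest probability.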
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-81",
"text": "We implement the N-gram match reranker as given by Du\u0161ek and Jurcicek (2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-82",
"text": "We describe the proposed convolutional sequence to sequence generator in Section 3.1 and convolutional reranker in Section 3.2."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-84",
"text": "**CONVSEQ2SEQ GENERATOR**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-85",
"text": "The proposed sequence to sequence generator is based on convolutional sequence to sequence approach proposed by Gehring et al. (2017) 1 ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-86",
"text": "It is a CNN-based encoder decoder architecture."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-87",
"text": "Figure 2 shows the working of proposed ConvSeq2Seq generator on an input instance from training dataset."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-88",
"text": "In this architecture, CNNs are used to compute the encoder states and decoder states."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-89",
"text": "This architecture is based on succession of convolutional blocks/layers."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-90",
"text": "Input sequence is represented as a combination of word and position embeddings."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-91",
"text": "These embeddings are operated upon by first convolutional block and gated linear units (GLUs) to get the outputs for the first block."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-92",
"text": "This can be seen in Figure 2 where only one convolutional block is shown for representation purpose."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-93",
"text": "The output from first block is input to the second convolutional block and this succession follows till the last convolutional block."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-94",
"text": "Stacking of several convolutional layers/blocks allows to increase and control the effective context size."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-95",
"text": "For example, stacking 10 layers of convolutional blocks, each having a kernel width of k=4, results in effective context size of 31 elements."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-96",
"text": "Each output is dependent on 31 inputs."
},
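The receptive-field arithmetic above can be checked with a small helper (the formula is the standard one for stacked 1-D convolutions, consistent with the 10-block, k=4 example in the text):

```python
def effective_context(num_blocks, kernel_width):
    """Receptive field of a stack of 1-D convolutional blocks:
    each block of width k widens the context by k - 1 positions."""
    return 1 + num_blocks * (kernel_width - 1)

# The text's example: 10 blocks with kernel width k=4 -> 31 input elements.
print(effective_context(10, 4))  # 31
```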
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-97",
"text": "Stacking of several convolutional layers/blocks results in a hierarchical structure."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-98",
"text": "In hierarchical structure, nearby elements interact at lower blocks and distant elements at higher blocks."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-99",
"text": "It provides a shorter path for modeling long-range dependencies and eases discovery of compositional structure in sequences compared to sequential structure of RNNs."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-100",
"text": "For example, to model dependencies between n words, only O ( n k ) convolutional operations would be required by CNN in contrast to O(n) operations in RNN."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-101",
"text": "RNNs over-process the first word and under-process the last word, Figure 2 : Working of ConvSeq2Seq generator on an input instance from training dataset."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-102",
"text": "Here, for representation purpose, encoder and decoder consists of only one convolutional layer with kernel width k=3."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-103",
"text": "The encoder input sequence x =(how, far, is, that, inform, distance, X-distance) comprises of user context \"how far is that\" prepended to input DA \"inform distance X-distance\"."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-104",
"text": "whereas a constant number of kernels and nonlinearities are applied to the inputs of CNN which eases the learning process (Gehring et al., 2017 )."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-105",
"text": "The ConvSeq2Seq model uses position embeddings in addition to word embeddings in order to get a sense of which part of the input sequence it is currently processing (Gehring et al., 2017) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-106",
"text": "Let e = (w 1 + p 1 ,. . . ,w s + p s ) be the input sequence representation, where w = (w 1 ,. . . ,w s ) and p = (p 1 ,. ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-107",
"text": ". ,p s ) are the the word embeddings and positional embeddings of the input sequence x = (x 1 ,. . . ,x s ) (having s elements) to the encoder network respectively."
},
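The element-wise sum of word and position embeddings can be sketched as follows (toy dimensions; the embedding tables here are random stand-ins, not the paper's learned parameters):

```python
import numpy as np

def input_representation(token_ids, word_emb, pos_emb):
    """e_i = w_i + p_i: sum of word embedding and position embedding
    for each token, stacked into an (s, d) matrix."""
    return np.stack([word_emb[t] + pos_emb[i] for i, t in enumerate(token_ids)])

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(50, 8))  # toy vocabulary of 50 words, d = 8
pos_emb = rng.normal(size=(20, 8))   # max sequence length 20
e = input_representation([3, 7, 1], word_emb, pos_emb)  # s = 3 tokens
```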
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-108",
"text": "Intermediate states are computed based on a fixed number of input elements."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-109",
"text": "In encoding phase, input is padded with k\u22121 2 elements on the left and right side with zero vectors."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-110",
"text": "For each block l, the output z l = (z l 1 ,. . . ,z l s ) is computed as follows:"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-111",
"text": "] is the input A \u2208 R kd from the previous block, W l z \u2208 R 2d\u00d7kd , b l z \u2208 R 2d are parameters of convolution kernel and d is embedding dimension."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-112",
"text": "Let B \u2208 R 2d be the output of convolution kernel."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-113",
"text": "\u03bd() is gated linear unit(GLU) which is the nonlinearity function applied to the output B of convolution kernel."
},
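A minimal NumPy sketch of one encoder convolution block as described above: a window of k consecutive states is flattened into A ∈ R^{kd}, the kernel maps it to 2d channels, and a GLU halves them back to d. The residual connection to the previous block's state is an assumption carried over from Gehring et al. (2017), since the extracted text does not show Equation (1) itself:

```python
import numpy as np

def glu(b):
    """Gated linear unit: b has 2d channels; output a * sigmoid(g) has d."""
    a, g = np.split(b, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-g)))

def conv_block(z_prev, W, bias, k):
    """One convolutional block: for each position, concatenate k consecutive
    states from the previous block (zero-padded by (k-1)//2 on each side),
    apply the kernel W in R^{2d x kd} plus bias, then GLU and a residual."""
    s, d = z_prev.shape
    pad = (k - 1) // 2
    padded = np.concatenate([np.zeros((pad, d)), z_prev, np.zeros((pad, d))])
    out = np.empty_like(z_prev)
    for i in range(s):
        A = padded[i:i + k].reshape(-1)          # window flattened to R^{kd}
        out[i] = glu(W @ A + bias) + z_prev[i]   # GLU nonlinearity + residual
    return out

rng = np.random.default_rng(1)
d, k, s = 4, 3, 5
z0 = rng.normal(size=(s, d))
W = rng.normal(size=(2 * d, k * d)) * 0.1
z1 = conv_block(z0, W, np.zeros(2 * d), k)
```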
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-114",
"text": "Let z u be the encoder output from the last block u. Let g = (g 1 ,. . . ,g t ) be the representation of the sequence that is being fed to the decoder network."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-115",
"text": "Computation of g is similar to that of encoder network."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-116",
"text": "Input to decoder is padded with k-1 elements on both left and right side with zero vectors to prevent decoder from having access to future information."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-117",
"text": "As a result, last k-1 intermediate decoder outputs are removed."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-118",
"text": "In decoding phase, for each block l, the output h l = (h l 1 ,. . . ,h l t ) is computed as follows:"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-119",
"text": "Here, q l = (q l 1 , . . . , q l t ) is the intermediate decoder output and its computation is similar to that of encoder network."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-120",
"text": "For computing attention, current intermediate decoder state q l i is combined with the embedding of the previous target element g i as shown in Equation (2) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-121",
"text": "Equation (3) computes attention of i-th decoder state and j-th encoder output element for the l-th decoder block."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-122",
"text": "Equation (4) computes the conditional input which is weighted sum of combination of encoder outputs and input embeddings."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-123",
"text": "Equation (5) computes the current decoder output which is combination of conditional input, intermediate decoder output and previous layer decoder output."
},
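Equations (2)-(4) are not shown in the extracted text; the following is a hedged sketch of the multi-step attention as defined in Gehring et al. (2017), on which this generator is based (combine state with previous target embedding, dot-product scores against encoder outputs, softmax, then a weighted sum over encoder outputs plus input embeddings):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def decoder_attention(q_i, g_i, z_u, e_enc, W_d, b_d):
    """d_i = W_d q_i + b_d + g_i            (combine state and previous target)
    a_ij = softmax_j(d_i . z_j)             (attention weights)
    c_i = sum_j a_ij (z_j + e_j)            (conditional input)"""
    d_i = W_d @ q_i + b_d + g_i
    a = softmax(z_u @ d_i)
    return a @ (z_u + e_enc)

rng = np.random.default_rng(2)
dim, s = 4, 6
z_u = rng.normal(size=(s, dim))    # encoder outputs from the last block
e_enc = rng.normal(size=(s, dim))  # encoder input embeddings
q_i = rng.normal(size=dim)         # intermediate decoder state
g_i = rng.normal(size=dim)         # previous target element embedding
c_i = decoder_attention(q_i, g_i, z_u, e_enc, np.eye(dim), np.zeros(dim))
```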
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-124",
"text": "Let h L i be the decoder output of i-th element and the final decoding block L. Distribution over T possible next target elements y i+1 is computed as follows:"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-125",
"text": "Here, \u03b6 is softmax function, W o and b o are the weights and bias of fully connected linear layer."
},
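The output layer described above, ζ(W_o h^L_i + b_o), can be sketched directly (toy vocabulary size; random weights are stand-ins):

```python
import numpy as np

def next_token_distribution(h_Li, W_o, b_o):
    """Softmax over T target elements: zeta(W_o h + b_o)."""
    logits = W_o @ h_Li + b_o
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(3)
dist = next_token_distribution(rng.normal(size=4),
                               rng.normal(size=(10, 4)),  # T = 10, d = 4
                               np.zeros(10))
```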
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-127",
"text": "**CNN RERANKER**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-128",
"text": "The n-best beam search responses from ConvSeq2Seq model may have missing information and/or irrelevant information."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-129",
"text": "CNN reranker reranks the n-best beam search responses and heavily penalizes those responses which are not semantically in correspondence with the input DA."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-130",
"text": "Responses having missing information and/or irrelevant information are heavily penalized."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-131",
"text": "Convolutional networks are excellent feature extractors and have achieved state-of-the-art results in many text classification and sentence-level classification tasks such as sentiment analysis, question classification, etc (Kim, 2014; Kalchbrenner et al., 2014) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-132",
"text": "This classifier takes as input a natural language response and outputs a binary vector."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-133",
"text": "Each element of binary vector is a binary decision on the presence of DA type or slot-value combinations."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-134",
"text": "For the dataset which we have used (Du\u0161ek and Jurcicek, 2016a) , there are 19 such classes of DA types and slot-value combinations."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-135",
"text": "These 19 classes are shown in Figure 3 ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-136",
"text": "Input DAs are converted to similar binary vector."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-137",
"text": "Hamming distance between the classifier output and binary vector representation of input DA is considered as reranking penalty."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-138",
"text": "The weighted reranking penalties of all the n-best responses are subtracted from their log-probabilities similar to Du\u0161ek and Jurcicek (2016a) ."
},
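The Hamming-distance penalty described above is straightforward to sketch over the 19-class binary encoding (a shorter toy vector is used here for illustration):

```python
def misfit_penalty(pred_vec, da_vec):
    """Hamming distance between the classifier's binary output and the
    binary encoding of the input DA: each missing or spurious
    DA-type / slot-value class adds 1 to the penalty."""
    assert len(pred_vec) == len(da_vec)
    return sum(p != d for p, d in zip(pred_vec, da_vec))

da = [1, 0, 1, 0, 0]    # classes present in the input DA
ok = [1, 0, 1, 0, 0]    # response realizes exactly those classes
bad = [1, 0, 0, 0, 1]   # one class missing, one irrelevant class added
```

A response matching the DA exactly receives penalty 0; `bad` receives penalty 2, which the weighted reranking step subtracts from its log-probability.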
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-139",
"text": "The architecture and working of the CNN reranker on an input instance from training dataset is shown in Figure 4 ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-140",
"text": "It is based on the CNN architecture proposed for sentence classification by Kim (2014) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-141",
"text": "Input is a natural language response x = (x 1 , x 2 , . . . , x n ) where x i 's are word embeddings each having m dimensions, resulting in a input matrix of n\u00d7m dimensions."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-142",
"text": "Each filter has the width equal to the size of word embeddings, i.e., m and its height specifies the number of words it will operate on."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-143",
"text": "This one dimensional convolution is followed by applying activation function and 1-max pooling."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-144",
"text": "The resulting feature vector has the dimension equal to the total number of filters."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-145",
"text": "This penultimate layer is operated upon by a logistic layer to output the binary vector."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-146",
"text": "Given penultimate layer feature vector t, the output binary vector y is computed as: Here, \u03c3 is sigmoid activation function, W f is the weight matrix and b is the bias vector."
},
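The whole forward pass described above (per-filter convolution over whole words, pooling, then the logistic layer y = σ(W_f t + b)) can be sketched in NumPy. The ReLU activation is an assumption, as the text names the activation only for the output layer; weights are random stand-ins:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cnn_reranker_forward(X, filters, W_f, b):
    """CNN classifier forward pass: each filter spans the full embedding
    width m and h consecutive words, followed by an activation and 1-max
    pooling; a logistic layer maps the pooled v-dimensional feature vector
    to per-class probabilities, thresholded to a binary vector."""
    n, m = X.shape
    feats = []
    for F in filters:                        # F has shape (h, m)
        h = F.shape[0]
        convs = [np.sum(X[i:i + h] * F) for i in range(n - h + 1)]
        feats.append(max(0.0, max(convs)))   # ReLU (assumed) + 1-max pooling
    t = np.array(feats)                      # v-dimensional feature vector
    y = sigmoid(W_f @ t + b)
    return (y > 0.5).astype(int)             # binary decisions per class

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 16))                # 12 words, embedding size m = 16
filters = [rng.normal(size=(h, 16)) for h in (3, 5, 7)]  # v = 3 toy filters
W_f, b = rng.normal(size=(19, 3)), np.zeros(19)          # 19 output classes
y = cnn_reranker_forward(X, filters, W_f, b)
```

Note that the pooled feature vector is v-dimensional (one value per filter), which is the efficiency point made about this reranker versus the v·m-dimensional features of Wen et al. (2015a).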
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-147",
"text": "The model proposed by Wen et al. (2015a) implements a CNN reranker that uses onedimensional filters where convolutional operations are carried out on segments of words."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-148",
"text": "It uses padding vectors."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-149",
"text": "Proposed CNN reranker uses two-dimensional filters which operate on complete words rather than segments of words."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-150",
"text": "This is more intuitive and meaningful."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-151",
"text": "Also, no padding is required."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-152",
"text": "The feature vector from proposed CNN reranker is v-dimensional whereas CNN reranker by Wen et al. (2015a) outputs longer feature vectors having dimension equal to v * m, where v = total number of filters and m = embedding size."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-153",
"text": "Thus, proposed CNN reranker requires lesser number of computations."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-155",
"text": "**EXPERIMENTAL STUDIES**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-156",
"text": "The studies in this work are performed on Alex Context natural language generation (NLG) dataset (Du\u0161ek and Jurcicek, 2016a )."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-157",
"text": "This dataset is intended for fully trainable NLG systems in task-oriented spoken dialogue systems (SDS)."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-158",
"text": "It is in the domain of public transport information and has four dialogue act (DA) types namely request, inform, iconfirm and inform no match."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-159",
"text": "It contains 1859 data instances each having 3 target responses."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-160",
"text": "Each data instance consists of a preceding context (user utterance), source meaning representation and target natural language responses/sentences."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-161",
"text": "Data is delexicalized and split into training, validation and test sets as done by Du\u0161ek and Jurcicek (2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-162",
"text": "For training and validation, the three paraphrases are used as separate instances."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-163",
"text": "For evaluation they are used as three target references."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-164",
"text": "Input to our ConvSeq2Seq generator is a DA prepended with user utterance."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-165",
"text": "This allows entrainment of the model to the user utterances."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-166",
"text": "A single dictionary is used for context utterances and DA tokens."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-190",
"text": "Both the models have been evaluated on five different metrics, with NIST and BLEU scores being of atmost importance."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-167",
"text": "Our model is trained by minimizing cross-entropy error using Nesterov Accelerated Gradient (NAG) optimizer (Nesterov, 1983) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-168",
"text": "The hyper-parameters are chosen by crossvalidation method."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-169",
"text": "Based on our experiments on validation set, we use maximum sentences per batch 20, learning rate 0.07, minimum learning rate 0.00001, maximum number of epochs 2000, learning rate shrink factor 0.5, clip-norm 0.5, encoder embedding dimension 100, decoder embedding dimension 100, decoder output embedding dimension 100 and dropout 0.3."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-170",
"text": "Encoder part includes 10 layers/blocks, each having 100 units and kernel width of 7."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-171",
"text": "Decoder part includes 10 layers, each having 100 units and kernel width of 7."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-172",
"text": "For generating outputs on test set, we choose batch size 128 and beam size 20."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-173",
"text": "For our CNN reranker, all the possible combinations of DA tokens and its values are considered as classes."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-174",
"text": "We have 19 such classes."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-175",
"text": "Each input is a natural language sentence and each output is a set of class labels."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-176",
"text": "Training is done by minimizing cross-entropy loss using Adam optimizer (Kingma and Ba, 2015) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-177",
"text": "Cross-entropy error is measured on validation set after every 100 steps."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-178",
"text": "Misclassification penalty for CNN reranker is set to 100."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-179",
"text": "Based on our experiments, we choose embedding dimension 128, filter sizes (3, 5, 7, 9) , number of filters 64, dropout keep probability 0.5, batch size 100, number of epochs 100 and L2 regularization, \u03bb=0.05."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-180",
"text": "The performance of the proposed ConvSeq2Seq model for NLG is compared with that of TGen model (Du\u0161ek and Jurcicek, 2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-181",
"text": "For comparison, we have considered NIST (Doddington, 2002) , BLEU (Papineni et al., 2002) , METEOR (Denkowski and Lavie, 2014) , ROUGE L (Lin, 2004) and CIDEr metrics (Vedantam et al., 2015) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-182",
"text": "For this study, we have considered script \"mtevalv13a-sig.pl\" (version 13a) that implements these metrics."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-183",
"text": "This script was used for E2E NLG challenge (Novikova et al., 2017) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-184",
"text": "We focus on the evaluations using this version."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-185",
"text": "Our model has also been evaluated using the metric script \"mtevalv11b.pl\" (version 11b) to compare our results with those stated in (Du\u0161ek and Jurcicek, 2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-186",
"text": "The 13a version takes into account the closest reference length with respect to candidate length for calculation of brevity penalty."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-187",
"text": "This is in accordance with IBM BLEU."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-188",
"text": "On the contrary, 11b version takes shortest reference length for measuring brevity penalty."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-189",
"text": "This is the reason behind higher BLEU scores in the 11b version when compared to 13a version."
},
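The difference between the two brevity-penalty conventions can be made concrete (the penalty formula is the standard BLEU one; the reference lengths here are made up for illustration):

```python
import math

def brevity_penalty(cand_len, ref_lens, version="13a"):
    """BLEU brevity penalty under the two mteval conventions discussed here:
    v13a uses the reference length closest to the candidate length
    (IBM BLEU), v11b uses the shortest reference length."""
    if version == "13a":
        r = min(ref_lens, key=lambda L: (abs(L - cand_len), L))
    else:  # "11b"
        r = min(ref_lens)
    return 1.0 if cand_len >= r else math.exp(1.0 - r / cand_len)

# Candidate of length 11 against references of lengths 8 and 12:
# v11b compares against 8 (no penalty), v13a against 12 (penalized),
# illustrating why 11b tends to yield higher BLEU scores.
refs = [8, 12]
```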
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-214",
"text": "Studies also show that CNN reranker outperforms the RNN reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-191",
"text": "We have used N-gram match reranker with the weight \u03c9 set to 1 based on experiments done on validation set."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-192",
"text": "When using 11b version for evaluating automatic metrics, weight \u03c9 is set to 5."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-193",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-194",
"text": "**STUDIES OF THE MODELS USING 13A VERSION OF THE METRICS**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-195",
"text": "The comparison of the performance of the proposed model with that of TGen model using the 13a version of the metric implementation is given in Table 1 ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-196",
"text": "It is seen from Table 1 that there is a slight improvement in the scores of our ConvSeq2Seq generator after using CNN reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-197",
"text": "However, scores improve significantly when Ngram match reranker is used in addition to CNN reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-198",
"text": "An improvement of 3.32 BLEU points is seen."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-199",
"text": "The best scores are obtained when \u03c9 is set to 1 for N-gram match reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-200",
"text": "ConvSeq2Seq model in combination with CNN reranker and N-gram match reranker outperforms TGen model with N-gram match reranker in all the metrics, with a difference of 0.65 in terms of NIST score which is 8% more than the TGen NIST score on this setup."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-201",
"text": "ConvSeq2Seq model with CNN reranker outperforms TGen model with RNN reranker in all the metrics, with a difference of 1.8 in terms of NIST score which is 27% more than the TGen NIST score on this setup."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-202",
"text": "In the domain of NLG, NIST score is found to have highest correlation with human based judgments when compared to other metrics (Belz and Reiter, 2006) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-203",
"text": "In Table 1 , the bold numbers indicate the best scores."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-204",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-205",
"text": "**STUDIES OF THE MODELS USING 11B VERSION OF THE METRICS**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-206",
"text": "The comparison of the performance of the proposed model with that of TGen model using the 11b version of the metric implementation is given in Table 2 ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-207",
"text": "A slight improvement in the scores of our ConvSeq2Seq generator after using CNN reranker is seen in Table 2 except for BLEU score."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-208",
"text": "We see an improvement of 6.7 BLEU points when using N-gram match reranker with \u03c9 set to 5."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-209",
"text": "A decrease in scores of other metrics is seen."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-210",
"text": "These inconsistencies are due to the way brevity penalty is calculated for computing BLEU scores in 11b version of metric implementation."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-211",
"text": "BLEU and NIST scores of the TGen model given in Table 2 match with that represented in (Du\u0161ek and Jurcicek, 2016a) ."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-212",
"text": "The scores of our model shows slight improvement over TGen model."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-213",
"text": "The studies done to compare the proposed model with the TGen model, show the effectiveness of considering the CNN-based approach to NLG."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-215",
"text": "Further, CNN-based model is expected to take less time to train when compared to RNN-based model."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-216",
"text": "We compare the time taken by the models in the next section."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-218",
"text": "**STUDIES ON THE MODELS BASED ON TRAINING TIME**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-219",
"text": "In this section, we compare the proposed model with that of TGen based on time taken for training."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-220",
"text": "All the experiments were performed on 8GB Nvidia GeForce GTX 1080 GPU."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-221",
"text": "The time taken for training ConvSeq2Seq generator is approximately 4 minutes."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-222",
"text": "The time taken for training CNN reranker is approximately 2 minutes."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-223",
"text": "The time taken for training TGen model is approximately 128 minutes which is 21 times more than our ConvSeq2Seq generator in combination with CNN reranker."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-224",
"text": "This shows the effectiveness of using convolutional neural network in building a model for NLG than using recurrent neural network based approach used in TGen."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-225",
"text": "----------------------------------"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-226",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-227",
"text": "In this paper, a novel approach to natural language generation (NLG) using convolutional sequence-to-sequence learning is proposed."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-228",
"text": "The convolutional model for NLG is found to capture dependencies between words better than recurrent neural network (RNN) based sequence-to-sequence learning."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-229",
"text": "It is also seen that the convolutional approach makes efficient use of computational resources."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-230",
"text": "The proposed model in combination with CNN reranker and N-gram match reranker is capable of entraining to users' way of speaking."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-231",
"text": "Studies conducted on a standard dataset show the effectiveness of the proposed approach, which outperforms the conventional RNN-based approach."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-232",
"text": "In future work, we propose to perform human evaluations to further support the reported performance of the model."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-4"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-19"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-70"
]
],
"cite_sentences": [
"1fd85a350d9ec7ac12151cfe4412e4-C001-4",
"1fd85a350d9ec7ac12151cfe4412e4-C001-19",
"1fd85a350d9ec7ac12151cfe4412e4-C001-70"
]
},
"@USE@": {
"gold_contexts": [
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-35"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-64"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-70"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-81"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-134"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-138"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-156"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-161"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-180"
],
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-185"
]
],
"cite_sentences": [
"1fd85a350d9ec7ac12151cfe4412e4-C001-35",
"1fd85a350d9ec7ac12151cfe4412e4-C001-64",
"1fd85a350d9ec7ac12151cfe4412e4-C001-70",
"1fd85a350d9ec7ac12151cfe4412e4-C001-81",
"1fd85a350d9ec7ac12151cfe4412e4-C001-134",
"1fd85a350d9ec7ac12151cfe4412e4-C001-138",
"1fd85a350d9ec7ac12151cfe4412e4-C001-156",
"1fd85a350d9ec7ac12151cfe4412e4-C001-161",
"1fd85a350d9ec7ac12151cfe4412e4-C001-180",
"1fd85a350d9ec7ac12151cfe4412e4-C001-185"
]
},
"@DIF@": {
"gold_contexts": [
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-35"
]
],
"cite_sentences": [
"1fd85a350d9ec7ac12151cfe4412e4-C001-35"
]
},
"@SIM@": {
"gold_contexts": [
[
"1fd85a350d9ec7ac12151cfe4412e4-C001-211"
]
],
"cite_sentences": [
"1fd85a350d9ec7ac12151cfe4412e4-C001-211"
]
}
}
},
"ABC_6d5a52c29e4f91bc17502e250c9187_8": {
"x": [
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-2",
"text": "State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-3",
"text": "The sequential nature of this generation process causes fundamental latency in inference since we cannot generate multiple tokens in each sentence in parallel."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-4",
"text": "We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-5",
"text": "The DisCo transformer is trained to predict every output token given an arbitrary subset of the other reference tokens."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-6",
"text": "We also develop the parallel easy-first inference algorithm, which iteratively refines every token in parallel and reduces the number of required iterations."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-7",
"text": "Our extensive experiments on 7 directions with varying data sizes demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-10",
"text": "State-of-the-art neural machine translation systems use autoregressive decoding where words are predicted one-by-one conditioned on all previous words (Bahdanau et al., 2015; Vaswani et al., 2017)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-11",
"text": "Non-autoregressive machine translation (NAT, Gu et al. (2018) ), on the other hand, generates all words in one shot and speeds up decoding at the expense of performance drop."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-12",
"text": "Parallel decoding results in conditional independence and prevents the model from properly capturing highly multimodal distribution of target translations (Gu et al., 2018) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-13",
"text": "One way to remedy this fundamental problem is to refine model output iteratively (Lee et al., 2018; Ghazvininejad et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-14",
"text": "This work pursues this iterative approach to non-autoregressive translation."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-15",
"text": "In this work, we propose a transformer-based architecture with attention masking, which we call Disentangled Context (DisCo) transformer, and use it for non-autoregressive decoding."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-16",
"text": "Specifically, our DisCo transformer predicts every word in a sentence conditioned on an arbitrary subset of the rest of the words."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-17",
"text": "Unlike the masked language models (Devlin et al., 2019; Ghazvininejad et al., 2019) where the model only predicts the masked words, the DisCo transformer can predict all words simultaneously, leading to faster inference as well as a substantial performance gain when training data are relatively large."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-18",
"text": "We also introduce a new inference algorithm for iterative parallel decoding, parallel easy-first, where each word is predicted by attending to the words that the model is more confident about."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-19",
"text": "This decoding algorithm allows for predicting all tokens with different context in each iteration and terminates when the output prediction converges, contrasting with the constant number of iterations (Ghazvininejad et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-20",
"text": "Indeed, we will show in a later section that this method substantially reduces the number of required iterations without loss in performance."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-21",
"text": "Our extensive empirical evaluations on 7 translation directions from standard WMT benchmarks show that our approach achieves competitive performance to state-of-the-art non-autoregressive and autoregressive machine translation while significantly reducing decoding time on average."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-22",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-23",
"text": "**DISCO TRANSFORMER**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-24",
"text": "In this section, we introduce our DisCo transformer for nonautoregressive translation (Fig. 1 )."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-25",
"text": "We propose a DisCo objective as an efficient alternative to masked language modeling and design an architecture that can compute the objective in a single pass."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-27",
"text": "**DISCO OBJECTIVE**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-28",
"text": "Similar to masked language models for contextual word representations (Devlin et al., 2019; Liu et al., 2019), a conditional masked language model (CMLM, Ghazvininejad et al. (2019)) predicts randomly masked target tokens Y_mask given a source text X and the rest of the target tokens Y_obs."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-29",
"text": "Namely, for every sentence pair X and Y in the bitext, we predict P(Y_mask | X, Y_obs) with Y_mask, Y_obs = RS(Y),"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-30",
"text": "where RS denotes random sampling of masked tokens."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-31",
"text": "CMLMs have proven successful in parallel decoding for machine translation (Ghazvininejad et al., 2019), video captioning (Yang et al., 2019a), and speech recognition (Nakayama et al., 2019)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-32",
"text": "However, the fundamental inefficiency with this masked language modeling objective is that the model can only be trained to predict a subset of the reference tokens (Y mask ) for each network pass unlike a normal autoregressive model where we predict all Y from left to right."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-33",
"text": "To address this limitation, we propose a Disentangled Context (DisCo) objective."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-34",
"text": "The objective involves prediction of every token given an arbitrary (thus disentangled) subset of the other tokens."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-35",
"text": "For every 1 \u2264 n \u2264 N where |Y| = N, we predict P(Y_n | X, Y^n_obs), where Y^n_obs is an arbitrary subset of Y \\ Y_n."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-37",
"text": "**DISCO TRANSFORMER ARCHITECTURE**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-38",
"text": "Simply computing the conditional probabilities P(Y_n | X, Y^n_obs) with a vanilla transformer decoder would necessitate N separate transformer passes, one for each Y^n_obs."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-39",
"text": "(Footnote: BERT (Devlin et al., 2019) masks a token with probability 0.15 while CMLMs (Ghazvininejad et al., 2019) sample the number of masked tokens uniformly from [1, N].)"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-40",
"text": "We introduce the DisCo transformer to compute these N contexts in one shot:"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-41",
"text": "In particular, our DisCo transformer makes crucial use of attention masking to achieve this computational efficiency."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-42",
"text": "Denote the input word and positional embeddings at position n by w_n and p_n."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-43",
"text": "For each position n in Y, the vanilla transformer computes self-attention: k_n, v_n, q_n = Proj(w_n + p_n); h_n = Attention(K, V, q_n)"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-44",
"text": "where K and V denote concatenated matrices of k_n and v_n for 1 \u2264 n \u2264 N."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-45",
"text": "We modify this attention computation in two aspects."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-46",
"text": "First, we separate query input from key and value input to avoid feeding the token we predict."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-47",
"text": "Then we only attend to keys and values that correspond to observed tokens (K n obs , V n obs ) and mask out the connection to the other tokens (Y n mask and Y n itself, dashed lines in Fig. 1 )."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-48",
"text": "k_n, v_n = Proj(w_n + p_n); q_n = Proj(p_n); h_n = Attention(K^n_obs, V^n_obs, q_n)"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-50",
"text": "**STACKED DISCO TRANSFORMER**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-51",
"text": "Unfortunately, stacking DisCo transformer layers is not straightforward."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-52",
"text": "Suppose that we compute the nth position in the jth layer from the previous layer's output as follows:"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-53",
"text": "k^j_n, v^j_n = Proj(w_n + h^{j-1}_n); q^j_n = Proj(h^{j-1}_n); h^j_n = Attention(K^{n,j}_obs, V^{n,j}_obs, q^j_n). In this case, however, any cyclic relation between positions will cause information leakage."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-54",
"text": "Concretely, assume that Y = [A, B] and N = 2."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-55",
"text": "Suppose also that Y^1_obs = B and Y^2_obs = A, so there is a cycle: position 1 can see B and position 2 can see A. Then the output state at position 1 in the first layer, h^1_1, becomes a function of B: h^1_1(B) = Attention(k^1_2(B), v^1_2(B), q^1_1). Since position 2 can see position 1, the output state at position 2 in the second layer, h^2_2, is computed from h^1_1(B)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-56",
"text": "But h^2_2 will be used to predict the token at position 2, i.e. B, and this will clearly make the prediction problem degenerate."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-57",
"text": "To avoid this cyclic leakage, we make keys and values independent of the previous layer's output: k^j_n, v^j_n = Proj(w_n + p_n); q^j_n = Proj(h^{j-1}_n); h^j_n = Attention(K^{n,j}_obs, V^{n,j}_obs, q^j_n). In other words, we decontextualize keys and values in stacked DisCo transformer layers."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-59",
"text": "**TRAINING LOSS**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-60",
"text": "We use a standard transformer as an encoder and stacked DisCo layers as a decoder."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-61",
"text": "For each Y_n in Y where |Y| = N, we uniformly sample the number of visible tokens from [0, N \u2212 1], and then randomly choose that many tokens from Y \\ Y_n as Y^n_obs, similarly to CMLMs (Ghazvininejad et al., 2019)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-62",
"text": "We optimize the negative log likelihood loss from P(Y_n | X, Y^n_obs) for 1 \u2264 n \u2264 N."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-63",
"text": "Again following CMLMs, we append a special token to the encoder and project the vector to predict the target length for parallel decoding."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-64",
"text": "We add the negative log likelihood loss from this length prediction to the loss from word predictions."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-66",
"text": "**DISCO OBJECTIVE AS GENERALIZATION**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-67",
"text": "We designed the DisCo transformer to compute conditional probabilities at every position efficiently, but here we note that the DisCo transformer can be readily used with other training schemes in the literature."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-68",
"text": "We can train an autoregressive DisCo transformer by always setting Y^n_obs = Y_{<n}. Mask-predict was proposed by Ghazvininejad et al. (2019) to decode a conditional masked language model (CMLM)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-85",
"text": "The target length N is first predicted, and then the algorithm iterates over two steps: mask, where the i_t tokens with the lowest probabilities are masked, and predict, where those masked tokens are updated given the other N \u2212 i_t tokens."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-86",
"text": "The number of masked tokens i_t decays from N at a constant rate over a fixed number of iterations T."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-87",
"text": "Specifically, at iteration t, i_t = N \u00b7 (T \u2212 t)/T."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-88",
"text": "This method is directly applicable to our DisCo transformer by fixing Y^{n,t}_obs regardless of the position n."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-90",
"text": "**PARALLEL EASY-FIRST**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-91",
"text": "An advantage of the DisCo transformer over a CMLM is that we can predict tokens at all positions simultaneously, each conditioned on a different context."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-92",
"text": "The mask-predict inference can only update masked tokens given the fixed observed tokens Y^t_obs, meaning that we waste the opportunity to improve upon Y^t_obs and to take advantage of the broader context present in Y^t_mask."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-93",
"text": "We develop an algorithm, parallel easy-first, which makes predictions in all positions, thereby benefiting from this property."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-94",
"text": "Concretely, in the first iteration, we predict all tokens in parallel given source sentence:"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-95",
"text": "Then, we get the easy-first order z, where z(i) denotes the rank of p_i in descending order."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-96",
"text": "At iteration t > 1, we update predictions for all positions by Y^t_n, p^t_n = (arg)max_w P(y_n = w | X, Y^{n,t}_obs), where Y^{n,t}_obs consists of the previous predictions at positions easier than n, i.e. {Y^{t-1}_i : z(i) < z(n)}."
},
{
"sent_id": "1fd85a350d9ec7ac12151cfe4412e4-C001-97",
"text": "Namely, we update each position given previous predictions on the easier positions."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-98",
"text": "In a later section, we will explore several variants of choosing Y^{n,t}_obs and show that this easy-first strategy performs best despite its simplicity."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-100",
"text": "**LENGTH BEAM**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-101",
"text": "Following Ghazvininejad et al. (2019) , we apply length beam."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-102",
"text": "In particular, we predict the top K lengths from the length prediction distribution and run parallel easy-first on all of them simultaneously."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-103",
"text": "In order to speed up decoding, we terminate when the candidate with the highest average log score, \u03a3_{n=1}^{N} log(p^t_n)/N, converges."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-104",
"text": "It should be noted that for parallel easy-first, the output converges once Y^t = Y^{t-1}, since this implies"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-105",
"text": "Y^{n,t}_obs stays fixed for all positions n, while mask-predict may keep updating tokens even afterwards because Y^t_obs changes over iterations."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-106",
"text": "See Alg. 1 for full pseudo-code."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-108",
"text": "Notice that all for-loops are parallelizable except the one over iterations t. In the subsequent experiments, we use length beam size of 5 (Ghazvininejad et al., 2019) unless otherwise noted."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-109",
"text": "Algorithm 1 (Parallel Easy-First with Length Beam) takes a source sentence X, predicted lengths N_1, ..., N_K, and a max number of iterations T. For each k in {1, ..., K} and each n in {1, ..., N_k}, it computes Y^{1,k}_n, p^k_n = (arg)max_w P(y_n = w | X), then obtains the easy-first order z^k by sorting p^k, with z^k(i) the rank of the ith position. In Sec. 5.2, we"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-111",
"text": "will illustrate that length beam facilitates decoding both the CMLM and DisCo transformer."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-113",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-114",
"text": "We conduct extensive experiments on standard machine translation benchmarks."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-115",
"text": "We demonstrate that our DisCo transformer with the parallel easy-first inference achieves comparable performance to, if not better than, prior work on non-autoregressive machine translation with substantial reduction in the number of sequential steps of transformer computation."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-116",
"text": "We also find that our DisCo transformer achieves more pronounced improvement when bitext training data are large, getting close to the performance of autoregressive models."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-118",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-119",
"text": "Benchmark datasets We evaluate on 7 directions from four standard datasets with various training data sizes: WMT'14 EN-DE (4.5M pairs), WMT'16 EN-RO (610K pairs), WMT'17 EN-ZH (20M pairs), and WMT'14 EN-FR (36M pairs, en\u2192fr only)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-120",
"text": "These datasets are all encoded into subword units by BPE (Sennrich et al., 2016) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-121",
"text": "We use the same preprocessed data and train/dev/test splits as prior work for fair comparisons (EN-DE: Vaswani et al. (2017);"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-122",
"text": "Ott et al. (2018)). (Footnote: We run joint BPE on all language pairs except EN-ZH.)"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-123",
"text": "We evaluate performance with BLEU scores (Papineni et al., 2002) for all directions except that we use SacreBLEU (Post, 2018) 5 in en\u2192zh again for fair comparison with prior work (Ghazvininejad et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-124",
"text": "For all autoregressive models, we use beam search with b = 5 (Vaswani et al., 2017; Ott et al., 2018) and tune length penalty of \u03b1 \u2208 [0.0, 0.2, \u00b7 \u00b7 \u00b7 , 2.0] in validation."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-125",
"text": "For parallel easy-first, we set the max number of iterations T = 10 and use T = 4, 10 for constant-time mask-predict."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-127",
"text": "**BASELINES AND COMPARISON**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-128",
"text": "There has been a flurry of recent work on non-autoregressive machine translation (NAT) that finds a balance between parallelism and performance."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-129",
"text": "Performance can be measured using automatic evaluation such as BLEU scores (Papineni et al., 2002) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-130",
"text": "Latency is, however, challenging to compare across different methods."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-131",
"text": "For models that have an autoregressive component (e.g. Kaiser et al. (2018) ; Ran et al. (2019)), we can speed up sequential computation by caching states."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-132",
"text": "Further, many prior NAT approaches generate varying numbers of translation candidates and rescore them using an autoregressive model."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-133",
"text": "The rescoring process typically incurs the overhead of one parallel pass of a transformer encoder followed by a decoder."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-134",
"text": "Given this complexity in latency comparison, we highlight two state-of-the-art iteration-based NAT models whose latency is comparable to our DisCo transformer due to the similar model structure."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-135",
"text": "See Sec. 6 for descriptions of more work on NAT."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-136",
"text": "CMLM As discussed earlier, we can generate a translation with mask-predict from a CMLM (Ghazvininejad et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-137",
"text": "We can directly compare our DisCo transformer with this method by the number of iterations required."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-138",
"text": "We provide results obtained by running their code."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-140",
"text": "Levenshtein Transformer Levenshtein transformer (LevT) is a transformer-based iterative model for parallel sequence generation (Gu et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-141",
"text": "Its iteration consists of three sequential steps: deletion, placeholder prediction, and token prediction."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-142",
"text": "Unlike the CMLM with the constant-time mask-predict inference, decoding in LevT terminates adaptively under a certain condition."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-143",
"text": "Its latency can be roughly compared via the average number of sequential transformer runs."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-144",
"text": "Each iteration consists of three transformer runs except that the first iteration skips the deletion step."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-145",
"text": "See Gu et al. (2019) for further details (cf. Luong et al., 2015; Vaswani et al., 2017)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-146",
"text": "Unfortunately, we lack consensus in evaluation (Post, 2018) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-147",
"text": "Hyperparameters We generally follow the hyperparameters for a transformer base (Vaswani et al., 2017; Ghazvininejad et al., 2019) : 6 layers for both the encoder and decoder, 8 attention heads, 512 model dimensions, and 2048 hidden dimensions."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-148",
"text": "We sample weights from N (0, 0.02), initialize biases to zero, and set layer normalization parameters to \u03b2 = 0, \u03b3 = 1 (Devlin et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-149",
"text": "For regularization, we tune the dropout rate over [0.1, 0.2, 0.3] based on dev performance in each direction, and use 0.01 L2 weight decay and label smoothing with \u03b5 = 0.1."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-150",
"text": "We train batches of 128K tokens using Adam (Kingma & Ba, 2015) with \u03b2 = (0.9, 0.999) and \u03b5 = 10^{-6}."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-151",
"text": "The learning rate warms up to 5 \u00b7 10^{-4} over the first 10K steps, and then decays with the inverse square-root schedule."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-152",
"text": "We train all models for 300K steps apart from en\u2192fr where we make 500K steps to account for the data size."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-153",
"text": "We measure the dev BLEU score at the end of each epoch to avoid stochasticity, and average the 5 best checkpoints to obtain the final model."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-154",
"text": "We use 16 Tesla V100 GPUs, accelerate training with mixed precision floating point (Micikevicius et al., 2018), and implement all models in fairseq (Gehring et al., 2017)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-155",
"text": "We will release our code for easy replication."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-156",
"text": "Distillation Similar to previous work on nonautoregressive translation (e.g. Gu et al. (2018) ; Lee et al. (2018)), we apply sequence-level knowledge distillation (Kim & Rush, 2016 ) by training every model in all directions on translations produced by a standard left-to-right transformer model (transformer large for EN-DE, EN-ZH, EN-FR and base for EN-RO)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-157",
"text": "We also present results obtained from training a standard autoregressive base transformer on the same distillation data for comparison."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-158",
"text": "We assess the impact of distillation in Sec. 5.1 and demonstrate that distillation is still a key component in our non-autoregressive models."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-160",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-161",
"text": "Seen in Table 1 are the results in the four directions from the WMT'14 EN-DE and WMT'16 EN-RO datasets."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-162",
"text": "First, our re-implementations of CMLM + Mask-Predict outperform Ghazvininejad et al. (2019) (e.g. 31.24 vs. 30.53 in de\u2192en with 10 steps)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-163",
"text": "This is probably due to our tuning of the dropout rate and weight averaging of the 5 best checkpoints based on validation BLEU performance (Sec. 4.1)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-164",
"text": "Our DisCo transformer with the parallel easy-first inference achieves at least comparable performance to the CMLM with 10 steps despite the significantly fewer steps on average (e.g. 4.82 steps in en\u2192de)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-165",
"text": "The one exception is ro\u2192en (33.25 vs. 33.67), but DisCo + Easy-First requires only 3.10 steps, and CMLM + Mask-Predict with 4 steps achieves similar performance of 33.27."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-166",
"text": "The limited advantage of our DisCo transformer on the EN-RO dataset suggests that we benefit less from the training efficiency of the DisCo transformer on the small dataset (610K sentence pairs)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-167",
"text": "DisCo + Mask-Predict generally underperforms DisCo + Easy-First, implying that the mask-predict inference, which fixes Y^n_obs across all positions n, fails to utilize the flexibility of the DisCo transformer."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-168",
"text": "DisCo + Easy-First also accomplishes significant reduction in the average number of steps as compared to the adaptive decoding in LevT (Gu et al., 2019) while performing competitively."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-169",
"text": "As discussed earlier, each iteration in inference on LevT involves three sequential transformer runs, which undermine the latency improvement."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-170",
"text": "Overall, we outperform other NAT models from prior work."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-171",
"text": "We achieve performance competitive with the standard autoregressive models of the same transformer base configuration on the EN-DE dataset, except that the autoregressive model with distillation performs comparably to the transformer large teacher in en\u2192de (28.24 vs. 28.60)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-172",
"text": "Nonetheless, we still see a large gap between the autoregressive teachers and our NAT results in both directions from EN-RO, illustrating a limitation of our remedy for the trade-off between decoding parallelism and performance."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-174",
"text": "**DECODING SPEED**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-175",
"text": "We saw that the DisCo transformer with the parallel easy-first inference achieves competitive performance to the CMLM while reducing the number of iterations."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-176",
"text": "Here we compare them in terms of the wall-time speedup with respect to the standard autoregressive model of the same base configuration (Fig. 2) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-177",
"text": "For each decoding run, we feed one sentence at a time and measure the wall time from when the model is loaded until the last sentence is translated, following the setting in Gu et al. (2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-178",
"text": "All models are implemented in fairseq (Gehring et al., 2017) and run on a single Nvidia V100 GPU."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-179",
"text": "We can confirm that the average number of iterations directly translates to decoding time; the average number of iterations of the DisCo transformer with T = 10 was 5.44, and the measured speedup lies between that of the CMLM with T = 5 and T = 6."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-180",
"text": "Note that fairseq implements efficient decoding of autoregressive models by caching hidden states."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-181",
"text": "The average length of generated sentences from the autoregressive model was 25.16 (4.6x the 5.44 steps of DisCo), but we only gained a threefold speedup from DisCo."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-183",
"text": "**ANALYSIS AND ABLATIONS**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-184",
"text": "In this section, we give an extensive analysis of our approach along the training and inference dimensions."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-185",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-186",
"text": "**TRAINING**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-187",
"text": "Distillation We assess the effects of knowledge distillation across different models and inference configurations (Table 4 )."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-188",
"text": "Consistent with previous models (Gu et al., 2018; Zhou et al., 2020), we find that distillation benefits all of the non-autoregressive models."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-189",
"text": "Moreover, the DisCo transformer benefits more from distillation compared to the CMLM under the same mask-predict inference."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-190",
"text": "This is in line with Zhou et al. (2020), who showed a correlation between model capacity and distillation data complexity."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-191",
"text": "The DisCo transformer uses contextless keys and values, resulting in reduced capacity."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-192",
"text": "Autoregressive translation also improves with distillation from a large transformer, but the difference is relatively small."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-193",
"text": "Finally, we can observe that the gain from distillation decreases as we incorporate more global information in inference (more iterations in NAT cases and larger beam size in AT cases)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-194",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-195",
"text": "**AT WITH CONTEXTLESS KVS**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-196",
"text": "We saw that a decoder with contextless keys and values can still retain performance in non-autoregressive models."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-197",
"text": "Here we use a decoder with contextless keys and values in autoregressive models."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-198",
"text": "The results (Table 5) show that it is able to retain performance even in autoregressive models regardless of distillation, suggesting further potential of our approach."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-199",
"text": "Easy-First Training So far we have trained our models to predict every word given a random subset of the other words. But this training scheme yields a gap between training and inference, which might harm the model."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-200",
"text": "We attempt to make training closer to inference by training the DisCo transformer in the easy-first order."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-201",
"text": "Similarly to the inference, we first predict the easy-first order by estimating P(Y^n | X) for all n; we then use that order to determine Y^n_{obs}."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-202",
"text": "The overall loss is the sum of the negative log-likelihoods of these two steps."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-203",
"text": "Seen in Table 6 are the results on the dev sets of en\u2192de and ro\u2192en."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-204",
"text": "In both directions, this easy-first training does not improve performance, suggesting that randomness helps the model."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-205",
"text": "Notice also that the average number of iterations in inference decreases (4.03 vs. 4.29, 2.94 vs. 3.17) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-206",
"text": "The model gets trapped in a sub-optimal solution with reduced iterations due to lack of exploration."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-207",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-208",
"text": "**INFERENCE**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-209",
"text": "Alternative Inference Algorithms Here we compare various decoding strategies on the DisCo transformer (Table 7) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-210",
"text": "Recall that in the parallel easy-first inference (Sec. 3.2), we find the easy-first order by sorting the probabilities in the first iteration and compute each position's probability conditioned on the easier positions from the previous iteration."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-211",
"text": "We evaluate two alternative orderings: left-to-right and right-to-left."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-212",
"text": "We see that both of them yield much degraded performance."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-213",
"text": "We also attempt to use even broader context than parallel easy-first by computing the probability at each position based on all other positions (all-but-itself, Y^{n,t}_{obs} = Y^{t\u22121}_{\u2260n})."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-214",
"text": "We again see degraded performance, suggesting that cyclic dependency (e.g. Y^{t\u22121}_m \u2208 Y^{n,t}_{obs} and Y^{t\u22121}_n \u2208 Y^{m,t}_{obs}) breaks consistency."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-215",
"text": "Fig. 3 shows a translation example in de\u2192en when decoding the same DisCo transformer with the mask-predict or parallel easy-first inference."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-216",
"text": "In both algorithms, iterative refinement resolves structural inconsistency, such as repetition."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-217",
"text": "Parallel easy-first succeeds in incorporating more context in early stages, whereas mask-predict continues to produce inconsistent predictions (\"my my activities\") until more context is available later, resulting in one additional iteration to land on a consistent output."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-218",
"text": "Length Beam Fig. 4 shows the performance of the CMLM and the DisCo transformer with varying sizes of length beam."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-219",
"text": "All cases benefit from multiple candidates with different lengths to a certain point, but DisCo + Easy-First improves most."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-220",
"text": "This can be because parallel easy-first relies on the easy-first order as well as the length, and the length beam provides an opportunity to try multiple orderings."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-222",
"text": "**EXAMPLE TRANSLATION SEEN IN**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-223",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-224",
"text": "**RELATED AND FUTURE WORK**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-225",
"text": "Recent work on non-autoregressive translation developed ways to mitigate the trade-off between decoding parallelism and performance."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-226",
"text": "As in this work, several prior works proposed methods to iteratively refine output predictions (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Mansimov et al., 2019)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-227",
"text": "Other approaches include adding a lite autoregressive module to parallel decoding (Kaiser et al., 2018; Sun et al., 2019; Ran et al., 2019), partially decoding autoregressively (Stern et al., 2018), rescoring output candidates autoregressively (e.g. Gu et al. (2018)), mimicking hidden states of an autoregressive teacher, training with objectives other than vanilla negative log-likelihood (Libovick\u00fd & Helcl, 2018; Wang et al., 2019; Shao et al., 2020), reordering input sentences (Ran et al., 2019), and modeling with latent variables (Ma et al., 2019; Shu et al., 2020)."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-228",
"text": "While this work adopted iterative decoding methods, our DisCo transformer can be combined with other approaches for efficient training."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-229",
"text": "For example, Li et al. (2019) trained two separate non-autoregressive and autoregressive models, but it is possible to train a single DisCo transformer with both autoregressive and random masking and use hidden states from autoregressive masking as a teacher."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-230",
"text": "We leave the integration of the DisCo transformer with more approaches to non-autoregressive translation for future work."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-231",
"text": "We also note that our DisCo transformer can be used for general-purpose representation learning."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-232",
"text": "In particular, Liu et al. (2019) found that masking different tokens in every epoch outperforms static masking in BERT (Devlin et al., 2019) ."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-233",
"text": "Our DisCo transformer would allow for making a prediction at every position given arbitrary context, providing even more flexibility for large-scale pretraining."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-234",
"text": "----------------------------------"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-235",
"text": "**CONCLUSION**"
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-236",
"text": "We presented the DisCo transformer that predicts every word in a sentence conditioned on an arbitrary subset of the other words."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-237",
"text": "We developed an inference algorithm that takes advantage of this efficiency and further speeds up generation without loss in translation quality."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-238",
"text": "Our results provide further support for the claim that non-autoregressive translation is a fast, viable alternative to autoregressive translation."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-239",
"text": "Nonetheless, a discrepancy still remains between autoregressive and non-autoregressive performance when knowledge distillation from a large transformer is applied to both."
},
{
"sent_id": "6d5a52c29e4f91bc17502e250c9187-C001-240",
"text": "We will explore ways to narrow this gap in the future."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"6d5a52c29e4f91bc17502e250c9187-C001-13"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-17"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-19"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-28"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-31"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-84"
]
],
"cite_sentences": [
"6d5a52c29e4f91bc17502e250c9187-C001-13",
"6d5a52c29e4f91bc17502e250c9187-C001-17",
"6d5a52c29e4f91bc17502e250c9187-C001-19",
"6d5a52c29e4f91bc17502e250c9187-C001-28",
"6d5a52c29e4f91bc17502e250c9187-C001-31",
"6d5a52c29e4f91bc17502e250c9187-C001-84"
]
},
"@MOT@": {
"gold_contexts": [
[
"6d5a52c29e4f91bc17502e250c9187-C001-12",
"6d5a52c29e4f91bc17502e250c9187-C001-13"
]
],
"cite_sentences": [
"6d5a52c29e4f91bc17502e250c9187-C001-13"
]
},
"@DIF@": {
"gold_contexts": [
[
"6d5a52c29e4f91bc17502e250c9187-C001-17"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-19"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-39"
]
],
"cite_sentences": [
"6d5a52c29e4f91bc17502e250c9187-C001-17",
"6d5a52c29e4f91bc17502e250c9187-C001-19",
"6d5a52c29e4f91bc17502e250c9187-C001-39"
]
},
"@SIM@": {
"gold_contexts": [
[
"6d5a52c29e4f91bc17502e250c9187-C001-61"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-226"
]
],
"cite_sentences": [
"6d5a52c29e4f91bc17502e250c9187-C001-61",
"6d5a52c29e4f91bc17502e250c9187-C001-226"
]
},
"@USE@": {
"gold_contexts": [
[
"6d5a52c29e4f91bc17502e250c9187-C001-101"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-108"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-123"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-136"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-147"
],
[
"6d5a52c29e4f91bc17502e250c9187-C001-161",
"6d5a52c29e4f91bc17502e250c9187-C001-162"
]
],
"cite_sentences": [
"6d5a52c29e4f91bc17502e250c9187-C001-101",
"6d5a52c29e4f91bc17502e250c9187-C001-108",
"6d5a52c29e4f91bc17502e250c9187-C001-123",
"6d5a52c29e4f91bc17502e250c9187-C001-136",
"6d5a52c29e4f91bc17502e250c9187-C001-147",
"6d5a52c29e4f91bc17502e250c9187-C001-162"
]
}
}
},
"ABC_b87a8d14f1c2016caa7538aa08a33f_8": {
"x": [
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-80",
"text": "which is the sum over all training example pairs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-56",
"text": "The authors train an SVM with various semantic and structural features."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-30",
"text": "Furthermore, there is a 'head' component that has no outgoing link (the top of the tree)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-31",
"text": "Figure 1 shows an example that we will use throughout the paper to concretely explain how our approach works."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-2",
"text": "One of the major goals in automated argumentation mining is to uncover the argument structure present in argumentative text."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-3",
"text": "In order to determine this structure, one must understand how different individual components of the overall argument are linked."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-4",
"text": "General consensus in this field dictates that the argument components form a hierarchy of persuasion, which manifests itself in a tree structure."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-5",
"text": "This work provides the first neural network-based approach to argumentation mining, focusing on the two tasks of extracting links between argument components, and classifying types of argument components."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-6",
"text": "In order to solve this problem, we propose to use a joint model that is based on a Pointer Network architecture."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-54",
"text": "Recent work in argumentation mining offers data-driven approaches for the task of predicting links between ACs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-55",
"text": "Stab & Gurevych (2014b) approach the task as a binary classification problem."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-29",
"text": "link, but can have numerous incoming links."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-7",
"text": "A Pointer Network is appealing for this task for the following reasons: 1) It takes into account the sequential nature of argument components; 2) By construction, it enforces certain properties of the tree structure present in argument relations; 3) The hidden representations can be applied to auxiliary tasks."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-8",
"text": "In order to extend the contribution of the original Pointer Network model, we construct a joint model that simultaneously attempts to learn the type of argument component, as well as continuing to predict links between argument components."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-9",
"text": "The proposed joint model achieves state-of-the-art results on two separate evaluation corpora, achieving far superior performance than a regular Pointer Network model."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-10",
"text": "Our results show that optimizing for both tasks, and adding a fully-connected layer prior to recurrent neural network input, is crucial for high performance."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-13",
"text": "Computational approaches to argument mining/understanding have become very popular (Persing & Ng, 2016; Cano-Basave & He, 2016; Wei et al., 2016; Ghosh et al., 2016; Palau & Moens, 2009; Habernal & Gurevych, 2016) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-14",
"text": "One important avenue in this work is to understand the structure in argumentative text (Persing & Ng, 2016; Peldszus & Stede, 2015; Stab & Gurevych, 2016; Nguyen & Litman, 2016) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-15",
"text": "One fundamental assumption when working with argumentative text is the presence of Arguments Components (ACs)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-16",
"text": "The types of ACs are generally characterized as a claim or a premise (Govier, 2013) , with premises acting as support (or possibly attack) units for claims."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-17",
"text": "To model more complex structures of arguments, some annotation schemes also include a major claim AC type (Stab & Gurevych, 2016; 2014b) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-18",
"text": "Generally, the task of processing argument structure encapsulates four distinct subtasks: 1) Given a sequence of tokens that represents an entire argumentative text, determine the token subsequences that constitute non-intersecting ACs; 2) Given an AC, determine the type of AC (claim, premise, etc.); 3) Given a set/list of ACs, determine which ACs have a link that determine overall argument structure; 4) Given two linked ACs, determine whether the link is of a supporting or attacking relation."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-19",
"text": "In this work, we focus on subtasks 2 and 3."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-20",
"text": "There are two key assumptions our work makes going forward."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-21",
"text": "First, we assume subtask 1 has been completed, i.e. ACs have already been identified."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-22",
"text": "Second, we follow previous work that assumes a tree structure for the linking of ACs (Palau & Moens, 2009; Cohen, 1987; Peldszus & Stede, 2015; Stab & Gurevych, 2016). Figure 1: An example of argument structure with four ACs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-23",
"text": "The left side shows raw text that has been annotated for the presence of ACs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-24",
"text": "Squiggly and straight underlining means an AC is a claim or premise, respectively."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-25",
"text": "The ACs in the text have also been annotated for links to other ACs, which is shown on the right side of the figure."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-26",
"text": "ACs 3 and 4 are premises that link to another premise, AC2."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-27",
"text": "Finally, AC2 links to a claim, AC1."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-28",
"text": "AC1 therefore acts as the central argumentative component."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-32",
"text": "First, the left side of the figure presents the raw text of a paragraph in a persuasive essay (Stab & Gurevych, 2016) , with the ACs contained in square brackets."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-33",
"text": "Squiggly versus straight underlining differentiates between claims and premises, respectively."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-34",
"text": "The ACs have been annotated as to how the ACs are linked, and the right side of the figure reflects this structure."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-35",
"text": "The argument structure with four ACs forms a tree, where AC2 has two incoming links, and AC1 acts as the head, with no outgoing links."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-36",
"text": "We also specify the type of AC, with the head AC marked as claim and the remaining ACs marked as premise."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-37",
"text": "Lastly, we note that the order of argument components can be a strong indicator of how components should be related."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-38",
"text": "Linking to the first argument component can provide a competitive baseline heuristic (Peldszus & Stede, 2015; Stab & Gurevych, 2016) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-39",
"text": "Given the task at hand, we propose a modification of a Pointer Network (PN) (Vinyals et al., 2015b) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-40",
"text": "A PN is a sequence-to-sequence model that outputs a distribution over the encoding indices at each decoding timestep."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-41",
"text": "The PN is a promising model for link extraction in argumentative text because it inherently possesses three important characteristics: 1) it is able to model the sequential nature of ACs; 2) it constrains ACs to have a single outgoing link, thus partly enforcing the tree structure; 3) the hidden representations learned by the model can be used for jointly predicting multiple subtasks."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-42",
"text": "We also note that since a PN is a type of sequence-to-sequence model, it allows the entire sequence to be seen before making predictions."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-43",
"text": "This is important because if the problem were to be approached as standard sequence modeling (Graves & Schmidhuber, 2009; Robinson, 1994) , making predictions at each forward timestep, it would only allow links to ACs that have already been seen."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-44",
"text": "This is equivalent to only allowing backward links."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-45",
"text": "We note that we do test a simplified model that only uses hidden states from an encoding network to make predictions, as opposed to the sequence-to-sequence architecture present in the PN (see Section 5)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-46",
"text": "PNs were originally proposed to allow a variable length decoding sequence (Vinyals et al., 2015b) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-47",
"text": "Alternatively, the PN we implement differs from the original model in that we decode for the same number of timesteps as there are input components."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-48",
"text": "We also propose a joint PN for both extracting links between ACs and predicting the type of AC."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-49",
"text": "The model uses the hidden representation of ACs produced during the encoding step (see Section 3.4)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-50",
"text": "Aside from the partial assumption of tree structure in the argumentative text, our models do not make any additional assumptions about the AC types or connectivity, unlike the work of Peldszus (2014) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-51",
"text": "We evaluate our models on the corpora of Stab & Gurevych (2016) and Peldszus (2014), and compare our results with those of the aforementioned authors."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-53",
"text": "**RELATED WORK**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-57",
"text": "Peldszus & Stede (2015) have also used classification models for predicting the presence of links."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-58",
"text": "Various authors have also proposed to jointly model link extraction with other subtasks from the argumentation mining pipeline, using either an Integer Linear Programming (ILP) framework (Persing & Ng, 2016; Stab & Gurevych, 2016) or directly feeding previous subtask predictions into another model."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-59",
"text": "The former joint approaches are evaluated on annotated corpora of persuasive essays (Stab & Gurevych, 2014a), and the latter on a corpus of microtexts (Peldszus, 2014)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-60",
"text": "The ILP framework is effective in enforcing a tree structure between ACs when predictions are made from otherwise naive base classifiers."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-61",
"text": "Unrelated to argumentation mining specifically, recurrent neural networks have previously been proposed to model tree/graph structures in a linear manner."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-62",
"text": "Vinyals et al. (2015c) use a sequence-to-sequence model for the task of syntactic parsing."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-63",
"text": "The authors linearize input parse graphs using a depth-first search, allowing them to be consumed as sequences, achieving state-of-the-art results on several syntactic parsing datasets."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-64",
"text": "Bowman et al. (2015) experiment on an artificial entailment dataset that is specifically engineered to capture recursive logic (Bowman et al., 2014) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-65",
"text": "The text is annotated with brackets, in an original attempt to provide easy input into a recursive neural network."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-66",
"text": "However, standard recurrent neural networks can take in complete sentence sequences, brackets included, and perform competitively with a recursive neural network."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-68",
"text": "**POINTER NETWORK FOR LINK EXTRACTION**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-69",
"text": "In this section we will describe how we use a PN for the problem of extracting links between ACs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-70",
"text": "We begin by giving a general description of the PN model."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-72",
"text": "**POINTER NETWORK**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-73",
"text": "A PN is a sequence-to-sequence model with attention (Bahdanau et al., 2014) that was proposed to handle decoding sequences over the encoding inputs, and can be extended to arbitrary sets (Vinyals et al., 2015a) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-74",
"text": "The original motivation for a pointer network was to allow networks to learn solutions to algorithmic problems, such as the traveling salesperson and convex hull, where the solution is a sequence over candidate points."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-75",
"text": "The PN model is trained on input/output sequence pairs (E, D), where E is the source and D is the target (our choice of E, D is meant to represent the encoding and decoding steps of the sequence-to-sequence model)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-76",
"text": "Given model parameters \u0398, we apply the chain rule to determine the probability of a single training example:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-77",
"text": "where the function m signifies that the number of decoding timesteps is a function of each individual training example."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-78",
"text": "We will discuss shortly why we need to modify the original definition of m for our application."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-79",
"text": "By taking the log-likelihood of Equation 1, we arrive at the optimization objective:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-81",
"text": "The PN uses Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) for sequential modeling, which produces a hidden layer h at each encoding/decoding timestep."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-82",
"text": "In practice, the PN has two separate LSTMs, one for encoding and one for decoding."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-83",
"text": "Thus, we refer to encoding hidden layers as e, and decoding hidden layers as d."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-84",
"text": "The PN uses a form of content-based attention (Bahdanau et al., 2014) to allow the model to produce a distribution over input elements."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-85",
"text": "This can also be thought of as a distribution over input indices, wherein a decoding step 'points' to the input."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-86",
"text": "Formally, given encoding hidden states (e 1 , ..., e n ), The model calculates p(D i |D 1 , ..., D i\u22121 , E) as follows:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-87",
"text": "where matrices W 1 , W 2 and vector v are parameters of the model (along with the LSTM parameters used for encoding and decoding)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-88",
"text": "In Equation 3, prior to taking the dot product with v, the resulting transformation can be thought of as creating a joint, hidden representation of inputs i and j. Vector u i in equation 4 is of length n, and index j corresponds to input element j. Therefore, by taking the softmax of u i , we are able to create a distribution over the input."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-90",
"text": "**LINK EXTRACTION AS SEQUENCE MODELING**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-91",
"text": "A given piece of text has a set of ACs, which occur in a specific order in the text, (C 1 , ..., C n )."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-92",
"text": "Therefore, at encoding timestep i, the model is fed a representation of C i ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-93",
"text": "Since the representation is large and sparse (see Section 3.3 for details on how we represent ACs), we add a fully-connected layer before the LSTM input."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-94",
"text": "Given a representation R i for AC C i the LSTM input A i becomes:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-95",
"text": "where W rep , b rep in turn become model parameters, and \u03c3 is the sigmoid function 1 . (similarly, the decoding network applies a fully-connected layer with sigmoid activation to its inputs, see Figure 3 )."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-96",
"text": "At encoding step i, the encoding LSTM produces hidden layer e i , which can be thought of as a hidden representation of AC C i ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-97",
"text": "In order to make the PN applicable to the problem of link extraction, we explicitly set the number of decoding timesteps to be equal to the number of input components."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-98",
"text": "Using notation from Equation 1, the decoding sequence length for an encoding sequence E is simply m(E) = |{C 1 , ..., C n }|, which is trivially equal to n. By constructing the decoding sequence in this manner, we can associate decoding timestep i with AC C i ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-101",
"text": ", decoding timestep D i will output a distribution over input indices."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-102",
"text": "The result of this distribution will indicate to which AC component C i links."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-103",
"text": "Recall there is a possibility that an AC has no outgoing link, such as if it's the root of the tree."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-104",
"text": "In this case, we state that if AC C i does not have an outgoing link, decoding step D i will output index i. Conversely, if D i outputs index j, such that j is not equal to i, this implies that C i has an outgoing link to C j ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-105",
"text": "For the argument structure in Figure 1 , the corresponding decoding sequence is (1, 1, 2, 2)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-106",
"text": "The topology of this decoding sequence is illustrated in Figure 2 ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-107",
"text": "Note how C 1 points to itself since it has no outgoing link."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-108",
"text": "Finally, we note that we modify the PN structure to have a Bidirectional LSTM as the encoder."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-109",
"text": "Thus, e i is the concatenation of forward and backward hidden states \u2212 \u2192 e i and \u2190 \u2212 e n\u2212i+1 , produced by two separate LSTMs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-110",
"text": "The decoder remains a standard forward LSTM."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-111",
"text": "1 We also experimented with relu and elu activations, but found sigmoid to yeild the best performance."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-113",
"text": "**REPRESENTING ARGUMENT COMPONENTS**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-114",
"text": "At each timestep of the decoder, the network takes in the representation of an AC."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-115",
"text": "Each AC is itself a sequence of tokens, similar to the recently proposed Question-Answering dataset (Weston et al., 2015) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-116",
"text": "We follow the work of Stab & Gurevych (2016) and focus on three different types of features to represent our ACs: 1) Bag-of-Words of the AC; 2) Embedding representation based on GloVe embeddings (Pennington et al., 2014) ; 3) Structural features: Whether or not the AC is the first AC in a paragraph, and Whether the AC is in an opening, body, or closing paragraph."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-117",
"text": "See Section 6 for an ablation study of the proposed features."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-119",
"text": "**JOINT NEURAL MODEL**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-120",
"text": "Up to this point, we focused on the task of extracting links between ACs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-121",
"text": "However, recent work has shown that joint models that simultaneously try to complete multiple aspects of the subtask pipeline outperform models that focus on a single subtask (Persing & Ng, 2016; Stab & Gurevych, 2014b; Peldszus & Stede, 2015) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-122",
"text": "Therefore, we will modify the architecture we proposed in Section 3 so that it would allow us to perform AC classification (Kwon et al., 2007; Rooney et al., 2012) together with link prediction."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-123",
"text": "Knowledge of an individual subtask's predictions can aid in other subtasks."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-124",
"text": "For example, claims do not have an outgoing link, so knowing the type of AC can aid in the link prediction task."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-125",
"text": "This can be seen as a way of regularizing the hidden representations from the encoding component (Che et al., 2015) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-126",
"text": "Predicting AC type is a straightforward classification task: given AC C i , we need to predict whether it is a claim or premise."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-127",
"text": "Some annotation schemes also include the class major claim (Stab & Gurevych, 2014a) , which means this can be a multi-class classification task."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-128",
"text": "For encoding timestep i, the model creates hidden representation e i ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-129",
"text": "This can be thought of as a representation of AC C i ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-130",
"text": "Therefore, our joint model will simply pass this representation through a fully connected layer as follows:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-131",
"text": "where W cls , b cls become elements of the model parameters, \u0398. The dimensionality of W cls , b cls is determined by the number of classes."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-132",
"text": "Lastly, we use softmax to form a distribution over the possible classes."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-133",
"text": "Consequently, the probability of predicting component type at timestep i is defined as:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-134",
"text": "p("
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-135",
"text": "Finally, combining this new prediction task with Equation 2, we arrive at the new training objective:"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-136",
"text": "which simply sums the costs of the individual prediction tasks, and the second summation is the cost for the new task of predicting argument component type."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-137",
"text": "\u03b1 \u2208 [0, 1] is a hyperparameter that specifies how we weight the two prediction tasks in our cost function."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-138",
"text": "The architecture of the joint model, applied to our ongoing example, is illustrated in Figure 3 ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-140",
"text": "**EXPERIMENTAL DESIGN**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-141",
"text": "As we have previously mentioned, our work assumes that ACs have already been identified."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-142",
"text": "That is, the token sequence that comprises a given AC is already known."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-143",
"text": "The order of ACs corresponds directly to the order in which the ACs appear in the text."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-144",
"text": "Since ACs are non-overlapping, there is no ambiguity in this ordering."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-145",
"text": "We test the effectiveness of our proposed model on a dataset of persuasive essays (Stab & Gurevych, 2016) , as well as a dataset of microtexts (Peldszus, 2014) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-146",
"text": "The feature space for the persuasive essay corpus has roughly 3,000 dimensions, and the microtext corpus feature space has between 2,500 and 3,000 dimensions, depending on the data split (see below)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-147",
"text": "The persuasive essay corpus contains a total of 402 essays, with a frozen set of 80 essays held out for testing."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-148",
"text": "There are three AC types in this corpus: major claim, claim, and premise."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-149",
"text": "We follow the creators of the corpus and only evaluate ACs within a given paragraph."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-150",
"text": "That is, each training/test example is a sequence of ACs from a paragraph."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-151",
"text": "This results in a 1,405/144 training/test split."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-152",
"text": "The microtext corpus contains 112 short texts."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-153",
"text": "Unlike, the persuasive essay corpus, each text in this corpus is itself a complete example."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-154",
"text": "Since the dataset is small, the authors have created 10 sets of 5-fold cross-validation, reporting the the average across all splits for final model evaluation."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-155",
"text": "This corpus contains only two types of ACs (claim and premise)"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-156",
"text": "The annotation of argument structure of the microtext corpus varies from the persuasive essay corpus; ACs can be linked to other links, as opposed to ACs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-157",
"text": "Therefore, if AC C i is annotated to be linked to link l, we create a link to the source AC of l. On average, this corpus has 5.14 ACs per text."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-158",
"text": "Lastly, we note that predicting the presence of links is directional (ordered): predicting a link between the pair"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-159",
"text": "We implement our models in TensorFlow (Abadi et al., 2015) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-160",
"text": "Our model has the following parameters: hidden input dimension size 512, hidden layer size 256 for the bidirectional LSTMs, hidden layer size 512 for the LSTM decoder, \u03b1 equal to 0.5, and dropout (Srivastava et al., 2014) of 0.9."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-161",
"text": "We believe the need for such high dropout is due to the small amounts of training data (Zarrella & Marsh, 2016) , particularly in the Microtext corpus."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-162",
"text": "All models are trained with Adam optimizer (Kingma & Ba, 2014 ) with a batch size of 16."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-163",
"text": "For a given training set, we randomly select 10% to become the validation set."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-164",
"text": "Training occurs for 4,000 epochs."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-165",
"text": "Once training is completed, we select the model with the highest validation accuracy (on the link prediction task) and evaluate it on the held-out test set."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-166",
"text": "At test time, we take a greedy approach and select the index of the probability distribution (whether link or type prediction) with the highest value."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-168",
"text": "**RESULTS**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-169",
"text": "The results of our experiments are presented in Tables 1 and 2 ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-170",
"text": "For each corpus, we present f1 scores for the AC type classification experiment, with a macro-averaged score of the individual class f1 scores."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-171",
"text": "We also present the f1 scores for predicting the presence/absence of links between ACs, as well as the associated macro-average between these two values."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-172",
"text": "We implement and compare four types of neural models: 1) The previously described PN-based model depicted in Figure 3 (called PN in the tables); 2) The same as 1), but without the fullyconnected input layers; 3) The same as 1), but the model only predicts the link task, and is therefore not optimized for type prediction; 4) A non-sequence-to-sequence model that uses the hidden layers produced by the BLSTM encoder with the same type of attention as the PN (called BLSTM in the table)."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-173",
"text": "That is, d i in Equation 3 is replaced by e i ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-174",
"text": "In both corpora we compare against the following previously proposed models: Base Classifier (Stab & Gurevych, 2016 ) is feature-rich, task-specific (AC type or link extraction) SVM classifier."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-175",
"text": "Neither of these classifiers enforce structural or global constraints."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-176",
"text": "Conversely, the ILP Joint Model (Stab & Gurevych, 2016) provides constrains by sharing prediction information between the base classifier."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-177",
"text": "For example, the model attempts to enforce a tree structure among ACs within a given paragraph, as well as using incoming link predictions to better predict the type class claim."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-178",
"text": "For the microtext corpus only, we have the following comparative models: Simple (Peldszus & Stede, 2015) is a feature-rich logistic regression classifier."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-179",
"text": "Best EG (Peldszus & Stede, 2015) creates an Evidence Graph (EG) from the predictions of a set of base classifier."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-180",
"text": "The EG models the potential argument structure, and offers a global optimization objective that the base classifiers attempt to optimize by adjusting their individual weights."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-181",
"text": "Lastly, MP+p (Peldszus & Stede, 2015) combines predictions from base classifiers with a MSTParser, which applies 1-best MIRA structured learning."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-182",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-183",
"text": "**DISCUSSION**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-184",
"text": "First, we point out that the PN model achieves state-of-the-art on 10 of the 13 metrics in Tables 1 and 2 , including the highest results in all metrics on the Persuasive Essay corpus, as well as link prediction on the Microtext corpus."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-185",
"text": "The performance on the Microtext corpus is very encouraging for several reasons."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-186",
"text": "First, the fact that the model can perform so well with only a hundred training examples is rather remarkable."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-187",
"text": "Second, although we motivate the use of a PN due to the fact that it partially enforces the tree structure in argumentation, other models explicitly contain further constraints."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-188",
"text": "For example, only premises can have outgoing links, and there can be only one claim in an AC."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-189",
"text": "As for the other neural models, the BLSTM model performs competitively with the ILP Joint Model on the persuasive essay corpus, but trails the performance of the PN model."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-190",
"text": "We believe this is because the PN model is able to create two different representations for each AC, one each in the encoding/decoding state, which benefits performance in the dual tasks, whereas the BLSTM model must encode information relating to type as well as link prediction in a single hidden representation."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-191",
"text": "On one hand, the BLSTM model outperforms the ILP model on link prediction, yet it is not able to match the ILP Joint Model's performance on type prediction, primarily due to the BLSTM's poor performance on predicting the major claim class."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-192",
"text": "Another interesting outcome is the importance of the fully-connected layer before the LSTM input."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-193",
"text": "The results show that this extra layer of depth is crucial for good performance on this task."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-194",
"text": "Without it, the PN model is only able to perform competitively with the Base Classifier."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-195",
"text": "The results dictate that even a simple fully-connected layer with sigmoid activation can provide a useful dimensionality reduction for feature representation."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-196",
"text": "Finally, the PN model that only extracts links suffers a large drop in performance, conveying that the joint aspect of the PN model is crucial for high performance in the link prediction task."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-197",
"text": "Table 3 shows the results of an ablation study for AC feature representation."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-198",
"text": "Regarding link prediction, BOW features are clearly the most important, as their absence results in the highest drop in performance."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-199",
"text": "Conversely, the presence of structural features provides the smallest boost in performance, as the model is still able to record state-of-the-art results compared to the ILP Joint Model."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-200",
"text": "This shows that, one one hand, the PN model is able to capture structural ques through sequence modeling and semantics (the ILP Joint Model directly integrates these structural features), however the PN model still does benefit from their explicit presence in the feature representation."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-201",
"text": "When considering type prediction, both BOW and structural features are important, and it is the embedding features that provide the least benefit."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-202",
"text": "The Ablation results also provide an interesting insight into the effectiveness of different 'pooling' strategies for using individual token embeddings to create a multi-word embedding."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-203",
"text": "The popular method of averaging embeddings (which is used by Stab & Gurevych (2016) in their system) is in fact the worst method, although its performance is still competitive with the previous state-of-the-art."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-204",
"text": "Conversely, max pooling produces results that are on par with the PN results from Table 1 ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-205",
"text": "Table 4 shows the results on the Persuasive Essay test set with the examples binned by sequence length."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-206",
"text": "First, it is not a surprise to see that the model performs best when the sequences are the shortest."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-207",
"text": "As the sequence length increases, the accuracy on link prediction drops."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-208",
"text": "This is possibly due to the fact that as the length increases, a given AC has more possibilities as to which other AC it can link to, making the task more difficult."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-209",
"text": "Conversely, there is actually a rise in no link prediction accuracy from the second to third row."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-210",
"text": "This is likely due to the fact that since the model predicts at most one outgoing link, it indirectly predicts no link for the remaining ACs in the sequence."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-211",
"text": "Since the chance probability is low for having a link between a given AC in a long sequence, the no link performance is actually better in longer sequences."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-213",
"text": "**CONCLUSION**"
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-214",
"text": "In this paper we have proposed how to use a modified PN (Vinyals et al., 2015b) to extract links between ACs in argumentative text."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-215",
"text": "We evaluate our models on two corpora: a corpus of persuasive essays (Stab & Gurevych, 2016) , and a corpus of microtexts (Peldszus, 2014) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-216",
"text": "The PN model records state-of-the-art results on the persuasive essay corpus, as well as achieving state-of-the-art results for link prediction on the microtext corpus, despite only having 90 training examples."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-217",
"text": "The results show that jointly modeling the two prediction tasks is crucial for high performance, as well as the presence of a fully-connected layer prior to the LSTM input."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-218",
"text": "Future work can attempt to learn the AC representations themselves, such as in Kumar et al. (2015) ."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-219",
"text": "Lastly, future work can integrate subtasks 1 and 4 into the model."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-220",
"text": "The representations produced by Equation 3 could potentially be used to predict the type of link connecting ACs, i.e. supporting or attacking; this is the fourth subtask in the pipeline."
},
{
"sent_id": "b87a8d14f1c2016caa7538aa08a33f-C001-221",
"text": "In addition, a segmenting technique, such as the one proposed by Weston et al. (2014) , can accomplish subtask 1."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"b87a8d14f1c2016caa7538aa08a33f-C001-14"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-22"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-31",
"b87a8d14f1c2016caa7538aa08a33f-C001-32"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-38"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-51"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-116"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-145"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-176"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-215"
]
],
"cite_sentences": [
"b87a8d14f1c2016caa7538aa08a33f-C001-14",
"b87a8d14f1c2016caa7538aa08a33f-C001-22",
"b87a8d14f1c2016caa7538aa08a33f-C001-32",
"b87a8d14f1c2016caa7538aa08a33f-C001-38",
"b87a8d14f1c2016caa7538aa08a33f-C001-51",
"b87a8d14f1c2016caa7538aa08a33f-C001-116",
"b87a8d14f1c2016caa7538aa08a33f-C001-145",
"b87a8d14f1c2016caa7538aa08a33f-C001-176",
"b87a8d14f1c2016caa7538aa08a33f-C001-215"
]
},
"@BACK@": {
"gold_contexts": [
[
"b87a8d14f1c2016caa7538aa08a33f-C001-38"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-58"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-176"
]
],
"cite_sentences": [
"b87a8d14f1c2016caa7538aa08a33f-C001-38",
"b87a8d14f1c2016caa7538aa08a33f-C001-58",
"b87a8d14f1c2016caa7538aa08a33f-C001-176"
]
},
"@MOT@": {
"gold_contexts": [
[
"b87a8d14f1c2016caa7538aa08a33f-C001-174",
"b87a8d14f1c2016caa7538aa08a33f-C001-175",
"b87a8d14f1c2016caa7538aa08a33f-C001-176"
],
[
"b87a8d14f1c2016caa7538aa08a33f-C001-203"
]
],
"cite_sentences": [
"b87a8d14f1c2016caa7538aa08a33f-C001-176",
"b87a8d14f1c2016caa7538aa08a33f-C001-203"
]
}
}
},
"ABC_dc6d4eb1870ed5b0bbcbbf6686e5be_8": {
"x": [
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-2",
"text": "Annotating temporal relations (TempRel) between events described in natural language is known to be labor intensive, partly because the total number of TempRels is quadratic in the number of events."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-3",
"text": "As a result, only a small number of documents are typically annotated, limiting the coverage of various lexical/semantic phenomena."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-4",
"text": "In order to improve existing approaches, one possibility is to make use of the readily available, partially annotated data (P as in partial) that cover more documents."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-5",
"text": "However, missing annotations in P are known to hurt, rather than help, existing systems."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-6",
"text": "This work is a case study in exploring various usages of P for TempRel extraction."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-7",
"text": "Results show that despite missing annotations, P is still a useful supervision signal for this task within a constrained bootstrapping learning framework."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-8",
"text": "The system described in this system is publicly available."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-13",
"text": "Understanding the temporal information in natural language text is an important NLP task (Verhagen et al., 2007 (Verhagen et al., , 2010 UzZaman et al., 2013; Minard et al., 2015; Bethard et al., 2016 Bethard et al., , 2017 ."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-14",
"text": "A crucial component is temporal relation (TempRel; e.g., before or after) extraction (Mani et al., 2006; Bethard et al., 2007; Do et al., 2012; Mirza and Tonelli, 2016; Ning et al., 2017 Ning et al., , 2018a ."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-15",
"text": "The TempRels in a document or a sentence can be conveniently modeled as a graph, where the nodes are events, and the edges are labeled by TempRels."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-16",
"text": "Given all the events in an instance, TempRel annotation is the process of manually labeling all the edges -a highly labor intensive task due to two reasons."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-17",
"text": "One is that many edges require extensive reasoning over multiple sentences and labeling them is time-consuming."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-18",
"text": "Perhaps more importantly, the other reason is that #edges is quadratic in #nodes."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-19",
"text": "If labeling an edge takes 30 seconds (already an optimistic estimation), a typical document with 50 nodes would take more than 10 hours to annotate."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-20",
"text": "Even if existing annotation schemes make a compromise by only annotating edges whose nodes are from a same sentence or adjacent sentences , it still takes more than 2 hours to fully annotate a typical document."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-21",
"text": "Consequently, the only fully annotated dataset, TB-Dense , contains only 36 documents, which is rather small compared with datasets for other NLP tasks."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-22",
"text": "A small number of documents may indicate that the annotated data provide a limited coverage of various lexical and semantic phenomena, since a document is usually \"homogeneous\" within itself."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-23",
"text": "In contrast to the scarcity of fully annotated datasets (denoted by F as in full), there are actually some partially annotated datasets as well (denoted by P as in partial); for example, TimeBank (Pustejovsky et al., 2003) and AQUAINT (Graff, 2002) cover in total more than 250 documents."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-24",
"text": "Since annotators are not required to label all the edges in these datasets, it is less labor-intensive to collect P than to collect F. However, existing TempRel extraction methods only work on one type of dataset (i.e., either F or P), without taking advantage of both."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-25",
"text": "No one, as far as we know, has explored ways to combine both types of datasets in learning, or whether doing so is helpful."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-26",
"text": "This work is a case study in exploring various usages of P in the TempRel extraction task."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-27",
"text": "We empirically show that P is indeed useful within a (constrained) bootstrapping type of learning approach."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-28",
"text": "This case study is interesting from two perspectives."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-29",
"text": "First, incidental supervision (Roth, 2017)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-30",
"text": "In practice, supervision signals may not always be perfect: they may be noisy, only partial, based on different annotation schemes, or even on different (but relevant) tasks; incidental supervision is a general paradigm that aims at making use of abundant, naturally occurring data as supervision signals."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-31",
"text": "As for the TempRel extraction task, the existence of many partially annotated datasets P is a good fit for this paradigm and the result here can be informative for future investigations involving other incidental supervision signals."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-32",
"text": "Second, TempRel data collection."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-33",
"text": "The fact that P is shown to provide useful supervision signals poses some further questions: What is the optimal data collection scheme for TempRel extraction, fully annotated, partially annotated, or a mixture of both?"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-34",
"text": "For partially annotated data, what is the optimal ratio of annotated edges to unannotated edges?"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-35",
"text": "The proposed method in this work can be readily extended to study these questions in the future, as we further discuss in Sec. 5."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-37",
"text": "**EXISTING DATASETS AND METHODS**"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-38",
"text": "TimeBank (Pustejovsky et al., 2003) is a classic TempRel dataset, where the annotators were given a whole article and allowed to label TempRels between any pair of events."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-39",
"text": "Annotators in this setup usually focus only on salient relations but overlook some others."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-40",
"text": "It has been reported that many event pairs in TimeBank should have been annotated with a specific TempRel but the annotators failed to look at them (Chambers, 2013; Ning et al., 2017)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-41",
"text": "Consequently, we categorize TimeBank as a partially annotated dataset (P)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-42",
"text": "The same argument applies to other datasets that adopted this setup, such as AQUAINT (Graff, 2002), CaTeRs (Mostafazadeh et al., 2016) and RED (O'Gorman et al., 2016)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-43",
"text": "Most existing systems make use of P, including but not limited to (Mani et al., 2006; Bramsen et al., 2006; Chambers et al., 2007; Bethard et al., 2007; Verhagen and Pustejovsky, 2008; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012); this also applies to the TempEval workshop systems, e.g., (Laokulrat et al., 2013; Bethard, 2013; Chambers, 2013)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-44",
"text": "To address the missing annotation issue, Cassidy et al. (2014) proposed a dense annotation scheme, TB-Dense."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-45",
"text": "Edges are presented one by one, and the annotator has to choose a label for each (note that there is a vague label in case the TempRel is unclear or does not exist)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-46",
"text": "As a result, edges in TB-Dense are considered fully annotated in this paper."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-47",
"text": "The first system on TB-Dense was proposed in ."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-48",
"text": "Two recent TempRel extraction systems (Mirza and Tonelli, 2016; Ning et al., 2017) also reported their performance on TB-Dense (F) and on TempEval-3 (P) separately."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-49",
"text": "However, there are no existing systems that jointly train on both."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-50",
"text": "Given that the annotation guidelines of F and P are obviously different, it may not be optimal to simply treat P and F uniformly and train on their union."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-51",
"text": "This situation necessitates further investigation, which we undertake here."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-52",
"text": "Before introducing our joint learning approach, we have a few remarks about our choice of F and P datasets."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-53",
"text": "First, we note that TB-Dense is actually not fully annotated in the strict sense because only edges within a sliding, two-sentence window are presented."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-54",
"text": "That is, distant event pairs are intentionally ignored by the designers of TB-Dense."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-55",
"text": "However, since such distant pairs are consistently ruled out in the training and inference phase in this paper, it does not change the nature of the problem being investigated here."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-56",
"text": "At this point, TB-Dense is the only fully annotated dataset that can be adopted in this study, despite the aforementioned limitation."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-57",
"text": "Second, the partial annotations in datasets like TimeBank were not selected uniformly at random from all possible edges."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-58",
"text": "As described earlier, only salient and non-vague TempRels (often the easy ones) are labeled in these datasets."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-59",
"text": "Using TimeBank as P might potentially create some bias and we will need to keep this in mind when analyzing the results in Sec. 4."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-60",
"text": "Recent advances in TempRel data annotation (Ning et al., 2018c) can be used in the future to collect both F and P more easily."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-61",
"text": "**JOINT LEARNING ON F AND P** In this work, we study two learning paradigms that make use of both F and P. In the first, we simply treat those edges that are annotated in P as edges in F so that the learning process can be performed on top of the union of F and P. This is the most straightforward approach to using F and P jointly, and it is interesting to see if it already helps."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-62",
"text": "In the second, we use bootstrapping: we use F as a starting point and learn a TempRel extraction system on it (denoted by S_F), and then fill those missing annotations in P based on S_F (thus obtaining \"fully\" annotated P̂); finally, we treat P̂ as F and learn from both."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-63",
"text": "Algorithm 1 is a meta-algorithm of the above."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-64",
"text": "Algorithm 1: Joint learning from F and P by bootstrapping Input: F, P, Learn, Inference"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-65",
"text": "In Algorithm 1, we consistently use the sparse averaged perceptron algorithm as the \"Learn\" function."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-66",
"text": "As for \"Inference\" (Line 6), we further investigate two different ways: (i) Look at every unannotated edge in p \u2208 P and use S_{F+P̂} to label it; this local method ignores the existing annotated edges in P and is thus the standard bootstrapping."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-67",
"text": "(ii) Perform global inference on P with the annotated edges as constraints, which is a constrained bootstrapping, motivated by the fact that temporal graphs are structured and annotated edges have influence on the missing edges: in Fig. 1, the current annotation for (1, 2) and (2, 3) is before and vague."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-68",
"text": "We assume that the annotation (2, 3)=vague indicates that the relation cannot be determined even if the entire graph is considered."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-69",
"text": "Then with (1, 2)=before and (2, 3)=vague, we can see that (1, 3) cannot be uniquely determined, but it is restricted to be selected from {before, vague} rather than the entire label set."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-70",
"text": "We believe that global inference makes better use of the information provided by P; in fact, as we show in Sec. 4, it does perform better than local inference."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-71",
"text": "Figure 1: Nodes 1-3 are three time points; let (i, j) be the edge from node i to node j, where (i, j) \u2208 {before, after, equal, vague}. Assume the current annotation is (1, 2) = before and (2, 3) = vague, and (1, 3) is missing."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-72",
"text": "However, (1, 3) cannot be after because it leads to (2, 3) = after, conflicting with their current annotation; similarly, (1, 3) cannot be equal, either."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-73",
"text": "A standard way to perform global inference is to formulate it as an Integer Linear Programming (ILP) problem (Roth and Yih, 2004) and enforce transitivity rules as constraints."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-74",
"text": "Let R be the TempRel label set, I_r(ij) \u2208 {0, 1} be the indicator function of (i, j) = r, and f_r(ij) \u2208 [0, 1] be the corresponding softmax score obtained via S_{F+P̂}."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-75",
"text": "Then the ILP objective is formulated as maximizing \u2211_{i,j} \u2211_{r \u2208 R} f_r(ij) I_r(ij), subject to the uniqueness constraint \u2211_{r \u2208 R} I_r(ij) = 1 for every edge (i, j) and the transitivity constraint I_{r1}(ij) + I_{r2}(jk) - \u2211_m I_{r_m^3}(ik) \u2264 1, (1)"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-76",
"text": "where {r_m^3} is selected based on the general transitivity proposed in (Ning et al., 2017)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-77",
"text": "With Eq. (1), different implementations of Line 6 in Algorithm 1 can be described concisely as follows: (i) Local inference is performed by ignoring \"transitivity constraints\"."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-78",
"text": "(ii) Global inference can be performed by adding annotated edges in P as additional constraints."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-79",
"text": "Note that Algorithm 1 is only for the learning step of TempRel extraction; as for the inference step of this task, we consistently adopt the standard method of solving Eq. (1), as was done by (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Denis and Muller, 2011; Do et al., 2012; Ning et al., 2017)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-81",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-82",
"text": "In this work, we consistently used TB-Dense as the fully annotated dataset (F) and TBAQ as the partially annotated dataset (P)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-93",
"text": "However, it is surprising that System 5 was still worse than System 1, because the annotated edges in P are correct and should have helped."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-83",
"text": "The corpus statistics of these two datasets are provided in Table 1. Note that TBAQ is the union of TimeBank and AQUAINT and originally contained 256 documents, but 36 of them completely overlapped with TB-Dense, so we excluded these when constructing P. In addition, the number of edges shown in Table 1 only counts the event-event relations (i.e., it does not count the event-time relations therein), which are the focus of this work."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-84",
"text": "We also adopted the original split of TB-Dense (22 documents for training, 5 documents for development, and 9 documents for test)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-85",
"text": "Learning parameters were tuned to maximize their corresponding F-metric on the development set."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-86",
"text": "Using the selected parameters, systems were retrained with the development set incorporated and evaluated against the test split of TB-Dense (about 1.4K relations: 0.6K vague, 0.4K before, 0.3K after, and 0.1K for the rest)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-87",
"text": "Results are shown in Table 2, where all systems were compared in terms of their performance on \"same sentence\" edges (both nodes are from the same sentence), \"nearby sentence\" edges, all edges, and the temporal awareness metric used by the TempEval3 workshop."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-88",
"text": "The first part of Table 2 (Systems 1-5) refers to the baseline method proposed at the beginning of Sec. 3, i.e., simply treating P as F and training on their union."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-89",
"text": "P_Full is a variant of P obtained by filling its missing edges with vague."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-90",
"text": "Since it labels too many vague TempRels, System 2 suffered from a low recall."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-91",
"text": "In contrast, P does not contain any vague training examples, so System 3 would only predict specific TempRels, leading to a low precision."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-92",
"text": "Given the obvious difference between F and P_Full, System 4 expectedly performed worse than System 1."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-115",
"text": "**CONCLUSION**"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-94",
"text": "This unexpected observation suggests that simply adding the annotated edges from P into F is not a proper approach to learn from both."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-95",
"text": "The second part (Systems 6-7) serves as an ablation study showing the effect of bootstrapping only."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-96",
"text": "P_Empty is another variant of P, obtained by removing all the annotated edges (that is, only the nodes are kept)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-97",
"text": "Thus, Systems 6-7 did not get any information from the annotated edges in P, and any improvement came from bootstrapping alone."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-98",
"text": "Specifically, System 6 is the standard bootstrapping and System 7 is the constrained bootstrapping."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-99",
"text": "Built on top of Systems 6-7, Systems 8-9 further took advantage of the annotations of P, which resulted in additional improvements."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-100",
"text": "Compared to System 1 (trained on F only) and System 5 (simply adding P into F), the proposed System 9 achieved much better performance, which is also statistically significant with p<0.005 (McNemar's test)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-101",
"text": "While System 7 can be regarded as a reproduction of Ning et al. (2017), the original paper achieved an overall score of P=43.0, R=46.4, F=44.7 and an awareness score of P=42.6, R=44.0, F=43.3; the proposed System 9 is better than Ning et al. (2017) on all metrics."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-104",
"text": "**DISCUSSION**"
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-105",
"text": "While incorporating transitivity constraints in inference is widely used, Ning et al. (2017) proposed to incorporate these constraints in the learning phase as well."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-106",
"text": "One of the algorithms proposed in Ning et al. (2017) is based on Chang et al. (2012)'s constraint-driven learning (CoDL), which is the same as our intermediate System 7 in Table 2; the fact that System 7 is better than System 1 can thus be considered a reproduction of Ning et al. (2017)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-107",
"text": "Despite the technical similarity, this work is motivated differently and set to achieve a different goal: Ning et al. (2017) tried to enforce the transitivity structure, while the current work attempts to use imperfect (e.g., partially annotated) signals taken from additional data and learn in the incidental supervision framework."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-108",
"text": "The P used in this work is TBAQ, where only 12% of the edges are annotated."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-109",
"text": "In practice, every annotation comes at a cost, either in time or in the expenses paid to annotators, and as more edges are annotated, the marginal \"benefit\" of one edge diminishes (an extreme case is that an edge is of no value if it can be inferred from existing edges)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-110",
"text": "Therefore, a more general question is to find the optimal ratio of annotated edges in a graph."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-111",
"text": "Moreover, partial annotation is only one type of annotation imperfection."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-112",
"text": "If the annotation is noisy, we can alter the hard constraints derived from P and use soft regularization terms; if the annotation is for a different but relevant task, we can formulate corresponding constraints to connect that different task to the task at hand."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-113",
"text": "Being able to learn from these \"indirect\" signals is appealing because indirect signals are usually orders of magnitude larger than datasets dedicated to a single task."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-116",
"text": "Temporal relation (TempRel) extraction is important but TempRel annotation is labor intensive."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-117",
"text": "While fully annotated datasets (F) are relatively small, there exist more datasets with partial annotations (P)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-118",
"text": "This work provides the first investigation of learning from both types of datasets, and this preliminary study already shows promise."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-119",
"text": "Table 2 : Performance of various usages of the partially annotated data in training."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-120",
"text": "F: Fully annotated data."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-121",
"text": "P: Partially annotated data."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-122",
"text": "P_Full: P with missing annotations filled by vague."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-123",
"text": "P_Empty: P with all annotations removed."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-124",
"text": "Bootstrap: referring to specific implementations of Line 6 in Algorithm 1, i.e., local or global."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-125",
"text": "Same/nearby sentence: edges whose nodes appear in the same/nearby sentences in text."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-126",
"text": "Overall: all edges."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-127",
"text": "Awareness: the temporal awareness metric used in the TempEval3 workshop, measuring how useful the predicted graphs are (UzZaman et al., 2013)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-128",
"text": "System 7 can also be considered as a reproduction of Ning et al. (2017) (see the discussion in Sec. 5 for details)."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-129",
"text": "Two bootstrapping algorithms (standard and constrained) are analyzed and the benefit of P, although with missing annotations, is shown on a benchmark dataset."
},
{
"sent_id": "dc6d4eb1870ed5b0bbcbbf6686e5be-C001-130",
"text": "This work may be a good starting point for further investigations of incidental supervision and data collection schemes of the TempRel extraction task."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-13",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-14"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-40"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-48"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-73",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-75",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-76"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-105"
]
],
"cite_sentences": [
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-14",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-40",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-48",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-76",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-105"
]
},
"@MOT@": {
"gold_contexts": [
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-39",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-40"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-48",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-49"
]
],
"cite_sentences": [
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-40",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-48"
]
},
"@USE@": {
"gold_contexts": [
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-75",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-76"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-79"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-106"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-128"
]
],
"cite_sentences": [
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-76",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-79",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-106",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-128"
]
},
"@DIF@": {
"gold_contexts": [
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-101",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-87",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-88",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-95"
],
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-107"
]
],
"cite_sentences": [
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-101",
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-107"
]
},
"@SIM@": {
"gold_contexts": [
[
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-107"
]
],
"cite_sentences": [
"dc6d4eb1870ed5b0bbcbbf6686e5be-C001-107"
]
}
}
},
"ABC_304773c64de1f0906f0246f2aa0d29_8": {
"x": [
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-205",
"text": "If a conflict happens, a third annotator makes the final judgment."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-206",
"text": "The average inter-annotator agreement is 0.74."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-2",
"text": "Mining opinion targets is a fundamental and important task for opinion mining from online reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-3",
"text": "To this end, there are usually two kinds of methods: syntax based and alignment based methods."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-4",
"text": "Syntax based methods usually exploited syntactic patterns to extract opinion targets, which, however, were prone to parsing errors when dealing with informal online texts."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-5",
"text": "In contrast, alignment based methods used a word alignment model to fulfill this task, which avoids parsing errors since no parsing is required."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-6",
"text": "However, there is no research focusing on which kind of method performs better when given a certain amount of reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-7",
"text": "To fill this gap, this paper empirically studies how the performance of these two kinds of methods varies when changing the size, domain and language of the corpus."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-8",
"text": "We further combine syntactic patterns with alignment model by using a partially supervised framework and investigate whether this combination is useful or not."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-9",
"text": "In our experiments, we verify that our combination is effective on corpora of small and medium size."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-12",
"text": "With the rapid development of Web 2.0, huge amounts of user reviews are springing up on the Web."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-13",
"text": "Mining opinions from these reviews becomes more and more urgent, since customers expect to obtain fine-grained information about products and manufacturers need immediate feedback from customers."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-14",
"text": "In opinion mining, extracting opinion targets is a basic subtask."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-15",
"text": "It aims to extract a list of the objects on which users express opinions, and it provides prior information about targets for opinion mining."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-16",
"text": "So this task has attracted much attention."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-17",
"text": "To extract opinion targets, previous approaches usually relied on opinion words, i.e., the words used to express opinions (Hu and Liu, 2004a; Popescu and Etzioni, 2005; Liu et al., 2005; Wang and Wang, 2008; Qiu et al., 2011; Liu et al., 2012)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-18",
"text": "Intuitively, opinion words often appear around and modify opinion targets, and there are opinion relations and associations between them."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-19",
"text": "If we know some words to be opinion words, the words which those opinion words modify have a high probability of being opinion targets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-20",
"text": "Therefore, identifying the aforementioned opinion relations between words is important for extracting opinion targets from reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-21",
"text": "To fulfill this aim, previous methods exploited word co-occurrence information to identify these relations (Hu and Liu, 2004a; Hu and Liu, 2004b)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-22",
"text": "Obviously, these methods cannot obtain precise extractions because of the diverse expressions used by reviewers, such as long-span modification relations between words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-23",
"text": "To handle this problem, several methods exploited syntactic information, designing heuristic patterns based on syntactic parsing (Popescu and Etzioni, 2005; Qiu et al., 2009; Qiu et al., 2011)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-24",
"text": "However, the sentences in online reviews usually have informal writing styles, including grammar mistakes, typos, improper punctuation, etc., which make parsing prone to errors."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-25",
"text": "As a result, the syntax-based methods, which heavily depended on parsing performance, would suffer from parsing errors."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-26",
"text": "To improve the extraction performance, we can employ only a few exquisite high-precision patterns, but this strategy is likely to miss many opinion targets and has lower recall as the corpus size increases."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-27",
"text": "To resolve these problems, Liu et al. (2012) formulated identifying opinion relations between words as a monolingual word alignment process."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-28",
"text": "A word can find its corresponding modifiers by using a word alignment model (WAM)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-29",
"text": "Without using syntactic parsing, the noises from parsing errors can be effectively avoided."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-30",
"text": "Nevertheless, we notice that the alignment model is a statistical model which needs sufficient data to estimate parameters."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-31",
"text": "When the data is insufficient, it suffers from data sparseness, which may make the performance decline."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-32",
"text": "Thus, from the above analysis, we can observe that the size of the corpus has an impact on these two kinds of methods, which raises some important questions: how should we choose between syntax based methods and alignment based methods for opinion target extraction when given a certain amount of reviews? And which kind of method obtains better extraction performance as the size of the dataset varies?"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-33",
"text": "Although Liu et al. (2012) proved the effectiveness of WAM, they mainly performed experiments on a medium-sized dataset."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-34",
"text": "We are still curious whether the same conclusion holds when the dataset is larger or smaller."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-35",
"text": "To the best of our knowledge, these problems have not been studied before."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-36",
"text": "Moreover, opinions may be expressed in different ways as the domain and language of the corpus vary."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-37",
"text": "When the domain or language of the corpus is changed, what conclusions can we obtain? To answer these questions, in this paper we adopt a unified framework to extract opinion targets from reviews, in the key component of which we vary the method between syntactic patterns and the alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-38",
"text": "Then we run the whole framework on corpora of different sizes (from 500 to 1,000,000), domains (three domains) and languages (Chinese and English) to empirically assess the performance variations and discuss which method is more effective."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-39",
"text": "Furthermore, this paper naturally addresses another question: is it useful for opinion targets extraction when we combine syntactic patterns and word alignment model into a unified model?"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-40",
"text": "To this end, we employ a partially supervised alignment model (PSWAM) like (Gao et al., 2010; Liu et al., 2013) ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-41",
"text": "Based on the exquisitely designed high-precision syntactic patterns, we can obtain some precisely modified relations between words in sentences, which provide a portion of links of the full alignments."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-42",
"text": "Then, these partial alignment links can be regarded as the constrains for a standard unsupervised word alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-43",
"text": "And each target candidate would find its modifier under the partial supervision."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-44",
"text": "In this way, the errors generated in standard unsupervised WAM can be corrected."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-45",
"text": "For example in Figure 1 , \"kindly\" and \"courteous\" are incorrectly regarded as the modifiers for \"foods\" if the WAM is performed in an whole unsupervised framework."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-46",
"text": "However, by using some high-precision syntactic patterns, we can assert \"courteous\" should be aligned to \"services\", and \"delicious\" should be aligned to \"foods\"."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-47",
"text": "Through combination under partial supervision, we can see \"kindly\" and \"courteous\" are correctly linked to \"services\"."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-48",
"text": "Thus, it's reasonable to expect to yield better performance than traditional methods."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-49",
"text": "As mentioned in (Liu et al., 2013) , using PSWAM can not only inherit the advantages of WAM: effectively avoiding noises from syntactic parsing errors when dealing with informal texts, but also can improve the mining performance by using partial supervision."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-50",
"text": "However, is this kind of combination always useful for opinion target extraction?"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-51",
"text": "To access this problem, we also make comparison between PSWAM based method and the aforementioned methods in the same corpora with different size, language and domain."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-52",
"text": "The experimental results show the combination by using PSWAM can be effective on dataset with small and medium size."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-54",
"text": "**RELATED WORK**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-55",
"text": "Opinion target extraction isn't a new task for opinion mining."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-56",
"text": "There are much work focusing on this task, such as (Hu and Liu, 2004b; Ding et al., 2008; Li et al., 2010; Popescu and Etzioni, 2005; ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-57",
"text": "Totally, previous studies can be divided into two main categories: supervised and unsupervised methods."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-58",
"text": "In supervised approaches, the opinion target extraction task was usually regarded as a sequence labeling problem (Jin and Huang, 2009; Li et al., 2010; Ma and Wan, 2010; )."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-59",
"text": "It's not only to extract a lexicon or list of opinion targets, but also to find out each opinion target mentions in reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-60",
"text": "Thus, the contextual words are usually selected as the features to indicate opinion targets in sentences. And classical sequence labeling models are used to train the extractor, such as CRFs (Li et al., 2010) , HMM (Jin and Huang, 2009) etc.. Jin et al. (2009) proposed a lexicalized HMM model to perform opinion mining."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-61",
"text": "Both Li et al. (2010) and Ma et al. (2010) used CRFs model to extract opinion targets in reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-62",
"text": "Specially, Li et al. proposed a Skip-Tree CRF model for opinion target extraction, which exploited three structures including linear-chain structure, syntactic structure, and conjunction structure."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-63",
"text": "However, the main limitation of these supervised methods is the need of labeled training data."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-64",
"text": "If the labeled training data is insufficient, the trained model would have unsatisfied extraction performance."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-65",
"text": "Labeling sufficient training data is time and labor consuming. And for different domains, we need label data independently, which is obviously impracticable."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-66",
"text": "Thus, many researches focused on unsupervised methods, which are mainly to extract a list of opinion targets from reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-67",
"text": "Similar to ours, most approaches regarded opinion words as the indicator for opinion targets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-68",
"text": "(Hu and Liu, 2004a) regarded the nearest adjective to an noun/noun phrase as its modifier."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-69",
"text": "Then it exploited an association rule mining algorithm to mine the associations between them."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-70",
"text": "Finally, the frequent explicit product features can be extracted in a bootstrapping process by further combining item's frequency in dataset."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-71",
"text": "Only using nearest neighbor rule to mine the modifier for each candidate cannot obtain precise results."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-72",
"text": "Thus, (Popescu and Etzioni, 2005) used syntax information to extract opinion targets, which designed some syntactic patterns to capture the modified relations between words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-73",
"text": "The experimental results showed that their method had better performance than (Hu and Liu, 2004a) ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-74",
"text": "Moreover, (Qiu et al., 2011) proposed a Double Propagation method to expand sentiment words and opinion targets iteratively, where they also exploited syntactic relations between words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-75",
"text": "Specially, (Qiu et al., 2011) didn't only design syntactic patterns for capturing modified relations, but also designed patterns for capturing relations among opinion targets and relations among opinion words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-76",
"text": "However, the main limitation of Qiu's method is that the patterns based on dependency parsing tree may miss many targets for the large corpora."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-77",
"text": "Therefore, Zhang et al. (2010) extended Qiu's method."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-78",
"text": "Besides the patterns used in Qiu's method, they adopted some other special designed patterns to increase recall."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-79",
"text": "In addition they used the HITS (Kleinberg, 1999) algorithm to compute opinion target confidences to improve the precision."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-80",
"text": "(Liu et al., 2012) formulated identifying opinion relations between words as an alignment process."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-81",
"text": "They used a completely unsupervised WAM to capture opinion relations in sentences."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-82",
"text": "Then the opinion targets were extracted in a standard random walk framework where two factors were considered: opinion relevance and target importance."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-83",
"text": "Their experimental results have shown that WAM was more effective than traditional syntax-based methods for this task."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-84",
"text": "(Liu et al., 2013 ) extend Liu's method, which is similar to our method and also used a partially supervised alignment model to extract opinion targets from reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-85",
"text": "We notice these two methods ( (Liu et al., 2012) and (Liu et al., 2013) ) only performed experiments on the corpora with a medium size."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-86",
"text": "Although both of them proved that WAM model is better than the methods based on syntactic patterns, they didn't discuss the performance variation when dealing with the corpora with different sizes, especially when the size of the corpus is less than 1,000 and more than 10,000."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-87",
"text": "Based on their conclusions, we still don't know which kind of methods should be selected for opinion target extraction when given a certain amount of reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-89",
"text": "**OPINION TARGET EXTRACTION METHODOLOGY**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-90",
"text": "To extract opinion targets from reviews, we adopt the framework proposed by (Liu et al., 2012) , which is a graph-based extraction framework and has two main components as follows."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-91",
"text": "1) The first component is to capture opinion relations in sentences and estimate associations between opinion target candidates and potential opinion words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-92",
"text": "In this paper, we assume opinion targets to be nouns or noun phrases, and opinion words may be adjectives or verbs, which are usually adopted by (Hu and Liu, 2004a; Qiu et al., 2011; Wang and Wang, 2008; Liu et al., 2012) ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-93",
"text": "And a potential opinion relation is comprised of an opinion target candidate and its corresponding modified word."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-94",
"text": "2) The second component is to estimate the confidence of each candidate."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-95",
"text": "The candidates with higher confidence scores than a threshold will be extracted as opinion targets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-96",
"text": "In this procedure, we formulate the associations between opinion target candidates and potential opinion words in a bipartite graph."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-97",
"text": "A random walk based algorithm is employed on this graph to estimate the confidence of each target candidate."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-98",
"text": "In this paper, we fix the method in the second component and vary the algorithms in the first component."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-99",
"text": "In the first component, we respectively use syntactic patterns and unsupervised word alignment model (WAM) to capture opinion relations."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-100",
"text": "In addition, we employ a partially supervised word alignment model (PSWAM) to incorporate syntactic information into WAM."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-101",
"text": "In experiments, we run the whole framework on the different corpora to discuss which method is more effective."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-102",
"text": "In the following subsections, we will present them in detail."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-104",
"text": "**THE FIRST COMPONENT: CAPTURING**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-105",
"text": "Opinion Relations and Estimating Associations between Words"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-107",
"text": "**SYNTACTIC PATTERNS**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-108",
"text": "To capture opinion relations in sentences by using syntactic patterns, we employ the manual designed syntactic patterns proposed by (Qiu et al., 2011) ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-109",
"text": "Similar to Qiu, only the syntactic patterns based on the direct dependency are employed to guarantee the extraction qualities."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-110",
"text": "The direct dependency has two types."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-111",
"text": "The first type indicates that one word depends on the other word without any additional words in their dependency path."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-112",
"text": "The second type denotes that two words both depend on a third word directly."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-113",
"text": "Specifically, we employ Minipar 1 to parse sentences."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-114",
"text": "To further make syn-1 http://webdocs.cs.ualberta.ca/lindek/minipar.htm tactic patterns precisely, we only use a few dependency relation labels outputted by Minipar, such as mod, pnmod, subj, desc etc."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-115",
"text": "To make a clear explanation, we give out some syntactic pattern examples in Table 1 ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-116",
"text": "In these patterns, OC is a potential opinion word which is an adjective or a verb."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-117",
"text": "T C is an opinion target candidate which is a noun or noun phrase."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-118",
"text": "The item on the arrows means the dependency relation type."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-119",
"text": "The item in parenthesis denotes the part-of-speech of the other word."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-120",
"text": "In these examples, the first three patterns are based on the first direct dependency type and the last two patterns are based on the second direct dependency type."
},
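As an illustration of how such direct-dependency patterns are applied, the sketch below filters dependency triples by relation label and part-of-speech. The triple format, tag set and pattern direction are our own simplifications for illustration, not the exact patterns of (Qiu et al., 2011).

```python
# Hypothetical dependency triples: (head, relation, dependent, head POS, dependent POS)
def extract_direct_relations(triples):
    """Apply a first-type direct-dependency pattern: keep (TC, OC) pairs where
    a potential opinion word (adjective/verb) directly modifies a noun/noun
    phrase via one of the retained Minipar relation labels."""
    kept_labels = {"mod", "pnmod", "subj", "desc"}
    relations = []
    for head, rel, dep, head_pos, dep_pos in triples:
        if rel in kept_labels and head_pos in {"NN", "NP"} and dep_pos in {"JJ", "VB"}:
            relations.append((head, dep))  # (target candidate, opinion word)
    return relations

triples = [("foods", "mod", "delicious", "NN", "JJ"),
           ("services", "mod", "courteous", "NN", "JJ"),
           ("eat", "subj", "we", "VB", "PRP")]
```

Only the first two triples survive the filter; the third is discarded because its head is a verb, not a target candidate.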
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-121",
"text": "In this subsection, we present our method for capturing opinion relations using unsupervised word alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-122",
"text": "Similar to (Liu et al., 2012) , every sentence in reviews is replicated to generate a parallel sentence pair, and the word alignment algorithm is applied to the monolingual scenario to align a noun/noun phase with its modifiers."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-123",
"text": "We select IBM-3 model (Brown et al., 1993) as the alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-124",
"text": "Formally, given a sentence S = {w 1 , w 2 , ..., w n }, we have"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-125",
"text": "where t(w j |w a j ) models the co-occurrence information of two words in dataset."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-126",
"text": "d(j|a j , n) models word position information, which describes the probability of a word in position a j aligned with a word in position j. And n(\u03c6 i |w i ) describes the ability of a word for modifying (being modified by) several words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-127",
"text": "\u03c6 i denotes the number of words that are aligned with w i ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-128",
"text": "In our experiments, we set \u03c6 i = 2."
},
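As a rough illustration of how an alignment is scored under this factorization, the sketch below evaluates the product of the fertility term n, the translation term t and the distortion term d in log space. The toy parameter tables are invented for the example; a real system estimates them with EM over the whole corpus.

```python
import math

def alignment_score(sentence, a, t, d, n_fert):
    """Log-score an alignment a (a[j] = position the j-th word aligns to)
    under an IBM-3-style factorization: fertility * translation * distortion."""
    N = len(sentence)
    # fertility: how many words align to position i
    fert = [sum(1 for aj in a if aj == i) for i in range(N)]
    score = 0.0
    for i, w in enumerate(sentence):
        score += math.log(n_fert.get((fert[i], w), 1e-9))      # n(phi_i | w_i)
    for j, aj in enumerate(a):
        score += math.log(t.get((sentence[j], sentence[aj]), 1e-9))  # t(w_j | w_aj)
        score += math.log(d.get((j, aj, N), 1e-9))                    # d(j | a_j, n)
    return score

s = ["foods", "are", "delicious"]
t = {("foods", "delicious"): 0.9, ("are", "are"): 0.9, ("delicious", "foods"): 0.9}
d = {(j, i, 3): 1.0 / 3 for j in range(3) for i in range(3)}
n_fert = {(1, w): 0.9 for w in s}
good = alignment_score(s, [2, 1, 0], t, d, n_fert)  # foods <-> delicious
bad = alignment_score(s, [1, 1, 1], t, d, n_fert)   # everything aligns to "are"
```

The alignment linking "foods" with "delicious" scores higher because its translation and fertility terms are all well supported by the toy tables.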
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-129",
"text": "Since we only have interests on capturing opinion relations between words, we only pay attentions on the alignments between opinion target candidates (nouns/noun phrases) and potential opinion words (adjectives/verbs)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-130",
"text": "If we directly use the alignment model, a noun (noun phrase) may align with other unrelated words, like prepositions or conjunctions and so on."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-131",
"text": "Thus, we set constrains on the model: 1) Alignment links must be assigned among nouns/noun phrases, adjectives/verbs and null words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-132",
"text": "Aligning to null words means that this word has no modifier or modifies nothing; 2) Other unrelated words can only align with themselves."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-133",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-134",
"text": "**COMBINING SYNTAX-BASED METHOD WITH**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-135",
"text": "Alignment-based Method"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-136",
"text": "In this subsection, we try to combine syntactic information with word alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-137",
"text": "As mentioned in the first section, we adopt a partially supervised alignment model to make this combination."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-138",
"text": "Here, the opinion relations obtained through the high-precision syntactic patterns (Section 3.1.1) are regarded as the ground truth and can only provide a part of full alignments in sentences."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-139",
"text": "They are treated as the constrains for the word alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-140",
"text": "Given some partial align-"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-141",
"text": ", where (i, a i ) means that a noun (noun phrase) at position i is aligned with its modifier at position a i ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-142",
"text": "Since the labeled data provided by syntactic patterns is not a full alignment, we adopt a EM-based algorithm, named as constrained hill-climbing algorithm (Gao et al., 2010) , to estimate the parameters in the model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-143",
"text": "In the training process, the constrained hill-climbing algorithm can ensure that the final model is marginalized on the partial alignment links."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-144",
"text": "Particularly, in the E step, their method aims to find out the alignments which are consistent to the alignment links provided by syntactic patterns, where there are main two steps involved."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-145",
"text": "1) Optimize towards the constraints."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-146",
"text": "This step aims to generate an initial alignments for alignment model (IBM-3 model in our method), which can be close to the constraints."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-147",
"text": "First, a simple alignment model (IBM-1, IBM-2, HMM etc.) is trained."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-148",
"text": "Then, the evidence being inconsistent to the partial alignment links will be got rid of by using the move operator operator m i,j which changes a j = i and the swap operator s j 1 ,j 2 which exchanges a j 1 and a j 2 ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-149",
"text": "The alignment is updated iteratively until no additional inconsistent links can be removed."
},
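The repair of an initial alignment with the move and swap operators can be sketched as follows. This is a simplified illustration of step 1 only, not Gao et al.'s full constrained hill-climbing implementation; the preference for a swap over a move is one plausible policy.

```python
def move(a, j, i):
    """Move operator m_{i,j}: set a_j = i."""
    a = list(a)
    a[j] = i
    return a

def swap(a, j1, j2):
    """Swap operator s_{j1,j2}: exchange a_{j1} and a_{j2}."""
    a = list(a)
    a[j1], a[j2] = a[j2], a[j1]
    return a

def constrain_alignment(a, partial_links):
    """Repair an initial alignment a so it is consistent with the partial
    links {j: i} provided by the high-precision syntactic patterns."""
    a = list(a)
    for j, i in partial_links.items():
        if a[j] != i:
            # prefer a swap if some other position already aligns to i,
            # otherwise force the required link with a move
            others = [k for k, v in enumerate(a) if v == i and k != j]
            a = swap(a, j, others[0]) if others else move(a, j, i)
    return a
```

For example, if position 0 must align to 1 but position 2 currently does, a single swap makes the alignment consistent with the constraint.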
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-150",
"text": "2) Towards the optimal alignment under the constraints."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-151",
"text": "This step aims to optimize towards the optimal alignment under the constraints which starts from the aforementioned initial alignments."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-152",
"text": "Gao et.al."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-153",
"text": "(2010) set the corresponding cost value of the invalid move or swap operation in M and S to be negative, where M and S are respectively called Moving Matrix and Swapping Matrix, which record all possible move and swap costs between two different alignments."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-154",
"text": "In this way, the invalid operators will never be picked which can guarantee that the final alignment links to have high probability to be consistent with the partial alignment links provided by high-precision syntactic patterns."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-155",
"text": "Then in M-step, evidences from the neighbor of final alignments are collected so that we can produce the estimation of parameters for the next iteration."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-156",
"text": "In the process, those statistics which come from inconsistent alignment links aren't be picked up."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-157",
"text": "Thus, we have"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-158",
"text": "(2) where \u03bb means that we make soft constraints on the alignment model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-207",
"text": "We also perform a significant test, i.e., a t-test with a default significant level of 0.05."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-208",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-209",
"text": "**COMPARED METHODS**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-210",
"text": "We select three methods for comparison as follows."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-253",
"text": "This indicates that our method based on PSWAM is effective for opinion target extraction."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-159",
"text": "As a result, we expect some errors generated through high-precision patterns (Section 3.1.1) may be revised in the alignment process."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-160",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-161",
"text": "**ESTIMATING ASSOCIATIONS BETWEEN WORDS**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-162",
"text": "After capturing opinion relations in sentences, we can obtain a lot of word pairs, each of which is comprised of an opinion target candidate and its corresponding modified word."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-163",
"text": "Then the conditional probabilities between potential opinion target w t and potential opinion word w o can be estimated by using maximum likelihood estimation."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-164",
"text": "Thus, we have P (w t |w o ) = Count(wt,wo) Count(wo) , where Count(\u00b7) means the item's frequency information."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-165",
"text": "P (w t |w o ) means the conditional probabilities between two words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-166",
"text": "At the same time, we can obtain conditional probability P (w o |w t )."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-167",
"text": "Then, similar to (Liu et al., 2012) , the association between an opinion target candidate and its modifier is estimated as follows."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-168",
"text": "Association(w t , w o ) = (\u03b1 \u00d7 P (w t |w o ) + (1 \u2212 \u03b1) \u00d7 P (w o |w t )) \u22121 , where \u03b1 is the harmonic factor."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-169",
"text": "We set \u03b1 = 0.5 in our experiments."
},
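A minimal sketch of this estimation step, assuming the weighted-harmonic-mean form of the association given above; the review word pairs are invented for the example.

```python
from collections import Counter

def estimate_associations(pairs, alpha=0.5):
    """Estimate Association(w_t, w_o) from extracted (target, opinion word)
    pairs via maximum likelihood conditionals P(w_t|w_o) and P(w_o|w_t),
    combined with a weighted harmonic mean."""
    pair_count = Counter(pairs)
    target_count = Counter(t for t, _ in pairs)
    opinion_count = Counter(o for _, o in pairs)

    assoc = {}
    for (t, o), c in pair_count.items():
        p_t_given_o = c / opinion_count[o]   # P(w_t | w_o)
        p_o_given_t = c / target_count[t]    # P(w_o | w_t)
        # weighted harmonic mean of the two conditional probabilities
        assoc[(t, o)] = 1.0 / (alpha / p_t_given_o + (1 - alpha) / p_o_given_t)
    return assoc

pairs = [("screen", "bright"), ("screen", "bright"), ("battery", "short"),
         ("screen", "clear"), ("battery", "bright")]
assoc = estimate_associations(pairs)
```

Pairs that co-occur often relative to both words' total counts receive the highest association, which is exactly what the bipartite graph in the second component consumes as edge weights.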
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-171",
"text": "**THE SECOND COMPONENT: ESTIMATING CANDIDATE CONFIDENCE**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-172",
"text": "In the second component, we adopt a graph-based algorithm used in (Liu et al., 2012) to compute the confidence of each opinion target candidate, and the candidates with higher confidence than the threshold will be extracted as the opinion targets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-173",
"text": "Here, opinion words are regarded as the important indicators."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-174",
"text": "We assume that two target candidates are likely to belong to the similar category, if they are modified by similar opinion words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-175",
"text": "Thus, we can propagate the opinion target confidences through opinion words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-176",
"text": "To model the mined associations between words, a bipartite graph is constructed, which is defined as a weighted undirected graph G = (V, E, W )."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-177",
"text": "It contains two kinds of vertex: opinion target candidates and potential opinion words, respectively denoted as v t \u2208 V and v o \u2208 V ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-178",
"text": "As shown in Figure 2 , the white vertices represent opinion target candidates and the gray vertices represent potential opinion words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-179",
"text": "An edge e vt,vo \u2208 E between vertices represents that there is an opinion relation, and the weight w on the edge represents the association between two words."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-180",
"text": "To estimate the confidence of each opinion target candidate, we employ a random walk algorithm on our graph, which iteratively computes the weighted average of opinion target confidences from neighboring vertices."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-181",
"text": "Thus we have"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-182",
"text": "where C i+1 and C i respectively represent the opinion target confidence vector in the (i + 1) th and i th iteration."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-183",
"text": "M is the matrix of word associations, where M i,j denotes the association between the opinion target candidate i and the potential opinion word j. And I is defined as the prior confidence of each candidate for opinion target."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-184",
"text": "Similar to (Liu et al., 2012) , we set each item in"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-185",
"text": ", where tf (v) is the term frequency of v in the corpus, and df (v) is computed by using the Google n-gram corpus 2 ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-186",
"text": "\u03b2 \u2208 [0, 1] represents the impact of candidate prior knowledge on the final estimation results."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-187",
"text": "In experiments, we set \u03b2 = 0.4."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-188",
"text": "The algorithm run until convergence which is achieved when the confidence on each node ceases to change in a tolerance value."
},
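The propagation loop can be sketched as below, a sketch under the update C i+1 = (1 − β) · A · C i + β · I with a row-normalized propagation matrix A = M M^T; the exact normalization used in (Liu et al., 2012) may differ, and the small association matrix here is invented for the example.

```python
import numpy as np

def candidate_confidence(M, I, beta=0.4, tol=1e-6, max_iter=1000):
    """Iteratively estimate opinion-target confidences on the bipartite graph.

    M[i, j] holds the association between target candidate i and opinion
    word j; I is the prior confidence vector. A = M @ M.T propagates
    confidence between candidates that share opinion words; rows of A are
    normalized so that the iteration is a contraction and converges."""
    A = M @ M.T
    A = A / A.sum(axis=1, keepdims=True)
    C = I.copy()
    for _ in range(max_iter):
        C_next = (1 - beta) * (A @ C) + beta * I
        if np.max(np.abs(C_next - C)) < tol:
            return C_next
        C = C_next
    return C

# three target candidates, two opinion words
M = np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]])
I = np.array([0.5, 0.3, 0.2])
C = candidate_confidence(M, I)
```

Because the row-normalized propagation matrix has infinity norm 1, the (1 − β) factor guarantees convergence to a unique fixed point regardless of the starting vector.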
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-190",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-192",
"text": "**DATASETS AND EVALUATION METRICS**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-193",
"text": "In this section, to answer the questions mentioned in the first section, we collect a large collection named as LARGE, which includes reviews from three different domains and different languages."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-194",
"text": "This collection was also used in (Liu et al., 2012) ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-195",
"text": "In the experiments, reviews are first segmented into sentences according to punctuation."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-196",
"text": "The detailed statistical information of the used collection is shown in Table 2, where Restaurant is crawled from the Chinese Web site: www.dianping.com."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-197",
"text": "The Hotel and MP3 are used in (Wang et al., 2011) , which are respectively crawled from www.tripadvisor.com and www.amazon.com."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-198",
"text": "For each dataset, we perform random sampling to generate testing set with different sizes, where we use sampled subsets with #sentences = 5 \u00d7 10 2 , 10 3 , 5 \u00d7 10 3 , 10 4 , 5 \u00d7 10 4 , 10 5 and 10 6 sentences respectively."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-199",
"text": "Each sentence is tokenized, part-of-speech tagged by using Stanford NLP tool 3 , and parsed by using Minipar toolkit. And the method of (Zhu et al., 2009 ) is used to identify noun phrases."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-200",
"text": "We select precision and recall as the metrics."
},
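Precision and recall over the labeled subsets are computed in the standard set-based way; a minimal sketch follows, with invented item names.

```python
def precision_recall(extracted, gold):
    """Precision and recall of an extracted opinion-target list against the
    manually labeled gold set for one subset."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(["screen", "battery", "price"],
                        ["screen", "battery", "service"])
```

Here two of three extracted targets are correct and two of three gold targets are found, so both metrics are 2/3.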
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-201",
"text": "Specifically, to obtain the ground truth, we manually label all opinion targets for each subset."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-202",
"text": "In this process, three annotators are involved."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-203",
"text": "First, every noun/noun phrase and its contexts in review sentences are extracted."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-204",
"text": "Then two annotators were required to judge whether every noun/noun phrase is opinion target or not."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-211",
"text": "\u2022 Syntax: It uses syntactic patterns mentioned in Section 3.1.1 in the first component to capture opinion relations in reviews."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-212",
"text": "Then the associations between words are estimated and the graph based algorithm proposed in the second component (Section 3.3) is performed to extract opinion targets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-213",
"text": "\u2022 WAM: It is similar to Syntax, where the only difference is that WAM uses unsupervised WAM (Section 3.1.2) to capture opinion relations."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-214",
"text": "\u2022 PSWAM is similar to Syntax and WAM, where the difference is that PSWAM uses the method mentioned in Section 3.1.3 to capture opinion relations, which incorporates syntactic information into word alignment model by using partially supervised framework."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-215",
"text": "The experimental results on different domains are respectively shown in Figure 3 , 4 and 5."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-217",
"text": "**SYNTAX BASED METHODS VS. ALIGNMENT BASED METHODS**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-218",
"text": "Comparing Syntax with WAM and PSWAM, we can obtain the following observations: 1) When the size of the corpus is small, Syntax has better precision than alignment based methods (WAM and PSWAM)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-219",
"text": "We believe the reason is that the high-precision syntactic patterns employed in Syntax can effectively capture opinion relations in a small amount of texts."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-220",
"text": "In contrast, the methods based on word alignment model may suffer from data sparseness for parameter estimation, so the precision is lower."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-221",
"text": "2) However, when the size of the corpus increases, the precision of Syntax decreases, even worse than alignment based methods."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-222",
"text": "We believe it's because more noises were introduced from parsing errors with the increase of the size of the corpus , which will have more negative impacts on extraction results."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-223",
"text": "In contrast, for estimating the parameters of alignment based methods, the data is more sufficient, so the precision is better compared with syntax based method."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-224",
"text": "3) We also observe that recall of Syntax is worse than other two methods."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-225",
"text": "It's because the human expressions of opinions are diverse and the manual designed syntactic patterns are limited to capture all opinion relations in sentences, which may miss an amount of correct opinion targets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-226",
"text": "4) It's interesting that the performance gap between these three methods is smaller with the increase of the size of the corpus (more than 50,000)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-227",
"text": "We guess the reason is that when the data is sufficient enough, we can obtain sufficient statistics for each opinion target."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-228",
"text": "In such situation, the graphbased ranking algorithm in the second component will be apt to be affected by the frequency information, so the final performance could not be sensitive to the performance of opinion relations iden-tification in the first component."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-229",
"text": "Thus, in this situation, we can get conclusion that there is no obviously difference on performance between syntaxbased approach and alignment-based approach."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-230",
"text": "5) From the results on dataset with different languages and different domains, we can obtain the similar observations."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-252",
"text": "PSWAM outperforms other methods in most datasets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-231",
"text": "It indicates that choosing either syntactic patterns or word alignment model for extracting opinion targets can take a few consideration on the language and domain of the corpus."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-232",
"text": "Thus, based on the above observations, we can draw the following conclusions: making chooses between different methods is only related to the size of the corpus."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-233",
"text": "The method based on syntactic patterns is more suitable for small corpus (#sentences < 5 \u00d7 10 3 shown in our experiments)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-234",
"text": "And word alignment model is more suitable for medium corpus (5 \u00d7 10 3 < #sentences < 5 \u00d7 10 4 )."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-235",
"text": "Moreover, when the size of the corpus is big enough, the performance of two kinds of methods tend to become the same (#sentences \u2265 10 5 shown in our experiments)."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-236",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-237",
"text": "**IS IT USEFUL COMBINING SYNTACTIC PATTERNS WITH WORD ALIGNMENT MODEL**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-238",
"text": "In this subsection, we try to see whether combining syntactic information with alignment model by using PSWAM is effective or not for opinion target extraction."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-239",
"text": "From the results in Figure 3 , 4 and 5, we can see that PSWAM has the similar recall compared with WAM in all datasets."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-240",
"text": "PSWAM outperforms WAM on precision in all dataset."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-241",
"text": "But the precision gap between PSWAM and WAM decreases when the size of the corpus increases."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-242",
"text": "When the size is larger than 5 \u00d7 10 4 , the performance of these two methods is almost the same."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-243",
"text": "We guess the reason is that more noises from parsing errors will be introduced by syntactic patterns with the increase of the size of corpus , which have negative impacts on alignment performance."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-244",
"text": "At the same time, as mentioned above, a great deal of reviews will bring sufficient statistics for estimating parameters in alignment model, so the roles of partial supervision from syntactic information will be covered by frequency information used in our graph based ranking algorithm."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-245",
"text": "Compared with State-of-the-art Methods."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-246",
"text": "However, it's not say that this combination is not useful."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-247",
"text": "From the results, we still see that PSWAM outperforms WAM in all datasets on precision when size of corpus is smaller than 5 \u00d7 10 4 ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-248",
"text": "To further prove the effectiveness of our combination, we compare PSWAM with some state-of-the-art methods, including Hu (Hu and Liu, 2004a) , which extracted frequent opinion target words based on association mining rules, DP (Qiu et al., 2011) , which extracted opinion targets through syntactic patterns, and LIU (Liu et al., 2012) , which fulfilled this task by using unsupervised WAM."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-249",
"text": "The parameter settings in these baselines are the same as the settings in the original papers."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-250",
"text": "Because of the space limitation, we only show the results on Restaurant and Hotel, as shown in Figure 6 and 7."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-251",
"text": "From the experimental results, we can obtain the following observations."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-254",
"text": "Especially compared PSWAM with LIU, both of which are based on word alignment model, we can see PSWAM identifies opinion relations by performing WAM under partial supervision, which can effectively improve the precision when dealing with small and medium corpus."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-255",
"text": "However, these improvements are limited when the size of the corpus increases, which has the similar observations obtained above."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-256",
"text": "The Impact of Syntactic Information on Word Alignment Model."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-257",
"text": "Although we have prove the effectiveness of PSWAM in the corpus with small and medium size, we are still curious about how the performance varies when we incor-porate different amount of syntactic information into WAM."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-258",
"text": "In this experiment, we rank the used syntactic patterns mentioned in Section 3.1.1 according to the quantities of the extracted alignment links by these patterns."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-259",
"text": "Then, to capture opinion relations, we respectively use top N syntactic patterns according to frequency mentioned above to generate partial alignment links for PSWAM in section 3.1.3."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-260",
"text": "We respectively define N= [1, 7] ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-261",
"text": "The larger is N , the more syntactic information is incorporated."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-262",
"text": "Because of the space limitation, only the average performance of all dataset is shown in Figure 8 ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-263",
"text": "In Figure 8 , we can observe that the syntactic information mainly have effect on precision."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-264",
"text": "When the size of the corpus is small, the opinion relations mined by high-precision syntactic patterns are usually correct, so incorporating more syntactic information can improve the precision of word alignment model more."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-265",
"text": "However, when the size of the corpus increases, incorporating more syntactic information has little impact on precision."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-266",
"text": "----------------------------------"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-267",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-268",
"text": "This paper discusses the performance variation of syntax based methods and alignment based methods on opinion target extraction task for the dataset with different sizes, different languages and different domains."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-269",
"text": "Through experimental results, we can see that choosing which method is not related with corpus domain and language, but strongly associated with the size of the corpus ."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-270",
"text": "We can conclude that syntax-based method is likely to be more effective when the size of the corpus is small, and alignment-based methods are more useful for the medium size corpus."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-271",
"text": "We further verify that incorporating syntactic information into word alignment model by using PSWAM is effective when dealing with the corpora with small or medium size."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-272",
"text": "When the size of the corpus is larger and larger, the performance gap between syntax based, WAM and PSWAM will decrease."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-273",
"text": "In future work, we will extract opinion targets based on not only opinion relations."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-274",
"text": "Other semantic relations, such as the topical associations between opinion targets (or opinion words) should also be employed."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-275",
"text": "We believe that considering multiple semantic associations will help to improve the performance."
},
{
"sent_id": "304773c64de1f0906f0246f2aa0d29-C001-276",
"text": "In this way, how to model heterogenous relations in a unified model for opinion targets extraction is worthy to be studied."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"304773c64de1f0906f0246f2aa0d29-C001-17"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-27"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-80"
]
],
"cite_sentences": [
"304773c64de1f0906f0246f2aa0d29-C001-17",
"304773c64de1f0906f0246f2aa0d29-C001-27",
"304773c64de1f0906f0246f2aa0d29-C001-80"
]
},
"@MOT@": {
"gold_contexts": [
[
"304773c64de1f0906f0246f2aa0d29-C001-33"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-85"
]
],
"cite_sentences": [
"304773c64de1f0906f0246f2aa0d29-C001-33",
"304773c64de1f0906f0246f2aa0d29-C001-85"
]
},
"@USE@": {
"gold_contexts": [
[
"304773c64de1f0906f0246f2aa0d29-C001-90"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-122"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-167"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-172"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-184",
"304773c64de1f0906f0246f2aa0d29-C001-185"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-248"
]
],
"cite_sentences": [
"304773c64de1f0906f0246f2aa0d29-C001-90",
"304773c64de1f0906f0246f2aa0d29-C001-122",
"304773c64de1f0906f0246f2aa0d29-C001-167",
"304773c64de1f0906f0246f2aa0d29-C001-172",
"304773c64de1f0906f0246f2aa0d29-C001-184",
"304773c64de1f0906f0246f2aa0d29-C001-248"
]
},
"@SIM@": {
"gold_contexts": [
[
"304773c64de1f0906f0246f2aa0d29-C001-92"
],
[
"304773c64de1f0906f0246f2aa0d29-C001-193",
"304773c64de1f0906f0246f2aa0d29-C001-194"
]
],
"cite_sentences": [
"304773c64de1f0906f0246f2aa0d29-C001-92",
"304773c64de1f0906f0246f2aa0d29-C001-194"
]
}
}
},
"ABC_9426b2faf2ba633033c7dfcee4118b_8": {
"x": [
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-2",
"text": "Sequence to sequence learning has recently emerged as a new paradigm in supervised learning."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-3",
"text": "To date, most of its applications focused on only one task and not much work explored this framework for multiple tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-4",
"text": "This paper examines three multi-task learning (MTL) settings for sequence to sequence models: (a) the oneto-many setting -where the encoder is shared between several tasks such as machine translation and syntactic parsing, (b) the many-to-one setting -useful when only the decoder can be shared, as in the case of translation and image caption generation, and (c) the many-to-many setting -where multiple encoders and decoders are shared, which is the case with unsupervised objectives and translation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-5",
"text": "Our results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-6",
"text": "Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F 1 ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-7",
"text": "Lastly, we reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-10",
"text": "Multi-task learning (MTL) is an important machine learning paradigm that aims at improving the generalization performance of a task using other related tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-11",
"text": "Such framework has been widely studied by Thrun (1996) ; Caruana (1997) ; Evgeniou & Pontil (2004) ; Ando & Zhang (2005) ; Argyriou et al. (2007) ; Kumar & III (2012) , among many others."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-12",
"text": "In the context of deep neural networks, MTL has been applied successfully to various problems ranging from language (Liu et al., 2015) , to vision (Donahue et al., 2014) , and speech (Heigold et al., 2013; Huang et al., 2013) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-13",
"text": "Recently, sequence to sequence (seq2seq) learning, proposed by Kalchbrenner & Blunsom (2013) , Sutskever et al. (2014) , and Cho et al. (2014) , emerges as an effective paradigm for dealing with variable-length inputs and outputs."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-14",
"text": "seq2seq learning, at its core, uses recurrent neural networks to map variable-length input sequences to variable-length output sequences."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-15",
"text": "While relatively new, the seq2seq approach has achieved state-of-the-art results in not only its original application -machine translation - (Luong et al., 2015b; Jean et al., 2015a; Luong et al., 2015a; Jean et al., 2015b; Luong & Manning, 2015) , but also image caption generation , and constituency parsing (Vinyals et al., 2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-16",
"text": "Despite the popularity of multi-task learning and sequence to sequence learning, there has been little work in combining MTL with seq2seq learning."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-17",
"text": "To the best of our knowledge, there is only one recent publication by Dong et al. (2015) which applies a seq2seq models for machine translation, where the goal is to translate from one language to multiple languages."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-41",
"text": "Depending which tasks involved, we propose to categorize multi-task seq2seq learning into three general settings."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-121",
"text": "**RESULTS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-18",
"text": "In this work, we propose three MTL approaches that complement one another: (a) the one-to-many approach -for tasks that can have an encoder in common, such as translation and parsing; this applies to the multi-target translation setting in (Dong et al., 2015) as well, (b) the many-to-one approach -useful for multisource translation or tasks in which only the decoder can be easily shared, such as translation and image captioning, and lastly, (c) the many-to-many approach -which share multiple encoders and decoders through which we study the effect of unsupervised learning in translation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-19",
"text": "We show that syntactic parsing and image caption generation improves the translation quality between English (Sutskever et al., 2014) and (right) constituent parsing (Vinyals et al., 2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-20",
"text": "and German by up to +1.5 BLEU points over strong single-task baselines on the WMT benchmarks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-21",
"text": "Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F 1 ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-22",
"text": "We also explore two unsupervised learning objectives, sequence autoencoders (Dai & Le, 2015) and skip-thought vectors , and reveal their interesting properties in the MTL setting: autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-23",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-24",
"text": "**SEQUENCE TO SEQUENCE LEARNING**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-25",
"text": "Sequence to sequence learning (seq2seq) aims to directly model the conditional probability p(y|x) of mapping an input sequence, x 1 , . . . , x n , into an output sequence, y 1 , . ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-26",
"text": ". , y m ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-27",
"text": "It accomplishes such goal through the encoder-decoder framework proposed by Sutskever et al. (2014) and Cho et al. (2014) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-28",
"text": "As illustrated in Figure 1 , the encoder computes a representation s for each input sequence."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-29",
"text": "Based on that input representation, the decoder generates an output sequence, one unit at a time, and hence, decomposes the conditional probability as:"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-30",
"text": "A natural model for sequential data is the recurrent neural network (RNN), which is used by most of the recent seq2seq work."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-31",
"text": "These work, however, differ in terms of: (a) architecture -from unidirectional, to bidirectional, and deep multi-layer RNNs; and (b) RNN type -which are long-short term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the gated recurrent unit (Cho et al., 2014) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-32",
"text": "Another important difference between seq2seq work lies in what constitutes the input representation s. The early seq2seq work (Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015b; Vinyals et al., 2015b) uses only the last encoder state to initialize the decoder and sets s = [ ] in Eq. (1)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-33",
"text": "Recently, Bahdanau et al. (2015) proposes an attention mechanism, a way to provide seq2seq models with a random access memory, to handle long input sequences."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-34",
"text": "This is accomplished by setting s in Eq."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-35",
"text": "(1) to be the set of encoder hidden states already computed."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-36",
"text": "On the decoder side, at each time step, the attention mechanism will decide how much information to retrieve from that memory by learning where to focus, i.e., computing the alignment weights for all input positions."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-37",
"text": "Recent work such as Jean et al., 2015a; Luong et al., 2015a; Vinyals et al., 2015a) has found that it is crucial to empower seq2seq models with the attention mechanism."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-39",
"text": "**MULTI-TASK SEQUENCE-TO-SEQUENCE LEARNING**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-40",
"text": "We generalize the work of Dong et al. (2015) to the multi-task sequence-to-sequence learning setting that includes the tasks of machine translation (MT), constituency parsing, and image caption generation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-42",
"text": "In addition, we will discuss the unsupervised learning tasks considered as well as the learning process."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-44",
"text": "**ONE-TO-MANY SETTING**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-45",
"text": "This scheme involves one encoder and multiple decoders for tasks in which the encoder can be shared, as illustrated in Figure 2 ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-46",
"text": "The input to each task is a sequence of English words."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-47",
"text": "A separate decoder is used to generate each sequence of output units which can be either (a) a sequence of tags"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-49",
"text": "**ENGLISH (UNSUPERVISED)**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-50",
"text": "German (translation)"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-52",
"text": "**TAGS (PARSING) ENGLISH**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-53",
"text": "Figure 2: One-to-many Setting -one encoder, multiple decoders."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-54",
"text": "This scheme is useful for either multi-target translation as in Dong et al. (2015) or between different tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-55",
"text": "Here, English and German imply sequences of words in the respective languages."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-56",
"text": "The \u03b1 values give the proportions of parameter updates that are allocated for the different tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-57",
"text": "for constituency parsing as used in (Vinyals et al., 2015a) , (b) a sequence of German words for machine translation (Luong et al., 2015a) , and (c) the same sequence of English words for autoencoders or a related sequence of English words for the skip-thought objective ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-58",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-59",
"text": "**MANY-TO-ONE SETTING**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-60",
"text": "This scheme is the opposite of the one-to-many setting."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-61",
"text": "As illustrated in Figure 3 , it consists of multiple encoders and one decoder."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-62",
"text": "This is useful for tasks in which only the decoder can be shared, for example, when our tasks include machine translation and image caption generation ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-63",
"text": "In addition, from a machine translation perspective, this setting can benefit from a large amount of monolingual data on the target side, which is a standard practice in machine translation system and has also been explored for neural MT by Gulcehre et al. (2015) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-65",
"text": "**ENGLISH (UNSUPERVISED) IMAGE (CAPTIONING) ENGLISH**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-66",
"text": "German (translation) Figure 3 : Many-to-one setting -multiple encoders, one decoder."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-67",
"text": "This scheme is handy for tasks in which only the decoders can be shared."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-69",
"text": "**MANY-TO-MANY SETTING**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-70",
"text": "Lastly, as the name describes, this category is the most general one, consisting of multiple encoders and multiple decoders."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-71",
"text": "We will explore this scheme in a translation setting that involves sharing multiple encoders and multiple decoders."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-72",
"text": "In addition to the machine translation task, we will include two unsupervised objectives over the source and target languages as illustrated in Figure 4 ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-74",
"text": "**UNSUPERVISED LEARNING TASKS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-75",
"text": "Our very first unsupervised learning task involves learning autoencoders from monolingual corpora, which has recently been applied to sequence to sequence learning (Dai & Le, 2015) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-76",
"text": "However, in Dai & Le (2015) 's work, the authors only experiment with pretraining and then finetuning, but not joint training which can be viewed as a form of multi-task learning (MTL)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-77",
"text": "As such, we are very interested in knowing whether the same trend extends to our MTL settings."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-78",
"text": "Additionally, we investigate the use of the skip-thought vectors in the context of our MTL framework."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-79",
"text": "Skip-thought vectors are trained by training sequence to sequence models on pairs of consecutive sentences, which makes the skip-thought objective a natural seq2seq learning candidate."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-80",
"text": "A minor technical difficulty with skip-thought objective is that the training data must"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-82",
"text": "**GERMAN (TRANSLATION)**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-83",
"text": "English (unsupervised) German (unsupervised) English Figure 4 : Many-to-many setting -multiple encoders, multiple decoders."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-84",
"text": "We consider this scheme in a limited context of machine translation to utilize the large monolingual corpora in both the source and the target languages."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-85",
"text": "Here, we consider a single translation task and two unsupervised autoencoder tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-86",
"text": "consist of ordered sentences, e.g., paragraphs."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-87",
"text": "Unfortunately, in many applications that include machine translation, we only have sentence-level data where the sentences are unordered."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-88",
"text": "To address that, we split each sentence into two halves; we then use one half to predict the other half."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-89",
"text": "3.5 LEARNING Dong et al. (2015) adopted an alternating training approach, where they optimize each task for a fixed number of parameter updates (or mini-batches) before switching to the next task (which is a different language pair)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-90",
"text": "In our setting, our tasks are more diverse and contain different amounts of training data."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-91",
"text": "As a result, we allocate different numbers of parameter updates for each task, which are expressed with the mixing ratio values \u03b1 i (for each task i)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-92",
"text": "Each parameter update consists of training data from one task only."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-93",
"text": "When switching between tasks, we select randomly a new task i with probability \u03b1i j \u03b1j ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-94",
"text": "Our convention is that the first task is the reference task with \u03b1 1 = 1.0 and the number of training parameter updates for that task is prespecified to be N ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-95",
"text": "A typical task i will then be trained for \u03b1i \u03b11 \u00b7N parameter updates."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-96",
"text": "Such convention makes it easier for us to fairly compare the same reference task in a single-task setting which has also been trained for exactly N parameter updates."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-97",
"text": "When sharing an encoder or a decoder, we share both the recurrent connections and the corresponding embeddings."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-99",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-100",
"text": "We evaluate the multi-task learning setup on a wide variety of sequence-to-sequence tasks: constituency parsing, image caption generation, machine translation, and a number of unsupervised learning as summarized in Table 1."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-101",
"text": "4.1 DATA Our experiments are centered around the translation task, where we aim to determine whether other tasks can improve translation and vice versa."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-102",
"text": "We use the WMT'15 data (Bojar et al., 2015) for the English\u21c6German translation problem."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-103",
"text": "Following Luong et al. (2015a) , we use the 50K most frequent words for each language from the training corpus."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-104",
"text": "1 These vocabularies are then shared with other tasks, except for parsing in which the target \"language\" has a vocabulary of 104 tags."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-105",
"text": "We use newstest2013 (3000 sentences) as a validation set to select our hyperparameters, e.g., mixing coefficients."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-106",
"text": "For testing, to be comparable with existing results in (Luong et al., 2015a) For the unsupervised tasks, we use the English and German monolingual corpora from WMT'15."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-107",
"text": "4 Since in our experiments, unsupervised tasks are always coupled with translation tasks, we use the same validation and test sets as the accompanied translation tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-108",
"text": "For constituency parsing, we experiment with two types of corpora:"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-109",
"text": "1. a small corpus -the widely used Penn Tree Bank (PTB) dataset (Marcus et al., 1993) and, 2. a large corpus -the high-confidence (HC) parse trees provided by Vinyals et al. (2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-110",
"text": "The two parsing tasks, however, are evaluated on the same validation (section 22) and test (section 23) sets from the PTB data."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-111",
"text": "Note also that the parse trees have been linearized following Vinyals et al. (2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-112",
"text": "Lastly, for image caption generation, we use a dataset of image and caption pairs provided by Vinyals et al. (2015b) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-114",
"text": "**TRAINING DETAILS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-115",
"text": "In all experiments, following Sutskever et al. (2014) and Luong et al. (2015b) , we train deep LSTM models as follows: (a) we use 4 LSTM layers each of which has 1000-dimensional cells and embeddings, 5 (b) parameters are uniformly initialized in [-0.06, 0.06] , (c) we use a mini-batch size of 128, (d) dropout is applied with probability of 0.2 over vertical connections (Pham et al., 2014) , (e) we use SGD with a fixed learning rate of 0.7, (f) input sequences are reversed, and lastly, (g) we use a simple finetuning schedule -after x epochs, we halve the learning rate every y epochs."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-116",
"text": "The values x and y are referred as finetune start and finetune cycle in Table 1 together with the number of training epochs per task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-117",
"text": "As described in Section 3, for each multi-task experiment, we need to choose one task to be the reference task (which corresponds to \u03b1 1 = 1)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-118",
"text": "The choice of the reference task helps specify the number of training epochs and the finetune start/cycle values which we also when training that reference task alone for fair comparison."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-119",
"text": "To make sure our findings are reliable, we run each experimental configuration twice and report the average performance in the format mean (stddev)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-122",
"text": "We explore several multi-task learning scenarios by combining a large task (machine translation) with: (a) a small task -Penn Tree Bank (PTB) parsing, (b) a medium-sized task -image caption generation, (c) another large task -parsing on the high-confidence (HC) corpus, and (d) lastly, unsupervised tasks, such as autoencoders and skip-thought vectors."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-123",
"text": "In terms of evaluation metrics, we report both validation and test perplexities for all tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-124",
"text": "Additionally, we also compute test BLEU scores (Papineni et al., 2002) for the translation task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-125",
"text": "4 The training sizes reported for the unsupervised tasks are only 10% of the original WMT'15 monolingual corpora which we randomly sample from."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-126",
"text": "Such reduced sizes are for faster training time and already about three times larger than that of the parallel data."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-127",
"text": "We consider using all the monolingual data in future work."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-128",
"text": "5 For image caption generation, we use 1024 dimensions, which is also the size of the image embeddings."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-130",
"text": "**LARGE TASKS WITH SMALL TASKS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-131",
"text": "In this setting, we want to understand if a small task such as PTB parsing can help improve the performance of a large task such as translation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-132",
"text": "Since the parsing task maps from a sequence of English words to a sequence of parsing tags (Vinyals et al., 2015a) , only the encoder can be shared with an English\u2192German translation task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-133",
"text": "As a result, this is a one-to-many MTL scenario ( \u00a73.1)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-134",
"text": "To our surprise, the results in Table 2 suggest that by adding a very small number of parsing minibatches (with mixing ratio 0.01, i.e., one parsing mini-batch per 100 translation mini-batches), we can improve the translation quality substantially."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-135",
"text": "More concretely, our best multi-task model yields a gain of +1.5 BLEU points over the single-task baseline."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-136",
"text": "It is worth pointing out that as shown in Table 2 , our single-task baseline is very strong, even better than the equivalent non-attention model reported in (Luong et al., 2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-137",
"text": "Larger mixing coefficients, however, overfit the small PTB corpus; hence, achieve smaller gains in translation quality."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-138",
"text": "For parsing, as Vinyals et al. (2015a) have shown that attention is crucial to achieve good parsing performance when training on the small PTB corpus, we do not set a high bar for our attention-free systems in this setup (better performances are reported in Section 4.3.3)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-139",
"text": "Nevertheless, the parsing results in Table 2 indicate that MTL is also beneficial for parsing, yielding an improvement of up to +8.9 F 1 points over the baseline."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-140",
"text": "6 It would be interesting to study how MTL can be useful with the presence of the attention mechanism, which we leave for future work."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-142",
"text": "**TASK**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-144",
"text": "**LARGE TASKS WITH MEDIUM TASKS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-145",
"text": "We investigate whether the same pattern carries over to a medium task such as image caption generation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-146",
"text": "Since the image caption generation task maps images to a sequence of English words Xu et al., 2015) , only the decoder can be shared with a German\u2192English translation task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-147",
"text": "Hence, this setting falls under the many-to-one MTL setting ( \u00a73.2)."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-148",
"text": "The results in Table 3 show the same trend we observed before, that is, by training on another task for a very small fraction of time, the model improves its performance on its main task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-149",
"text": "Specifically, with 5 parameter updates for image caption generation per 100 updates for translation (so the mixing ratio of 0.05), we obtain a gain of +0.7 BLEU scores over a strong single-task baseline."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-150",
"text": "Our baseline is almost a BLEU point better than the equivalent non-attention model reported in Luong et al. (2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-151",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-152",
"text": "**LARGE TASKS WITH LARGE TASKS**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-153",
"text": "Our first set of experiments is almost the same as the one-to-many setting in Section 4.3.1 which combines translation, as the reference task, with parsing."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-154",
"text": "The only difference is in terms of parsing Table 2 ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-155",
"text": "Reference tasks are in italic with mixing ratios in parentheses."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-156",
"text": "The average results of 2 runs are in mean (stddev) format."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-157",
"text": "data."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-158",
"text": "Instead of using the small Penn Tree Bank corpus, we consider a large parsing resource, the high-confidence (HC) corpus, which is provided by Vinyals et al. (2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-159",
"text": "As highlighted in Table 4 , the trend is consistent; MTL helps boost translation quality by up to +0.9 BLEU points."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-160",
"text": "Table 4 : English\u2192German WMT'14 translation -shown are perplexities (ppl) and BLEU scores of various translation models."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-161",
"text": "Our multi-task systems combine translation and parsing on the highconfidence corpus together."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-162",
"text": "Mixing ratios are in parentheses and the average results over 2 runs are in mean (stddev) format."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-163",
"text": "Best results are bolded."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-165",
"text": "**TASK TRANSLATION**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-166",
"text": "The second set of experiments shifts the attention to parsing by having it as the reference task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-167",
"text": "We show in Table 5 results that combine parsing with either (a) the English autoencoder task or (b) the English\u2192German translation task."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-168",
"text": "Our models are compared against the best attention-based systems in (Vinyals et al., 2015a) , including the state-of-the-art result of 92.8 F 1 ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-169",
"text": "Before discussing the multi-task results, we note a few interesting observations."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-170",
"text": "First, very small parsing perplexities, close to 1.1, can be achieved with large training data."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-171",
"text": "7 Second, our baseline system can obtain a very competitive F 1 score of 92.2, rivaling Vinyals et al. (2015a) 's systems."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-172",
"text": "This is rather surprising since our models do not use any attention mechanism."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-196",
"text": "Furthermore, we have established a new state-of-the-art result in constituent parsing with an ensemble of multi-task models."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-173",
"text": "A closer look into these models reveal that there seems to be an architectural difference: Vinyals et al. (2015a) use 3-layer LSTM with 256 cells and 512-dimensional embeddings; whereas our models use 4-layer LSTM with 1000 cells and 1000-dimensional embeddings."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-174",
"text": "This further supports findings in (Jozefowicz et al., 2016) that larger networks matter for sequence models."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-175",
"text": "For the multi-task results, while autoencoder does not seem to help parsing, translation does."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-176",
"text": "At the mixing ratio of 0.05, we obtain a non-negligible boost of 0.2 F 1 over the baseline and with 92.4 F 1 , our multi-task system is on par with the best single system reported in (Vinyals et al., 2015a) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-177",
"text": "Furthermore, by ensembling 6 different multi-task models (trained with the translation task at mixing ratios of 0.1, 0.05, and 0.01), we are able to establish a new state-of-the-art result in English constituent parsing with 93.0 F 1 score."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-179",
"text": "**MULTI-TASKS AND UNSUPERVISED LEARNING**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-180",
"text": "Our main focus in this section is to determine whether unsupervised learning can help improve translation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-181",
"text": "Specifically, we follow the many-to-many approach described in Section 3.3 to couple the German\u2192English translation task with two unsupervised learning tasks on monolingual corpora, one per language."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-182",
"text": "The results in Tables 6 show a similar trend as before, a small amount of other tasks, in this case the autoencoder objective with mixing coefficient 0.05, improves the translation quality by +0.5 BLEU scores."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-183",
"text": "However, as we train more on the autoencoder task, i.e. with larger mixing ratios, the translation performance gets worse."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-184",
"text": "Skip-thought objectives, on the other hand, behave differently."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-185",
"text": "If we merely look at the perplexity metric, the results are very encouraging: with more skip-thought data, we perform better consistently across both the translation and the unsupervised tasks."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-186",
"text": "However, when computing the BLEU scores, the translation quality degrades as we increase the mixing coefficients."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-187",
"text": "We anticipate that this is due to the fact that the skip-thought objective changes the nature of the translation task when using one half of a sentence to predict the other half."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-188",
"text": "It is not a problem for the autoencoder objectives, however, since one can think of autoencoding a sentence as translating into the same language."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-190",
"text": "**TASK**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-191",
"text": "We believe these findings pose interesting challenges in the quest towards better unsupervised objectives, which should satisfy the following criteria: (a) a desirable objective should be compatible with the supervised task in focus, e.g., autoencoders can be viewed as a special case of translation, and (b) with more unsupervised data, both intrinsic and extrinsic metrics should be improved; skip-thought objectives satisfy this criterion in terms of the intrinsic metric but not the extrinsic one."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-193",
"text": "**CONCLUSION**"
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-194",
"text": "In this paper, we showed that multi-task learning (MTL) can improve the performance of the attention-free sequence to sequence model of (Sutskever et al., 2014) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-195",
"text": "We found it surprising that training on syntactic parsing and image caption data improved our translation performance, given that these datasets are orders of magnitude smaller than typical translation datasets."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-197",
"text": "We also show that the two unsupervised learning objectives, autoencoder and skip-thought, behave differently in the MTL context involving translation."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-198",
"text": "We hope that these interesting findings will motivate future work in utilizing unsupervised data for sequence to sequence learning."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-199",
"text": "A criticism of our work is that our sequence to sequence models do not employ the attention mechanism (Bahdanau et al., 2015) ."
},
{
"sent_id": "9426b2faf2ba633033c7dfcee4118b-C001-200",
"text": "We leave the exploration of MTL with attention for future work."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"9426b2faf2ba633033c7dfcee4118b-C001-15"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-37"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-132"
]
],
"cite_sentences": [
"9426b2faf2ba633033c7dfcee4118b-C001-15",
"9426b2faf2ba633033c7dfcee4118b-C001-37",
"9426b2faf2ba633033c7dfcee4118b-C001-132"
]
},
"@MOT@": {
"gold_contexts": [
[
"9426b2faf2ba633033c7dfcee4118b-C001-15",
"9426b2faf2ba633033c7dfcee4118b-C001-16"
]
],
"cite_sentences": [
"9426b2faf2ba633033c7dfcee4118b-C001-15"
]
},
"@DIF@": {
"gold_contexts": [
[
"9426b2faf2ba633033c7dfcee4118b-C001-19"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-172",
"9426b2faf2ba633033c7dfcee4118b-C001-173"
]
],
"cite_sentences": [
"9426b2faf2ba633033c7dfcee4118b-C001-19",
"9426b2faf2ba633033c7dfcee4118b-C001-173"
]
},
"@USE@": {
"gold_contexts": [
[
"9426b2faf2ba633033c7dfcee4118b-C001-108",
"9426b2faf2ba633033c7dfcee4118b-C001-109"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-111"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-158"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-168"
]
],
"cite_sentences": [
"9426b2faf2ba633033c7dfcee4118b-C001-109",
"9426b2faf2ba633033c7dfcee4118b-C001-111",
"9426b2faf2ba633033c7dfcee4118b-C001-158",
"9426b2faf2ba633033c7dfcee4118b-C001-168"
]
},
"@SIM@": {
"gold_contexts": [
[
"9426b2faf2ba633033c7dfcee4118b-C001-138",
"9426b2faf2ba633033c7dfcee4118b-C001-139"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-171"
],
[
"9426b2faf2ba633033c7dfcee4118b-C001-176"
]
],
"cite_sentences": [
"9426b2faf2ba633033c7dfcee4118b-C001-138",
"9426b2faf2ba633033c7dfcee4118b-C001-171",
"9426b2faf2ba633033c7dfcee4118b-C001-176"
]
}
}
},
"ABC_74568758fe5fef3727d94e7597f305_8": {
"x": [
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-2",
"text": "This paper explores the segmentation of tutorial dialogue into cohesive topics."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-3",
"text": "A latent semantic space was created using conversations from human to human tutoring transcripts, allowing cohesion between utterances to be measured using vector similarity."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-4",
"text": "Previous cohesionbased segmentation methods that focus on expository monologue are reapplied to these dialogues to create benchmarks for performance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-5",
"text": "A novel moving window technique using orthonormal bases of semantic vectors significantly outperforms these benchmarks on this dialogue segmentation task."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-8",
"text": "Ever since Morris and Hirst (1991) 's groundbreaking paper, topic segmentation has been a steadily growing research area in computational linguistics, with applications in summarization (Barzilay and Elhadad, 1997) , information retrieval (Salton and Allan, 1994) , and text understanding (Kozima, 1993) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-9",
"text": "Topic segmentation likewise has multiple educational applications, such as question answering, detecting student initiative, and assessing student answers."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-10",
"text": "There have been essentially two approaches to topic segmentation in the past."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-11",
"text": "The first of these, lexical cohesion, may be used for either linear segmentation (Morris and Hirst, 1991; Hearst, 1997) or hierarchical segmentation (Yarri, 1997; Choi, 2000) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-12",
"text": "The essential idea behind the lexical cohesion approaches is that different topics will have different vocabularies."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-13",
"text": "Therefore the lexical cohesion within topics will be higher than the lexical cohesion between topics, and gaps in cohesion may mark topic boundaries."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-14",
"text": "The second major approach to topic segmentation looks for distinctive textual or acoustic markers of topic boundaries, e.g. referential noun phrases or pauses (Passonneau and Litman, 1993; Passonneau and Litman, 1997) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-15",
"text": "By using multiple markers and machine learning methods, topic segmentation algorithms may be developed using this second approach that have a higher accuracy than methods using a single marker alone (Passonneau and Litman, 1997) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-16",
"text": "The primary technique used in previous studies, lexical cohesion, is no stranger to the educational NLP community."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-17",
"text": "Lexical cohesion measured by latent semantic analysis (LSA) (Landauer and Dumais, 1997; Dumais, 1993; Manning and Sch\u00fctze, 1999) has been used in automated essay grading (Landauer, Foltz, and Laham, 1998) and in understanding student input during tutorial dialogue (Graesser et al., 2001 )."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-18",
"text": "The present paper investigates an orthonormal basis of LSA vectors, currently used by the AutoTutor ITS to assess student answers (Hu et al., 2003) , and how it may be used to segment tutorial dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-19",
"text": "The focus on dialogue distinguishes our work from virtually all previous work on topic segmentation: prior studies have focused on monologue rather than dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-20",
"text": "Without dialogue, previous approaches have only limited relevance to interactive educational applications such as intelligent tutoring systems (ITS)."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-21",
"text": "The only existing work on topic segmentation in dialogue, Galley et al. (2003) , segments recorded speech between multiple persons using both lexical cohesion and dis-tinctive textual and acoustic markers."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-22",
"text": "The present work differs from Galley et al. (2003) in two respects, viz. we focus solely on textual information and we directly address the problem of tutorial dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-23",
"text": "In this study we apply the methods of Foltz et al. (1998) , Hearst (1994 Hearst ( , 1997 , and a new technique utilizing an orthonormal basis to topic segmentation of tutorial dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-24",
"text": "All three are vector space methods that measure lexical cohesion to determine topic shifts."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-25",
"text": "Our results show that the new using an orthonormal basis significantly outperforms the other methods."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-26",
"text": "Section 2 reviews previous work, and Section 3 reviews the vector space model."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-27",
"text": "Section 4 introduces an extension of the vector space model which uses an orthonormal basis."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-28",
"text": "Section 5 outlines the task domain of tutorial dialogue, and Section 6 presents the results of previous and the current method on this task domain."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-29",
"text": "A discussion and comparison of these results takes place in Section 7."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-30",
"text": "Section 8 concludes."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-32",
"text": "**PREVIOUS WORK**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-33",
"text": "Though the idea of using lexical cohesion to segment text has the advantages of simplicity and intuitive appeal, it lacks a unique implementation."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-34",
"text": "An implementation must define how to represent units of text, compare the cohesion between units, and determine whether the results of comparison indicate a new text segment."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-35",
"text": "Both Hearst (1994 Hearst ( , 1997 and Foltz et al. (1998) use vector space methods discussed below to represent and compare units of text."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-36",
"text": "The comparisons can be characterized by a moving window, where successive overlapping comparisons are advanced by one unit of text."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-37",
"text": "However, Hearst (1994 Hearst ( , 1997 and Foltz et al. (1998) differ on how text units are defined and on how to interpret the results of a comparison."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-38",
"text": "The text unit's definition in Hearst (1994 Hearst ( , 1997 and Foltz et al. (1998) is generally task dependent, depending on what size gives the best results."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-39",
"text": "For example, when measuring comprehension, use the unit of the sentence, as opposed to the more standard unit of the proposition, because LSA is most correlated with comprehension at that level."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-40",
"text": "However, when using LSA to segment text, Foltz et al. (1998) use the paragraph as the unit, to \"smooth out\" the local changes in cohesion and become more sensitive to more global changes of cohesion."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-41",
"text": "Hearst likewise chooses a large unit, 6 token-sequences of 20 tokens (Hearst, 1994) , but varies these parameters dependent on the characteristics of the text to be segmented, e.g. paragraph size."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-42",
"text": "Under a vector space model, comparisons are performed by calculating the cosine of vectors representing text."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-43",
"text": "As stated previously, these comparisons reflect the cohesion between units of text."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-44",
"text": "In order to use these comparisons to segment text, however, one must have a criterion in place."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-45",
"text": "Foltz et al. (1998) , noting mean cosines of .16 for boundaries and .43 for non-boundaries, choose a threshold criterion of .15, which is two standard deviations below the boundary mean of .43."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-46",
"text": "Using LSA and this criterion, Foltz et al. (1998) detected chapter boundaries with an F-measure of .33 (see Manning and Sch\u00fctze (1999) for a definition of Fmeasure)."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-47",
"text": "Hearst (1994 Hearst ( , 1997 in contrast uses a relative comparison of cohesion, by recasting vector comparisons as depth scores."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-48",
"text": "A depth score is computed as the difference between a given vector comparison and its surrounding peaks, i.e. the local maxima of vector comparisons on either side of the given vector comparison."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-49",
"text": "The greater the difference between a given comparison and its surrounding peaks, the higher the depth score."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-50",
"text": "Once all the depth scores are calculated for a text, those that are higher than one standard deviation below the mean are taken as topic boundaries."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-51",
"text": "Using a vector space method without singular value decomposition, Hearst (1997) reports an F-measure of .70 when detecting topic shifts between paragraphs."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-52",
"text": "Thus previous work suggests that the Hearst (1997) method is superior to that of Foltz et al. (1998) , having roughly twice the accuracy indicated by F-measure."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-53",
"text": "Although these two results used different data sets and are therefore not directly comparable, one would predict based on this limited evidence that the Hearst algorithm would outperform the Foltz algorithm on other topic segmentation tasks."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-55",
"text": "**THE VECTOR SPACE MODEL**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-56",
"text": "The vector space model is a statistical technique that represents the similarity between collections of words as a cosine between vectors (Manning and Sch\u00fctze, 1999) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-57",
"text": "The process begins by collecting text into a corpus."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-58",
"text": "A matrix is created from the corpus, having one row for each unique word in the corpus and one column for each document or paragraph."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-59",
"text": "The cells of the matrix consist of a simple count of the number of times word i appeared in document j. Since many words do not appear in any given document, the matrix is often sparse."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-60",
"text": "Weightings are applied to the cells that take into account the frequency of word i in document j and the frequency of word i across all documents, such that distinctive words that appear infrequently are given the most weight."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-61",
"text": "Two collections of words of arbitrary size are compared by creating two vectors."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-62",
"text": "Each word is associated with a row vector in the matrix, and the vector of a collection is simply the sum of all the row vectors of words in that collection."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-63",
"text": "Vectors are compared geometrically by the cosine of the angle between them."
},
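The comparison procedure above can be sketched with a toy word-by-document matrix (the vocabulary and counts are illustrative only; a real system would apply the weighting scheme described earlier):

```python
import numpy as np

# Toy word-by-document count matrix: one row per word, one column per
# document. Real systems would weight the raw counts, not use them as-is.
vocab = ["force", "velocity", "egg", "clown"]
M = np.array([[3, 0],
              [2, 1],
              [0, 4],
              [0, 2]], dtype=float)

def collection_vector(words):
    """The vector of a collection is the sum of its words' row vectors."""
    return M[[vocab.index(w) for w in words]].sum(axis=0)

def cosine(u, v):
    """Geometric comparison: cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Two collections that load on the same documents get a cosine near 1; collections with disjoint document profiles get a cosine near 0.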
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-64",
"text": "LSA (Landauer and Dumais, 1997; Dumais 1993 ) is an extension of the vector space model that uses singular value decomposition (SVD)."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-65",
"text": "SVD is a technique that creates an approximation of the original word by document matrix."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-66",
"text": "After SVD, the original matrix is equal to the product of three matrices, word by singular value, singular value by singular value, and singular value by document."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-67",
"text": "The size of each singular value corresponds to the amount of variance captured by a particular dimension of the matrix."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-68",
"text": "Because the singular values are ordered in decreasing size, it is possible to remove the smaller dimensions and still account for most of the variance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-69",
"text": "The approximation to the original matrix is optimal, in the least squares sense, for any number of dimensions one would choose."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-70",
"text": "In addition, the removal of smaller dimensions introduces linear dependencies between words that are distinct only in dimensions that account for the least variance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-71",
"text": "Consequently, two words that were distant in the original space can be near in the compressed space, causing the inductive machine learning and knowledge acquisition effects reported in the literature (Landauer and Dumais, 1997) ."
},
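The SVD step can be sketched with numpy on a toy matrix (illustrative data; by the Eckart-Young theorem, the dropped singular values give exactly the least-squares error of the truncated approximation):

```python
import numpy as np

# Toy word-by-document matrix.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 3.],
              [0., 1., 2.]])

# SVD factors A into (word x singular value) U, the singular values s
# in decreasing order, and (singular value x document) Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Remove the smallest dimension: the rank-2 "compressed space".
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Words that differed only along the discarded dimension become linearly dependent in `A_k`, which is the source of the inductive effects mentioned above.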
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-73",
"text": "**AN ORTHONORMAL BASIS**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-74",
"text": "Cohesion can be measured by comparing the cosines of two successive sentences or paragraphs (Foltz, Kintsch, and Landauer, 1998) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-75",
"text": "However, cohesion is a crude measure: repetitions of a single sentence will be highly cohesive (cosine of 1) even though no new information is introduced."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-76",
"text": "A variation of the LSA algorithm using orthonormalized vectors provides two new measures, \"informativity\" and \"relevance\", which can detect how much new information is added and how relevant it is in a context (Hu et al., 2003) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-77",
"text": "The essential idea is to represent context by an orthonormalized basis of vectors, one vector for each utterance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-78",
"text": "The basis is a subspace of the higher dimensional LSA space, in the same way as a plane or line is a subspace of 3D space."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-79",
"text": "The basis is created by projecting each utterance vector onto the basis of previous utterance vectors using a method known as the Gram-Schmidt process (Anton, 2000)."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-80",
"text": "Each projected utterance vector has two components, a component parallel to the basis and a component perpendicular to the basis."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-81",
"text": "These two components represent \"relevance\" and \"informativity\", respectively. Let us first consider \"relevance\"."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-82",
"text": "Since each vector in the basis is orthogonal, the basis represents all linear combinations of what has been previously said."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-83",
"text": "Therefore the component of a new utterance vector that is parallel to the basis is already represented by a linear combination of the existing vectors."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-84",
"text": "\"Informativity\" follows similarly: it is the perpendicular component of a new utterance vector that can not be represented by the existing basis vectors."
},
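The projection described above can be sketched as a minimal Gram-Schmidt update (an illustration assuming unit-normalized utterance vectors, not the authors' exact implementation):

```python
import numpy as np

def update_basis(basis, u, tol=1e-10):
    """Project utterance vector u onto the orthonormal basis of previous
    utterances. The length of the parallel component is 'relevance'
    (already expressible as a linear combination of the basis); the
    length of the perpendicular residual is 'informativity' and, when
    nonzero, the normalized residual joins the basis (Gram-Schmidt)."""
    parallel = sum((u @ b) * b for b in basis) if basis else np.zeros_like(u)
    perp = u - parallel
    relevance = float(np.linalg.norm(parallel))
    informativity = float(np.linalg.norm(perp))
    if informativity > tol:
        basis = basis + [perp / informativity]
    return basis, relevance, informativity

basis = []
basis, r1, i1 = update_basis(basis, np.array([1., 0., 0.]))
# A verbatim repetition adds no new information: informativity falls to 0,
# while relevance is maximal -- exactly the crude-cohesion problem noted above.
basis, r2, i2 = update_basis(basis, np.array([1., 0., 0.]))
```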
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-85",
"text": "For example, in Figure 1"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-87",
"text": "**PROCEDURE**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-88",
"text": "The task domain is a subset of conversations from human-human computer mediated tutoring sessions on Newton's Three Laws of Motion, in which tutor and tutee engaged in a chat room-style conversation."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-89",
"text": "The benefits of this task domain are twofold."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-90",
"text": "Firstly, the conversations are already transcribed."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-91",
"text": "Additionally, tutors were instructed to introduce problems using a fixed set of scripted problem statements."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-92",
"text": "Therefore each topic shift corresponds to a distinct problem introduced by the tutor."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-93",
"text": "Clearly this problem would be trivial for a cue phrase based approach, which could learn the finite set of problem introductions."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-94",
"text": "However, the current lexical approach does not have this luxury: words in the problem statements recur throughout the following dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-95",
"text": "Human-to-human computer-mediated physics tutoring transcripts were first stripped of all markup and converted to lower case, and each utterance was broken into a separate paragraph."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-96",
"text": "An LSA space was made with these paragraphs alone, approximately one megabyte of text."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-97",
"text": "The conversations were then randomly assigned to training (21 conversations) and testing (22 conversations)."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-98",
"text": "The average number of utterances per topic, 16 utterances, and the average number of words per utterance, 32 words, were calculated to determine the parameters of the segmentation methods."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-99",
"text": "For example, a moving window size greater than 16 utterances implies that, in the majority of occurrences, the moving window straddles three topics as opposed to the desired two."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-100",
"text": "To replicate Foltz et al. (1998) , software was written in Java that created a moving window of varying sizes on the input text, and the software retrieved the LSA vector and calculated the cosine of each window."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-101",
"text": "Hearst (1994, 1997) was replicated using the JTextTile (Choi, 1999) Java software."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-102",
"text": "A variant of Hearst (1994, 1997) was created by using LSA instead of the standard vector space method."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-103",
"text": "The orthonormal basis method also used a moving window; however, in contrast to the previous methods, the window is not treated just as a large block of text."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-104",
"text": "Instead, the window consists of two orthonormal bases, one on either side of an utterance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-105",
"text": "That is, a region of utterances above the test utterance is projected, utterance by utterance, into an orthonormal basis, and likewise a region of utterances below the test utterance is projected into another orthonormal basis."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-106",
"text": "Then the test utterance is projected into each orthonormal basis, yielding measures of \"relevance\" and \"informativity\" with respect to each."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-107",
"text": "Next the elements that make up each orthonormal basis are aggregated into a block, and a cosine is calculated between the test utterance and the blocks on either side, producing a total of six measures."
},
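The six measures per test utterance can be sketched as follows (tiny orthogonal toy vectors stand in for real LSA utterance vectors; this is an illustration of the procedure, not the study's software):

```python
import numpy as np

def project(basis, u):
    """Split u against an orthonormal basis; return the lengths of the
    parallel ('relevance') and perpendicular ('informativity') parts."""
    parallel = sum((u @ b) * b for b in basis) if basis else np.zeros_like(u)
    return float(np.linalg.norm(parallel)), float(np.linalg.norm(u - parallel))

def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: one basis vector per utterance with new content."""
    basis = []
    for v in vectors:
        parallel = sum((v @ b) * b for b in basis) if basis else np.zeros_like(v)
        perp = v - parallel
        n = np.linalg.norm(perp)
        if n > tol:
            basis.append(perp / n)
    return basis

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def six_measures(before, test, after):
    """Relevance/informativity of the test utterance against the bases
    built from the windows before and after it, plus a cosine against
    each window aggregated into a single block (sum of its vectors)."""
    b1, b2 = orthonormalize(before), orthonormalize(after)
    rel1, inf1 = project(b1, test)
    rel2, inf2 = project(b2, test)
    cos1 = cosine(test, np.sum(before, axis=0))
    cos2 = cosine(test, np.sum(after, axis=0))
    return rel1, inf1, rel2, inf2, cos1, cos2
```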
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-108",
"text": "Each tutoring session consists of the same 10 problems, discussed between one of a set of 4 tutors and one of 18 subjects."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-109",
"text": "The redundancy provides a variety of speaking and interaction styles on the same topic."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-110",
"text": "Tutor: A clown is riding a unicycle in a straight line."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-111",
"text": "She accidentally drops an egg beside her as she continues to move with constant velocity."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-112",
"text": "Where will the egg land relative to the point where the unicycle touches the ground? Explain."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-113",
"text": "Student: The egg should land right next to the unicycle."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-114",
"text": "The egg has a constant horizontal velocity."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-115",
"text": "The vertical velocity changes and decreases as gravity pulls the egg downward at a rate of 9.8m/s^2."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-116",
"text": "The egg should therefore land right next to the unicycle."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-117",
"text": "Tutor: Good! There is only one thing I would like to know. What can you say about the horizontal velocity of the egg compared to the horizontal velocity of the clown? Student: Aren't they the same?"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-118",
"text": "All of the 10 problems are designed to require application of Newton's Laws to be solved, and therefore conversations share many terms such as force, velocity, acceleration, gravity, etc."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-120",
"text": "**RESULTS**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-121",
"text": "For each method, the development set was first used to establish the parameters such as text unit size and classification criterion."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-122",
"text": "The methods, tuned to these parameters, were then applied to the testing data."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-124",
"text": "**FOLTZ ET AL. (1998)**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-125",
"text": "In order to replicate Foltz et al.'s results, a text unit size and window size needed to be chosen."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-126",
"text": "The utterance was chosen as the text unit size, which included single word utterances, full sentences, and multi-sentence utterances."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-127",
"text": "To determine the most appropriate window size, results from all sizes between 1 and 16 (the average number of utterances between topic shifts) were gathered."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-128",
"text": "The greatest difference between the means for utterances that introduce a topic shift versus non-shift utterances occurs when the window contains four utterances."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-129",
"text": "The standard deviation is uniformly low for windows containing more than two utterances and therefore can be disregarded in choosing a window size."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-130",
"text": "The optimal cosine threshold for classification was found using logistic regression (Garson, 2003) which establishes a relationship between the cosine threshold and the log odds of classification."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-131",
"text": "The optimal cutoff was found to be shift odds = .17 with associated F-measure of .49."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-132",
"text": "The logistic equation of best fit is: ln(shift odds) = 1.887 - 13.345 \u22c5 cosine. The F-measure of .49 is 48% higher than the F-measure reported by Foltz et al. (1998) for segmenting monologue."
},
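Reading the fitted model as ln(shift odds) = 1.887 - 13.345 * cosine, with the reported optimal odds cutoff of .17, the resulting classifier can be sketched as (a hedged illustration, not the original software):

```python
import math

def shift_odds(cos):
    """Fitted logistic model: ln(shift odds) = 1.887 - 13.345 * cosine,
    so lower window cohesion (cosine) means higher odds of a topic shift."""
    return math.exp(1.887 - 13.345 * cos)

def is_topic_shift(cos, odds_cutoff=0.17):
    """Classify as a topic shift when the predicted odds exceed the
    reported optimal cutoff of .17."""
    return shift_odds(cos) > odds_cutoff
```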
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-133",
"text": "On the testing corpus the F-measure is .52, which demonstrates good generalization for the logistic equation given."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-134",
"text": "Compared to the F-measure of .33 reported by Foltz et al. (1998), the current result is 58% higher."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-136",
"text": "**HEARST (1994, 1997)**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-137",
"text": "The JTextTile software was used to implement Hearst (1994) on dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-138",
"text": "As with Foltz et al. (1998) , a text unit and window size had to be determined for dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-139",
"text": "Hearst (1994) recommends using the average paragraph size as the window size."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-140",
"text": "Using the development corpus's average topic length of 16 utterances as a reference point, F-measures were calculated for the combinations of window size and text unit size in Table 1 ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-141",
"text": "The optimal combination of parameters (F-measure = .17) is a unit size of 16 words and a window size of 16 units."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-142",
"text": "This combination matches Hearst's (1994) heuristic of choosing the window size to be the average paragraph length."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-143",
"text": "On the test set, this combination of parameters yielded an F-measure of .14, as opposed to the F-measure of .70 reported by Hearst (1997) for monologue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-144",
"text": "For dialogue, the algorithm is 20% as effective as it is for monologue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-145",
"text": "It is unclear, however, exactly what part of the algorithm contributes to this poor performance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-146",
"text": "The two most obvious possibilities are the segmentation criterion, i.e. depth scores, or the standard vector space method."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-147",
"text": "To further explore these possibilities, the Hearst method was augmented with LSA."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-148",
"text": "Again, the unit size and window size had to be calculated."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-149",
"text": "As with Foltz, the unit size was taken to be the utterance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-150",
"text": "The window size was determined by computing F-measures on the development corpus for all sizes between 1 and 16."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-151",
"text": "The optimal window size is 9, F-measure = .22."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-152",
"text": "Given the smaller number of test cases, 22, this F-measure of .22 is not significantly different from .17."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-153",
"text": "However, the Foltz method is significantly higher than both of these, p < .10."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-155",
"text": "**ORTHONORMAL BASIS**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-156",
"text": "The text unit used in the orthonormal basis is the single utterance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-157",
"text": "The optimal window size, i.e. the orthonormal basis size, was determined by creating a logistic regression to calculate the maximum F-measure for several orthonormal basis sizes."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-158",
"text": "The findings of this procedure are listed in Table 2 ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-159",
"text": "F-measure monotonically increases until the orthonormal basis holds six elements and remains relatively steady for larger basis sizes."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-160",
"text": "Since F-measure does not increase much over .72 for greater orthonormal basis sizes, 6 was chosen as the most computationally efficient size for the strength of the effect."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-161",
"text": "The logistic equation of best fit combines these six measures, where an index of 1 indicates a measure on the window preceding the utterance, and an index of 2 indicates a measure on the window following the utterance."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-162",
"text": "In the regression, the cosine between the utterance and the preceding window was not significant, p = .86."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-163",
"text": "This finding reflects the intuition that the cosine to the following window varies according to whether the following window is on a new topic, whereas the cosine to the preceding window is always high."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-164",
"text": "Additionally, measures of \"relevance\" and \"informativity\" correspond to vector length; all other measures did not contribute significantly to the model and so were not included."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-165",
"text": "The sign of the metrics illuminates their role in the model."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-166",
"text": "The negative sign on the coefficients for relevance_1, informativity_1, and relevance_2 indicates that they are inversely correlated with an utterance signaling the start of a new topic."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-167",
"text": "The only surprising feature is that informativity_1 is negatively correlated instead of positively correlated: one would expect a topic shift to introduce new information."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-168",
"text": "There is possibly some edge effect here, since the last move of a topic is often a summarizing move that shares many of the physics terms present in the introduction of a new topic."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-169",
"text": "On the other hand, the positive sign on cosine_2 and informativity_2 indicates that the start of a new topic should have elements in common with the following material and add new information to that material, as an overview would."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-170",
"text": "Beyond the sign, the exponentials of these values indicate how the two basis metrics are weighted."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-171",
"text": "For example, when informativity_2 is raised by one unit, a topic shift is 16 times more likely."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-172",
"text": "On the testing corpus the F-measure of the orthonormal basis method is .67, which is significantly different from the performance of all three methods mentioned above, p < .05."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-174",
"text": "**DISCUSSION**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-175",
"text": "The relative ranking of these results is not altogether surprising given the relationships between inferencing and LSA and between inferencing and dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-176",
"text": "Foltz et al. (1998) found that LSA makes simple bridging inferences in addition to detecting lexical cohesion."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-177",
"text": "These bridging inferences are a kind of collocational cohesion (Halliday and Hassan, 1976) whereby words that cooccur in similar contexts become highly related in the LSA space."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-178",
"text": "Therefore in applications where this kind of inferencing is required, one might expect an LSA based method to excel."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-179",
"text": "Similarly to van Dijk and Kintsch's model of comprehension (van Dijk and Kintsch, 1983) , dialogue can require inferences to maintain coherence."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-180",
"text": "According to Grice's Co-operative Principle, utterances lacking semantic coherence flout the Maxim of Relevance and license an inference (Grice, 1975) : S1: Let's go dancing."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-181",
"text": "S2: I have an exam tomorrow."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-182",
"text": "The \"inference\" in the sense of Foltz, Kintsch, and Landauer (1998) would be represented by a high cosine between these utterances, even though they don't share any of the same words."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-183",
"text": "Dialogue generally tends to be less lexically cohesive and require more inferencing than expository mono- logue, so one might predict that LSA would excel in dialogue applications."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-184",
"text": "However, LSA has a weakness: the cosine measure between two vectors does not change monotonically as new word vectors are added to either of the two vectors."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-185",
"text": "Accordingly, the addition of a word vector can cause the cosine between two text units to dramatically increase or decrease."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-186",
"text": "Therefore the distinctive properties of individual words can be lost with the addition of more words to a text unit."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-187",
"text": "This problem can be addressed by using an orthonormal basis (Hu et al., 2003) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-188",
"text": "By using a basis, each utterance is kept independent, so \"inferencing\" can extend over both the entire set of utterances and the linear combination of any of its subsets."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-189",
"text": "Accordingly, when \"inferencing\" over the entire text unit is required, one would expect a basis method using LSA vectors to outperform a standard LSA method."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-190",
"text": "This expectation has been put to the test recently by Olney & Cai (2005) , who find that an orthonormal basis can significantly predict entailment on test data supplied by the PASCAL Textual Entailment Challenge (PASCAL, 2004) ."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-191",
"text": "Beyond relative performance rankings, more support for the above reasoning can be found in the difference between Hearst and Hearst + LSA."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-192",
"text": "Recall that in monologue, Hearst (1997) reports a much larger F-measure than Foltz et al. (1998) , .70 vs. .33, albeit on different data sets."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-193",
"text": "In the present dialogue corpus, these roles are reversed, .14 vs. .52."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-194",
"text": "Possible reasons for this reversal are the segmentation criterion, the vector space method, or the fact that Foltz has been trained on similar data via regression and Hearst has not."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-195",
"text": "However, comparing the Hearst algorithm with the Hearst + LSA algorithm indicates that a 57% improvement stems from the addition of LSA, keeping all other factors constant."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-196",
"text": "While this result is not statistically significant, the direction of the result supports the use of an \"inferencing\" vector space method for segmenting dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-197",
"text": "Unfortunately, the large difference in F-measure between the Foltz algorithm and the Hearst + LSA algorithm is more difficult to explain."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-198",
"text": "These two methods differ by their segmentation criterion and by their training (Foltz is a regression model and Hearst is not)."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-199",
"text": "It may be that Hearst's (1994, 1997) segmentation criterion, i.e. depth scores, does not translate well to dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-200",
"text": "Perhaps the assignment of segment boundaries based on the relative difference between a candidate score and its surrounding peaks is highly sensitive to cohesion gaps created by conversational implicatures."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-201",
"text": "On the other hand the differences between these two methods may be entirely attributable to the amount of training they received."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-202",
"text": "One way to separate the contributions of the segmentation criterion and training would be to create a logistic model using the Hearst + LSA method and to compare this to Foltz."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-203",
"text": "The increased effectiveness of the orthonormal basis method over the Foltz algorithm can also be explained in terms of \"inferencing\"."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-204",
"text": "Since \"inferencing\" is overwhelmed by lexical cohesion, increasing the window size for the Foltz algorithm deteriorates performance for window sizes greater than 4."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-205",
"text": "In contrast, the orthonormal basis method becomes most effective as the orthonormal basis size increases past 4."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-206",
"text": "This dichotomy illustrates that the Foltz algorithm is not complementary to an \"inferencing\" approach in general."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-207",
"text": "Use of an orthonormal basis, on the other hand, increases sensitivity to collocational cohesion without sacrificing lexical cohesion."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-208",
"text": "----------------------------------"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-209",
"text": "**CONCLUSION**"
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-210",
"text": "This study explored the segmentation of tutorial dialogue using techniques that have previously been applied to expository monologue and using a new orthonormal basis technique."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-211",
"text": "The techniques previously applied to monologue reversed their roles of effectiveness when applied to dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-212",
"text": "This role reversal suggests the predominance of collocational cohesion, requiring \"inferencing\", present in this tutorial dialogue."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-213",
"text": "The orthonormal basis method, which we suggest has an increased capacity for \"inferencing\", outperformed both of the techniques previously applied to monologue, and demonstrates that segmentation of these tutorial dialogues most benefits from a method sensitive to lexical and collocational cohesion over large text units."
},
{
"sent_id": "74568758fe5fef3727d94e7597f305-C001-214",
"text": "Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DoD, ONR, or NSF."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"74568758fe5fef3727d94e7597f305-C001-10",
"74568758fe5fef3727d94e7597f305-C001-11"
],
[
"74568758fe5fef3727d94e7597f305-C001-35"
],
[
"74568758fe5fef3727d94e7597f305-C001-37"
],
[
"74568758fe5fef3727d94e7597f305-C001-38"
],
[
"74568758fe5fef3727d94e7597f305-C001-47"
],
[
"74568758fe5fef3727d94e7597f305-C001-51"
],
[
"74568758fe5fef3727d94e7597f305-C001-52"
],
[
"74568758fe5fef3727d94e7597f305-C001-192"
],
[
"74568758fe5fef3727d94e7597f305-C001-199"
]
],
"cite_sentences": [
"74568758fe5fef3727d94e7597f305-C001-11",
"74568758fe5fef3727d94e7597f305-C001-35",
"74568758fe5fef3727d94e7597f305-C001-37",
"74568758fe5fef3727d94e7597f305-C001-38",
"74568758fe5fef3727d94e7597f305-C001-47",
"74568758fe5fef3727d94e7597f305-C001-51",
"74568758fe5fef3727d94e7597f305-C001-52",
"74568758fe5fef3727d94e7597f305-C001-192",
"74568758fe5fef3727d94e7597f305-C001-199"
]
},
"@USE@": {
"gold_contexts": [
[
"74568758fe5fef3727d94e7597f305-C001-23"
],
[
"74568758fe5fef3727d94e7597f305-C001-100",
"74568758fe5fef3727d94e7597f305-C001-101",
"74568758fe5fef3727d94e7597f305-C001-102"
]
],
"cite_sentences": [
"74568758fe5fef3727d94e7597f305-C001-23",
"74568758fe5fef3727d94e7597f305-C001-101",
"74568758fe5fef3727d94e7597f305-C001-102"
]
},
"@DIF@": {
"gold_contexts": [
[
"74568758fe5fef3727d94e7597f305-C001-47"
],
[
"74568758fe5fef3727d94e7597f305-C001-140",
"74568758fe5fef3727d94e7597f305-C001-143"
],
[
"74568758fe5fef3727d94e7597f305-C001-192",
"74568758fe5fef3727d94e7597f305-C001-193"
]
],
"cite_sentences": [
"74568758fe5fef3727d94e7597f305-C001-47",
"74568758fe5fef3727d94e7597f305-C001-143",
"74568758fe5fef3727d94e7597f305-C001-192"
]
}
}
},
"ABC_e3dd013c944cf8dcb6ff90124a0e01_8": {
"x": [
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-145",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-2",
"text": "In this work, we present an empirical study of generation order for machine translation."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-3",
"text": "Building on recent advances in insertion-based modeling, we first introduce a soft orderreward framework that enables us to train models to follow arbitrary oracle generation policies."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-4",
"text": "We then make use of this framework to explore a large variety of generation orders, including uninformed orders, location-based orders, frequency-based orders, content-based orders, and model-based orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-5",
"text": "Curiously, we find that for the WMT'14 English \u2192 German translation task, order does not have a substantial impact on output quality, with unintuitive orderings such as alphabetical and shortestfirst matching the performance of a standard Transformer."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-6",
"text": "This demonstrates that traditional left-to-right generation is not strictly necessary to achieve high performance."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-7",
"text": "On the other hand, results on the WMT'18 English \u2192 Chinese task tend to vary more widely, suggesting that translation for less well-aligned language pairs may be more sensitive to generation order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-10",
"text": "Neural sequence models (Sutskever et al., 2014; Cho et al., 2014) have been successfully applied to a broad range of tasks in recent years."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-11",
"text": "While these models typically generate their outputs using a fixed left-to-right order, there has also been some investigation into non-left-to-right and order-independent generation in pursuit of quality or speed."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-12",
"text": "For example, explored the problem of predicting sets using sequence models."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-13",
"text": "While this is a domain where generation order should intuitively be unimportant, they nevertheless found it to make a substantial difference in practice."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-14",
"text": "Ford et al. (2018) explored treating language modeling as a two-pass * Equal contribution."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-15",
"text": "process, where words from certain classes are generated first, and the remaining words are filled in during the second pass."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-16",
"text": "They found that generating function words first followed by content words second yielded the best results."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-17",
"text": "Separately, Gu et al. (2018) and Lee et al. (2018) developed non-autoregressive approaches to machine translation where the entire output can be generated in parallel in constant time."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-18",
"text": "These models do away with order selection altogether but typically lag behind their autoregressive counterparts in translation quality."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-19",
"text": "More recently, a number of novel insertionbased architectures have been developed for sequence generation (Gu et al., 2019; Stern et al., 2019; Welleck et al., 2019) ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-20",
"text": "These frameworks license a diverse set of generation orders, including uniform (Welleck et al., 2019) , random (Gu et al., 2019) , or balanced binary trees (Stern et al., 2019) ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-21",
"text": "Some of them also match the quality of state-of-the-art left-to-right models (Stern et al., 2019) ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-118",
"text": "Repeated tokens are handled via greedy left or right alignment to the true output."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-22",
"text": "In this paper, we utilize one such framework to explore an extensive collection of generation orders, evaluating them on the WMT'14 English-German and WMT'18 English-Chinese translation tasks."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-23",
"text": "We find that a number of nonstandard choices achieve BLEU scores comparable to those obtained with the classical approach, suggesting that left-to-right generation might not be a necessary ingredient for high-quality translation."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-24",
"text": "Our contributions are as follows: \u2022 We introduce a general soft order-reward framework that can be used to teach insertion-based models to follow any specified ordering."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-25",
"text": "\u2022 We perform a thorough empirical study of various orders, including: uniform, random, left-to-right, right-to-left, common-first, rarefirst, shortest-first, longest-first, alphabetical, and model-adaptive."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-26",
"text": "\u2022 On the WMT 2014 English \u2192 German task, we show that there is surprisingly little variation in BLEU for different generation orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-27",
"text": "We further find that many orders are able to match the performance of a standard base Transformer."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-29",
"text": "**BACKGROUND**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-30",
"text": "Neural sequence models have traditionally been designed with left-to-right prediction in mind."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-31",
"text": "In the classical setting, output sequences are produced by repeatedly appending tokens to the rightmost end of the hypothesis until an endof-sequence token is generated."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-32",
"text": "Though highperforming across a wide range of application areas, this approach lacks the flexibility to accommodate other types of inference such as parallel generation, constrained decoding, infilling, etc."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-33",
"text": "Moreover, it also leaves open the possibility that a non-left-to-right factorization of the joint distribution over output sequences could outperform the usual monotonic ordering."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-34",
"text": "To address these concerns, several recent approaches have been proposed for insertion-based sequence modeling, in which sequences are con-structed by repeatedly inserting tokens at arbitrary locations in the output rather than only at the right-most position."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-35",
"text": "We use one such insertion-based model, the Insertion Transformer (Stern et al., 2019) , for our empirical study."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-36",
"text": "We give a brief overview of the model in this section before moving on to the details of our investigation."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-38",
"text": "**INSERTION TRANSFORMER**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-39",
"text": "The Insertion Transformer (Stern et al., 2019 ) is a sequence-to-sequence model in which the output is formed by successively inserting one or more tokens at arbitrary locations into a partial hypothesis."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-40",
"text": "This type of generation is made possible through the use of a joint distribution over tokens and slots."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-41",
"text": "More formally, given an input x and a partial output\u0177 t at time t, the Insertion Transformer gives the joint distribution"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-42",
"text": "where c \u2208 V is the content being selected from the vocabulary V and 0 \u2264 l \u2264 |\u0177 t | is the insertion location."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-43",
"text": "As its name suggests, the Insertion Transformer extends the Transformer model (Vaswani et al., 2017) with a few key modifications to generalize from ordinary next-token modeling to joint tokenand-slot modeling."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-44",
"text": "First, the Insertion Transformer removes the causal attention mask from the decoder, allowing for fully contextualized output representations to be derived after each insertion."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-45",
"text": "Second, the Insertion Transformer pads the lengthn decoder input on both ends so that n + 2 output vectors are produced."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-46",
"text": "It then concatenates adjacent pairs of output vectors to obtain n + 1 slot representations, which in turn inform the conditional distributions over tokens within each slot, p(c | l)."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-47",
"text": "Lastly, it performs an additional attention step over the slot representations to obtain a location distribution p(l), which is multiplied with the conditional content distributions to obtain the full joint distribution: p(c, l) = p(c | l)p(l)."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-48",
"text": "A schematic of the architecture is given in Figure 1 for reference."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-49",
"text": "We note that Stern et al. (2019) also experimented with a number of other architectural variants, but we use the baseline version of the model described above in our experiments for simplicity."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-51",
"text": "**DECODING**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-52",
"text": "Once the model has been trained, it can be used for greedy autoregressive sequence generation as follows."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-53",
"text": "At each step of decoding, we compute the joint argmax"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-54",
"text": "to determine what content\u0109 t should be inserted at which locationl t ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-55",
"text": "We then apply this insertion, increasing the sequence length by one, and repeat this process until an end-of-sequence token is produced."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-56",
"text": "This is the serial decoding procedure shown in the left half of Figure 2 ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-57",
"text": "The model can also be used for parallel partially-autoregressive decoding."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-58",
"text": "Instead of computing the joint argmax across all locations, we instead compute the best content for each location:"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-59",
"text": "We then insert the highest-scoring tokens in parallel for all slots that are not yet finished, increasing the sequence length by anywhere between one and n + 1 tokens."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-60",
"text": "This strategy visualized in the right half of Figure 2 ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-61",
"text": "The rank terms are computed with respect to the set of words from the valid action set A * ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-63",
"text": "**SOFT ORDER-REWARD FRAMEWORK**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-64",
"text": "Having presented our model of interest, we now describe a general soft order-reward framework that can be used to train the model to follow any oracle ordering for sequence generation."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-65",
"text": "Let O(a) be an order function mapping insertion actions to real numbers, where lower values correspond to better actions, and let p(a) be the probability assigned by the model to action a. From these, we construct a reward function R(a), an oracle policy q oracle , and a per-slot loss L:"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-66",
"text": "Here, A * is the set of all valid actions."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-67",
"text": "The temperature \u03c4 \u2208 (0, \u221e) controls the sharpness of the distribution, where \u03c4 \u2192 0 results in a one-hot distribution with all mass on the best-scoring action under the order function O(a), and \u03c4 \u2192 \u221e results in a uniform distribution over all valid actions."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-68",
"text": "Intermediate values of \u03c4 result in distributions which are biased towards better-scoring actions but allow for other valid actions to be taken some of the time."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-69",
"text": "Having defined the target distribution, we take the slot loss L for insertions within a particular slot to be the KL-divergence between the oracle distribution q oracle and the model distribution p. Substituting L in for the slot loss within the training framework of Stern et al. (2019) then gives the full sequence generation loss, which we can use to train an Insertion Transformer under any oracle policy rather than just the specific one they propose."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-70",
"text": "We describe a wide variety of generation orders which can be characterized by different order functions O(a) in the subsections that follow."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-71",
"text": "A summary is given in Table 1 ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-73",
"text": "**UNINFORMED ORDERS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-74",
"text": "We evaluate two uninformed orders, uniform and random."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-75",
"text": "The uniform order O(a) = 0 gives equal reward or equivalently probability mass to any valid action."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-76",
"text": "Consequently, this means we give each order a uniform probability treatment."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-77",
"text": "We also experiment with random order O(a) = rank(hash(w)), wherein we hash each word and use the sorted hash ID as the generation order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-78",
"text": "The random order forces the model to follow a specific random path, whereas the uniform order gives equal probability mass to any order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-80",
"text": "**LOCATION-BASED ORDERS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-81",
"text": "We explore two types of location-based orders, balanced binary tree and monotonic orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-82",
"text": "The balanced binary tree order O(a) = |s \u2212 (i + j)/2| encourages the model to place most of its probability mass towards the middle tokens in a missing span."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-83",
"text": "Consequently, this encourages the model to generate text in a balanced binary tree order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-84",
"text": "We also experiment with soft monotonic orders O(a) = \u00b1s, or soft left-to-right and soft right-to-left, which differ slightly from the left-toright teacher forcing traditionally used in seq2seq."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-85",
"text": "First, we still maintain a uniform roll-in policy (see Section 3.6), which increases diversity during training and helps avoid label bias."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-86",
"text": "Additionally, this endows the model with the ability to \"look back\" and insert missing tokens in the middle of the sequence during inference, as opposed to always being forced to append only at one end of the sequence."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-87",
"text": "The order reward is also soft (as described by the \u03c4 term above), wherein we do not place all the probability mass on the next monotonic token, but merely encourage it to generate in a monotonic fashion."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-89",
"text": "**FREQUENCY-BASED ORDERS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-90",
"text": "We evaluate two frequency-based orders: rare words first via O(a) = rank(frequency(w)) and common words first via O(a) = \u2212rank(frequency(w))."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-91",
"text": "For these orders, we simply sort the words based on their frequencies and used their rank as the order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-92",
"text": "We note the most frequent words tend to be punctuation and stop words, such as commas, periods, and \"the\" in English."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-94",
"text": "**CONTENT-BASED ORDERS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-95",
"text": "We also explore content-based orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-96",
"text": "One class of orders is based on the word length: O(a) = \u00b1rank(length(w))."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-97",
"text": "This encourages the model to either emit all the shortest words first or all the longest words first."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-98",
"text": "We also explore alphabetical orderings O(a) = \u00b1rank(w), where sorting is based on Unicode order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-99",
"text": "We note that in Unicode, uppercase letters occur before lower case letters."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-100",
"text": "This biases the model to produce words which are capitalized first (or last), typically corresponding to nouns in German."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-101",
"text": "Additionally, for Chinese, the characters are roughly sorted by radical and stroke count, which bears a loose relation to the complexity and frequency of the character."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-103",
"text": "**MODEL-BASED ORDERS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-104",
"text": "The orders presented thus far are static, meaning they are independent of the model."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-105",
"text": "We also explore orders which are adaptive based on the model's posterior."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-106",
"text": "We also introduce \"easy\" and \"hard\" adaptive orders induced by O(a) = \u00b1 log p(a)."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-107",
"text": "The adaptive orders look at the model's posterior to determine the oracle policy."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-108",
"text": "Consequently the loss is adaptive, as when the model updates after each gradient step, the order adapts to the model's posterior."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-109",
"text": "In the \"easy\" version, we use O(a) = + log p(a), which is similar to a local greedy soft EM loss."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-110",
"text": "We renormalize our current model's posterior over valid actions and optimize towards that distribution."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-111",
"text": "This pushes the model's posterior to what is correct and where it has already placed probability mass."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-112",
"text": "Intuitively, this reinforces the model to select what it thinks are the easiest actions first."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-113",
"text": "Conversely, the \"hard\" variant uses O(a) = \u2212 log p(a) which encourages the model to place probability mass on what it thinks are the hardest valid actions."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-114",
"text": "This is akin to a negative feedback system whose stationary condition is the uniform distribution."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-116",
"text": "**ROLL-IN POLICY**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-117",
"text": "We follow Stern et al. (2019) and use a uniform roll-in policy when sampling partial outputs at training time in which we first select a subset size uniformly at random, then select a random subset of the output of that size."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-119",
"text": "Input: It would of course be a little simpler for the Germans if there were a coherent and standardised European policy, which is currently not the case."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-120",
"text": "Output: Es w\u00e4re f\u00fcr die Deutschen nat\u00fcrlich ein wenig einfacher, wenn es eine koh\u00e4rente und einheitliche europ\u00e4ische Politik g\u00e4be, was derzeit nicht der Fall ist."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-122",
"text": "**PARALLEL DECODE (ALPHABETICAL):**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-123",
"text": "Es w\u00e4re f\u00fcr die Deutschen nat\u00fcrlich ein wenig einfacher , wenn es eine koh\u00e4rent e und einheitliche europ\u00e4ische Politik g\u00e4be , was derzeit nicht der Fall ist ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-130",
"text": "Input: according to the data of National Bureau of Statistics , the fixed asset investment growth , total imports and other data in July have come down ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-131",
"text": "Output: \u6839\u636e\u56fd\u5bb6\u7edf\u8ba1\u5c40\u7684\u6570\u636e\uff0c7 \u6708\u4efd\u7684\u56fa\u5b9a\u8d44\u4ea7\u6295\u8d44\u589e\u957f\u3001\u8fdb\u53e3\u603b\u989d\u548c\u5176\u4ed6\u6570\u636e\u6709\u6240\u4e0b\u964d\u3002 Parallel decode (alphabetical): Figure 3 : Example decodes for models trained to generate tokens in alphabetical (Unicode) order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-132",
"text": "Blue tokens correspond those being inserted at the current time step, and gray tokens correspond to those not yet generated."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-133",
"text": "Note that the desired ordering applies on a per-slot basis rather than a global basis."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-134",
"text": "Input: It will be sung by all the artists at all the three concerts at the same time."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-135",
"text": "Output: Es wird von allen K\u00fcnstlern bei allen drei Konzerten gleichzeitig gesungen."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-137",
"text": "**PARALLEL DECODE (LONGEST-FIRST):**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-138",
"text": "Es wird von allen K\u00fcnstler n bei allen drei Konzert en gleichzeitig ges ungen ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-146",
"text": "For our experiments, we train and evaluate models for each order on two standard machine translation datasets: WMT14 En-De and WMT18 En-Zh."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-147",
"text": "For WMT14 En-De, we follow the standard setup with newstest2013 as our development set and newstest2014 as our test set."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-148",
"text": "For WMT18 En-Zh, we use the official preprocessed data 1 with no additional data normalization or filtering, taking newstest2017 to be our development set and new-stest2018 our test set."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-149",
"text": "En-Zh evaluation is carried 1 http://data.statmt.org/wmt18/translationtask/preprocessed/zh-en/ Input: imagine eating enough peanuts to serve as your dinner ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-151",
"text": "**OUTPUT: \u60f3\u8c61\u4e00\u4e0b\uff0c\u5403\u8db3\u591f\u7684\u82b1\u751f\u4f5c\u4e3a\u4f60\u7684\u665a\u9910\u3002**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-152",
"text": "Parallel decode (common-first): out using sacreBLEU 2 (Post, 2018) ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-153",
"text": "In both cases, we train all models for 1M steps using sequencelevel knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016 ) from a base Transformer (Vaswani et al., 2017) ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-154",
"text": "We perform a sweep over temperatures \u03c4 \u2208 {0.5, 1, 2} and EOS penalties \u2208 {0, 0.5, 1, 1.5, . . . , 8} (Stern et al., 2019)"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-156",
"text": "**ABILITY TO LEARN DIFFERENT ORDERS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-157",
"text": "By and large, we find that the Insertion Transformer is remarkably capable of learning to generate according to whichever order it was trained for."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-158",
"text": "We give example decodes for three different generation orders in Figures 3, 4 , and 5."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-159",
"text": "In the first example, we see that the alphabetical En-De model adheres to the Unicode ordering for Latin characters (punctuation \u2192 uppercase \u2192 lowercase), and that the En-Zh model similarly adheres to the Unicode order for Chinese (punctuation \u2192 CJK characters sorted by radical and stroke count)."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-160",
"text": "In the second example, the longest-first En-De model generates subwords in decreasing order of length as expected."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-161",
"text": "Finally, in the third example, the common-first En-Zh model begins with common particles and punctuation before generating the main content words."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-162",
"text": "We give a quantitative measurement of the success of each model in Table 2 , computing the percentage of insertions across the development set that adhered to the best-scoring action under the desired ordering."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-163",
"text": "Most models exhibit similar trends, with the majority of En-De models achieving accuracies in excess of 90% when a low temperature is used, and with corresponding results in the mid-to-upper 80% range for En-Zh."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-164",
"text": "Even the random order based on token hashes has accuracies exceeding 80% for both languages, demonstrating that the model has a strong capacity to adapt to any oracle policy."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-166",
"text": "**TEST RESULTS**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-167",
"text": "Next, we measure the quality of our models by evaluating their performance on their respective test sets."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-168",
"text": "The BLEU scores are reported in Table 3 ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-169",
"text": "The uniform loss proposed by Stern et al. (2019) serves as a strong baseline for both language pairs, coming within 0.6 points of the original Transformer for En-De at 26.72 BLEU, and attaining a respectable score of 33.1 BLEU on En-Zh."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-170",
"text": "We note that there is a slightly larger gap between the normal Transformer and the Insertion Transformer for the latter of 2.7 points, which we hypothesize is a result of the larger discrepancy between word orders in the two languages combined with the more difficult nature of the Insertion Transformer training objective."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-171",
"text": "Most of the content-based orderings (frequency-based, length-based, alphabetical) perform comparably to the uniform loss, and even the random order is not far behind."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-172",
"text": "The adaptive orders perform similarly well, with easy-first attaining one of the highest scores on En-De."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-173",
"text": "Curiously, in our model adaptive easy-order, we were unable to identify any strong patterns in the generation order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-174",
"text": "The model did have a slight preference towards functional words (i.e., \",\" and \"der\"), but the preference was weak."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-175",
"text": "As for location-based losses, the binary tree loss is notable in that it achieves the highest score across all losses for both languages."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-176",
"text": "On the other hand, we note that while the soft left-to-right and right-to-left losses perform substantially better than the hard loss employed in the original work by Stern et al. (2019) , performance does suffer when using parallel decoding for those models, which is generally untrue of the other orderings."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-177",
"text": "We believe this is due in part to exposure bias issues arising from the monotonic ordering as compared with the uniform roll-in policy that are not shared by the other losses."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-179",
"text": "**PERFORMANCE VS. SENTENCE LENGTH**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-180",
"text": "For additional analysis, we consider how well our models perform relative to one another conditional on the length of the source sentence."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-181",
"text": "Sentence length can be seen as a rough proxy measurement of the difficulty of translating a sentence."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-182",
"text": "This is to determine if whether some order variations are able to achieve improved BLEU scores over other models depending on the source sentence's length."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-183",
"text": "For each sentence in the En-De and En-Zh development sets, we compute their lengths and bin them into groups of size 5, up to a maximum length of 50."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-184",
"text": "Within each bin, we compute sentence-level BLEU and take the mean score across all sentences."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-185",
"text": "This is done for each of our model variants."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-186",
"text": "Figure 6 illustrates the results of this experiment."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-187",
"text": "We observe a surprisingly small model variance across all bin lengths."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-188",
"text": "This suggests that sentences that are difficult to translate are difficult across all orderings, and no particular ordering appears strictly better or worse than others."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-189",
"text": "One small exception to this is a performance fall-off of hard-first orderings for very long sentences across both datasets."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-190",
"text": "We also observe a different distribution of BLEU scores across bin lengths for En-De and En-Zh."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-191",
"text": "In particular, En-De models are approximately monotonic-decreasing in performance as source length increases, while on En-Zh performance is roughly flat across sentence length."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-192",
"text": "This also highlights the importance of taking additional diverse language pairs into consideration, as translation properties on one language pair may not be observed in others."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-193",
"text": "Ultimately, given the similarity of the development scores across sentence lengths and the test scores for the various models, we come to the surprising conclusion that for single-sentence English-German translation, generation order is relatively unimportant."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-194",
"text": "However, for English-Chinese, it is unclear and we leave further analysis to future work."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-195",
"text": "Under the Insertion Transformer framework, it appears order also does not matter much, however there is a 2.7 BLEU gap between the results in the Insertion Transformer and our Transformer baseline."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-197",
"text": "**RELATED WORK**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-198",
"text": "In recent work, several insertion-based frameworks have been proposed for the genera- tion of sequences in a non-left-to-right fashion for machine translation (Stern et al., 2019; Welleck et al., 2019; Gu et al., 2019) ."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-199",
"text": "Stern et al. (2019) introduced the Insertion Transformer and explored uniform and balanced binary tree orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-200",
"text": "We built upon and generalized this approach in order to explore a much broader set of orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-201",
"text": "Welleck et al. (2019) explored insertions using a binary-tree formulation."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-202",
"text": "They also explored uniform and model-based orders, but found them to lag significantly behind their left-to-right baselines."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-203",
"text": "Additionally, despite using a binary-tree formulation for generation, they did not explore treebased orders."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-204",
"text": "Gu et al. (2019) introduced a model which did not explicitly represent the output canvas arising from insertions, but rather used an implicit representation through conditioning on the insertion sequence."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-205",
"text": "They also performed an exploration of different generation orders, including random, odd-even, common-first, rare-first, and a search-adaptive order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-206",
"text": "Their search-adaptive order can be seen as a global version of our local model adaptive order, where we use the local greedy posterior as the reward function, and they use the sequence level log-probability as the reward function."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-207",
"text": "Curiously, in their framework, the random order fell significantly behind the leftto-right baseline, while they showed small gains in their search adaptive order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-208",
"text": "One key differ-ence between our work and Welleck et al. (2019) and Gu et al. (2019) is that we use a soft orderreward framework as opposed to teacher forcing."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-209",
"text": "This might explain some of the performance differences, as our framework allows for a more flexible training objective."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-210",
"text": "Additionally, since we use a uniform roll-in policy, our models may have less of a label bias problem, as they are trained to be able to continue from any partial output rather than just those arising from the target policy."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-211",
"text": "----------------------------------"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-212",
"text": "**CONCLUSION**"
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-213",
"text": "In this work, we investigated a broad array of generation orders for machine translation using an insertion-based sequence generation model, the Insertion Transformer."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-214",
"text": "We found that regardless of the type of strategy selected, be it locationbased, frequency-based, length-based, alphabetical, model-based, or even random, the Insertion Transformer is able to learn it with high fidelity and produce high-quality output in the selected order."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-215",
"text": "This is especially true for English-German single sentence translation, in which we by and large found order to not matter."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-216",
"text": "This opens a wide range of possibilities for generation tasks where monotonic orderings are not the most natural choice, and we would be excited to explore some of these areas in future work."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-217",
"text": "Table 4 : Development BLEU results for WMT14 En-De newstest2013 and WMT18 En-Zh newstest2017."
},
{
"sent_id": "e3dd013c944cf8dcb6ff90124a0e01-C001-218",
"text": "The first number in each column is the result obtained without an EOS penalty, while the second number in parentheses is the score obtained with the best EOS penalty for that setting."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-19"
]
],
"cite_sentences": [
"e3dd013c944cf8dcb6ff90124a0e01-C001-19"
]
},
"@SIM@": {
"gold_contexts": [
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-21"
]
],
"cite_sentences": [
"e3dd013c944cf8dcb6ff90124a0e01-C001-21"
]
},
"@USE@": {
"gold_contexts": [
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-35"
],
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-49"
],
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-69"
],
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-117"
],
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-154"
],
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-169"
]
],
"cite_sentences": [
"e3dd013c944cf8dcb6ff90124a0e01-C001-35",
"e3dd013c944cf8dcb6ff90124a0e01-C001-49",
"e3dd013c944cf8dcb6ff90124a0e01-C001-69",
"e3dd013c944cf8dcb6ff90124a0e01-C001-117",
"e3dd013c944cf8dcb6ff90124a0e01-C001-154",
"e3dd013c944cf8dcb6ff90124a0e01-C001-169"
]
},
"@DIF@": {
"gold_contexts": [
[
"e3dd013c944cf8dcb6ff90124a0e01-C001-176"
]
],
"cite_sentences": [
"e3dd013c944cf8dcb6ff90124a0e01-C001-176"
]
}
}
},
"ABC_59b6eaca400342159b867d018d4042_8": {
"x": [
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-2",
"text": "We propose a framework to improve performance of distantly-supervised relation extraction, by jointly learning to solve two related tasks: concept-instance extraction and relation extraction."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-3",
"text": "We combine this with a novel use of document structure: in some small, well-structured corpora, sections can be identified that correspond to relation arguments, and distantly-labeled examples from such sections tend to have good precision."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-4",
"text": "Using these as seeds we extract additional relation examples by applying label propagation on a graph composed of noisy examples extracted from a large unstructured testing corpus."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-5",
"text": "Combined with the soft constraint that concept examples should have the same type as the second argument of the relation, we get significant improvements over several state-of-theart approaches to distantly-supervised relation extraction."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-8",
"text": "In distantly-supervised information extraction (IE), a knowledge base (KB) of relation or concept instances is used to train an IE system."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-9",
"text": "For example, a set of facts like sideEffect(meloxicam, stomachBleeding), interacts-With(meloxicam, ibuprofen), etc are matched against a corpus, and the matching sentences are then used to generate training data consisting of labeled relation mentions."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-10",
"text": "Distant supervision is less expensive to obtain than directly supervised labels, but produces noisy training data whenever matching errors occur."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-11",
"text": "Hence distant supervision is often coupled with learning methods that allow for this sort of noise, e.g., by introducing latent variables for each entity mention (Hoffmann et al., 2011; Riedel et al., 2010; Surdeanu et al., 2012) ; by carefully selecting the entity mentions from contexts likely to include specific KB facts (Wu and Weld, 2010) ; or by careful filtering of the KB strings used as seeds (Movshovitz-Attias and Cohen, 2012) ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-12",
"text": "Another recently-introduced approach to reducing the noise in distant supervision is to combine distant labeling with label propagation (LP) (Bing et al., 2015; Bing et al., 2016) ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-13",
"text": "Label propagation is a family of graph-based semi-supervised learning (SSL) methods in which instances that are \"nearby\" in the graph are encouraged to have similar labels."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-14",
"text": "Depending on the LP method used, agreement with seed labels can be imposed as a hard constraint (Zhu et al., 2003) or a soft constraint (Lin and Cohen, 2010; Talukdar and Cohen, 2014) ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-15",
"text": "When seed-label agreement is a soft constraint, then LP can be viewed as a way of smoothing the seed labels, so that labels for groups of \"similar\" instances (i.e., instances nearby in the graph) are upweighted if they agree, and downweighted if they disagree."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-16",
"text": "In combining distant supervision with LP, one must build a graph that connects instances that are likely to have the same label."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-17",
"text": "Previously, systems have constructed graphs which connect mentions appearing in the same coordinate-list structuree.g., the underlined noun phrases in \"Get medical help if you experience chest pain, weakness, or shortness of breath\" (Bing et al., 2015) ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-18",
"text": "This approach was shown to improve performance in recognizing instances of certain medical noun-phrase (NP) categories, such as drug names and disease names."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-19",
"text": "An extension of this approach (Bing et al., 2016) learned to classify NP pairs as relations, using a more complex graph structure."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-20",
"text": "Figure 1 : A structured document in WebMD describing the drug meloxicam."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-21",
"text": "All documents in this corpora have the same seven sections."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-22",
"text": "This paper presents three new contributions extending this line of work."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-23",
"text": "First, we combine the concept-instance extraction and relation-extraction tasks, in the process greatly simplifying the relationextraction LP step."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-24",
"text": "The combination of the tasks is simple but effective."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-25",
"text": "In (Bing et al., 2016) , relation extraction was performed on an \"entity centric\" corpus, where each document is primarily concerned with a particular \"title entity\", and the first argument of each relation is always the title entity: hence relation extraction can be viewed as classification, where an entity mention is labeled with its slot filling role, i.e., its relation to the title entity."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-26",
"text": "The intuition behind combining concept extraction and relation extraction is that relation arguments are often constrained to be of a particular type; for example, the sideEffect of a drug is necessarily of the type symptom."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-27",
"text": "The second contribution is a novel use of document structure; in particular, we exploit the fact that in some small, well-structured corpora, sections can be identified that correspond fairly accurately to relation arguments."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-28",
"text": "Figure 1 shows a document from such a structured corpus (discussed below) which contains sections labeled \"Side Effects\"."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-29",
"text": "If \"nausea\" is distantly labeled as a sideEffect of meloxicam in this well-structured document, it is very likely to be a correct mention for the sideEffect relation."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-30",
"text": "Used naively, extending a corpus with a small well-structured one needs not to lead to improvements, but when combined with LP, we show a consistent and sometimes substantial improvement in performance."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-31",
"text": "We thus illustrate a novel and effective way to make use of a small wellstructured corpus, a commonly available resource that is intermediate in structure between a KB and an ordinary text corpus."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-32",
"text": "The third contribution is experimental."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-33",
"text": "We perform extensive experiments comparing this approach to state-of-the-art distant labeling methods based on latent variables, and show substantial improvements in two domains: the relative improvements under F1 measure are from 72% to 110% on one domain, and 22% to 30% on a second domain."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-34",
"text": "Below we present our method, in outline and then in detail; present experimental results; discuss related work; and finally conclude."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-35",
"text": "2 DIEJOB: Distant IE by JOint Bootstrapping 2.1 Overview DIEJOB, our system for distantly-supervised relation extraction, is shown in Figure 2 ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-36",
"text": "We consider a common case, in which most information is found in relatively unstructured free text, but some smaller corpora exist that are well-structured."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-37",
"text": "DIEJOB thus assumes at least two corpora exist for the domain of interest: a large target corpus and a smaller structured corpus."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-38",
"text": "Further, it assumes that every document in these two corpora is associated with a particular entity, called title entity or subject entity."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-39",
"text": "Many widely-used corpora have this structure, including Wikipedia and the authoritative consumer-oriented websites we use, DailyMed and WebMD."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-40",
"text": "From each corpus, DIEJOB produces two types of mention sets: relation mention set R and concept mention set C. For the example of Figure 1 , R contains a sideEffect relation mention for \"stomach upset\" from the first sentence, and C may contain mentions of the Symptom concept, like \"stomach upset\" and \"nausea\" from the same sentence."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-41",
"text": "The tail argument values (such as \"nausea\" in sideEffect(meloxicam, nausea)) of a relation are often from a particular unary con-cept."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-42",
"text": "This is especially true in the biomedical domain, where for example, sideEffect takes instances of Symptom as the value range of its second argument."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-43",
"text": "Naively, those concept mentions in C could serve as a source to generate relation examples, but not all concept mentions are relation mentions: e.g., the Symptom mentions of \"confusion\" and \"mood changes\" from \"Symptoms of overdose may include: confusion, mood changes ...\" are not mentions of the sideEffect relation (or any other relation we currently extract)."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-65",
"text": "These corpora are all entity centric, i.e., each pages discusses a single entity."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-44",
"text": "For the structured corpus, the relation and concept mention sets are referred to as R s and C s , and for the target corpus as R t and C t ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-45",
"text": "Some special treatments (discussed in Section 2.3) are done while preparing R s and C s ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-46",
"text": "After producing R s , R t , C s and C t , DIEJOB builds a bipartite graph, following prior work (Lin, 2012) , in which the nodes are either mentions in the four sets, or features of these mentions, with edges between a mention and its features."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-47",
"text": "To distill a cleaner set of relation training examples, DIEJOB performs LP on the bipartite graph."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-48",
"text": "Only the mentions from R s are used as seed relation examples in this LP stage (because they are more accurate, see Section 2.3)."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-49",
"text": "Finally the distilled relation examples are used to train an ordinary SVM classifier over their extracted features."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-50",
"text": "DIEJOB thus finally learns to classify an unseen mention by the relation which holds between the mention and its corresponding title entity based on features of the mention-a convenient architecture to use for large-scale extraction."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-51",
"text": "Below we will describe the components of DIEJOB and the experiments in more detail."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-53",
"text": "**RELATIONS AND CORPORA**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-54",
"text": "Even large curated KBs are often incomplete and the situation is worse in the medical domain where the coverage of large KBs like Freebase is fairly limited."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-55",
"text": "We focus on extracting instances of eight relations, defined in Freebase, about drugs and diseases."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-56",
"text": "The drug relations are usedToTreat, condi-tionsThisMayPrevent, and sideEffect, and the concept types of their second arguments are DiseaseOrMedicalCondition, Dis-easeOrMedicalCondition, and Symptom."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-57",
"text": "The disease relations are hasTreatment, has-Symptom, riskFactor, hasCause, and preventionFactor, with corresponding concept types as MedicalTreatment, Symptom, RiskFactor, DiseaseCause, and Condi-tionPreventionFactor."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-58",
"text": "We are primarily concerned with extraction from large, authoritative sources."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-59",
"text": "Our target drug corpus, called DailyMed, is downloaded from dailymed.nlm.nih.gov and contains 28,590 XML documents."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-60",
"text": "Our target disease corpus, called WikiDisease, is extracted from a Wikipedia dump of May 2015 and contains 8,596 disease articles."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-61",
"text": "The structured drug corpus 1 , called WebMD, contains 2,096 pages collected from www.webmd.com."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-62",
"text": "Each page has the same sections, such as Uses and Side Effects, corresponding to usedToTreat/condi-tionsThisMayPrevent andsideEffect relations, respectively."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-63",
"text": "The structured disease corpus, called MayoClinic, contains 1,117 pages collected from www.mayoclinic.org."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-64",
"text": "Each page also has regular sections, such as Symptoms, Causes, Risk Factors, Treatments/Drugs, and Prevention, corresponding to hasSymptom, hasCause, risk-Factor, hasTreatment, and prevention-Factor, respectively."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-66",
"text": "We use GDep (Sagae and Tsujii, 2007), a dependency parser trained on GENIA Treebank, to parse the corpora, followed by a simple POS-tag based chunker to extract NPs."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-67",
"text": "We also extract a list (e.g. \"stomach upset, nausea, and dizziness\") for each coordinating conjunction that modifies a nominal."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-68",
"text": "For each NP mention, we extract features (described below) from its sentence; and for each coordinate list, we extract the similar features and the NP chunks included in it."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-69",
"text": "A mention not inside a list is regarded as a singleton list that contains only one item."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-71",
"text": "**MENTION PREPARATION**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-72",
"text": "Relation mention sets, i.e. R s and R t , are prepared with distant supervision."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-73",
"text": "The extracted NP mentions are distantly labeled using relation seed triples from Freebase (e.g. sideEffect(meloxican,nausea))."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-74",
"text": "Specifically, we require that the title entity matches the first argument value of the relation, and the NP mention matches the second argument value."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-75",
"text": "To improve the quality of R s , we also require that the section from which the mention was taken is relevant to the relation."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-76",
"text": "E.g., a mention labeled with the sideEffect relation must appear in a section entitled Side Effects."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-77",
"text": "Such constraint limits the number of mentions in R s ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-78",
"text": "In the next section, we will show how to extend this small but accurate example set to a larger training set of examples, with reasonable quality."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-79",
"text": "The concept mentions are designed to have high recall with respect to possible argument values for a relation."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-80",
"text": "For each relation r, we generate a set of concept mentions which lie in the range of r's second argument."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-81",
"text": "Following the DIEL system (Bing et al., 2015) , we extract concept instances from Freebase as seeds, and extend the seed set using LP in each corpus."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-82",
"text": "The reached coordinate-term lists and singleton lists (NPs) are collected as concept mentions."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-83",
"text": "Thus, we get two concept mention sets: C s from the structured corpus, and C t from the target corpus."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-84",
"text": "Note that some mentions in C s may come from unrelated sections; for instance, C s for the Symptom concept may contain mentions from the Overdose section, which cannot be examples of the sideEffect relation."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-85",
"text": "Therefore, we filter out the mentions in C s that are not from the appropriate section for this concept."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-86",
"text": "We emphasize that the section-specific processing is only done on the structured corpus, i.e. for C s and R s ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-87",
"text": "Our target corpora have thousands of section titles, most of which are not related in any way to the relations being extracted."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-88",
"text": "Thus the target relation mentions (R t ) and target concept mentions (C t ) are collected without considering section information."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-90",
"text": "**RELATION LABEL PROPAGATION**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-91",
"text": "With the relation mentions and the concept mentions lying in the range of the corresponding relation, we are able to distill a cleaner set of training relation examples to learn extractors."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-92",
"text": "R s contains more confident relation examples because of constraints by document structure, but it is limited in size."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-93",
"text": "In contrast, the number of R t mentions is larger, but they are noisier."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-94",
"text": "In general, the degree to which R t mentions will be useful may be domain-and corpus-specific."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-95",
"text": "C s and C t are generated with respect to the type of the men- tions, but not their relationship with the title entity: e.g., a mention in C t corresponding to the NP \"dizziness\" would not be associated with the triple sideEffect(meloxican,dizziness); and indeed, dizziness might be a condition treated by, not caused by, the title entity \"meloxican\"."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-96",
"text": "Therefore, C t itself cannot be directly used as relation examples, however, it can serve as a resource to distill relation examples."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-97",
"text": "In our experiments, R s mentions are always used as seed relation examples in LP, but we build bipartite propagation graphs with different combinations of the four sets of mentions and study their performance."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-98",
"text": "In total, we have 7 bipartite graphs, each with a different set of mentions from the following combinations:"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-99",
"text": "In a bipartite graph, one set of nodes are mentions, and the other set of nodes are features of mentions."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-100",
"text": "An edge is added between each feature and each mention containing that feature."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-101",
"text": "The edges are TFIDF-weighted (treating the features as words and the mentions as documents)."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-102",
"text": "Figure 3 shows such a bipartite graph (edge weights are omitted), which has four mentions on the left-hand side, and eight features on the right-hand side."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-103",
"text": "We use an existing multi-class label propagation method, namely, MultiRankWalk (MRW) (Lin and Cohen, 2010) , which is a graph-based SSL method related to personalized PageRank (PPR) (Haveliwala et al., 2003) (aka random walk with restart (Tong et al., 2006) )."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-104",
"text": "MRW can be viewed simply as computing a personalized PageRank vector for each class, each of which is computed using a personalization vector that is initially uniform over the seeds, and finally assigning to each node the class associ-ated with its highest-scoring vector."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-105",
"text": "MRW's final scores depend on the centrality of nodes, as well as their proximity to seeds."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-106",
"text": "The MRW implementation we use is based on ProPPR (Wang et al., 2013)."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-108",
"text": "**CLASSIFIER LEARNING**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-175",
"text": "One exception is DS Struct on the drug domain, where the recall is only 0.072."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-109",
"text": "Given the ranked mention lists of these relation labels from the above LP, we pick the top N to train binary classifiers, which can then be used to classify the entity mentions (singleton lists) and coordinate lists in a new document."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-110",
"text": "We use the same feature generator for both mentions and lists."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-111",
"text": "Shallow features include: tokens in the NPs, and character prefixes/suffixes of these tokens; BOW from the sentence containing the NP; and tokens and bigrams from a window around the NPs."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-112",
"text": "From dependency parsing, we find the verb which is closest ancestor of the head of current NP, all modifiers of this verb, and the path to this verb."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-113",
"text": "For lists, the dependency features are computed relative to the head of the list."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-114",
"text": "We use SVMs (Chang and Lin, 2001) and discard singleton features, as well as the most frequent 5% of features (as a stop-wording variant)."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-115",
"text": "Specifically, binary classifiers are trained with examples of one relation as the positives, and examples of the other classes as negatives."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-116",
"text": "We also add N general negative examples, randomly picked from those that are not distantly labeled by any relation."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-117",
"text": "A linear kernel and default values for all other parameters are used 2 ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-118",
"text": "A threshold 0.5 is used to cut positive and negative predictions."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-119",
"text": "If a new list or mention is not classified as positive by any classifier, it is predicted as \"other\"."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-121",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-123",
"text": "**EVALUATION DATASET**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-124",
"text": "Our evaluation dataset contains 20 manually labeled pages, 10 pages each from the disease corpus WikiDisease and the drug corpus DailyMed."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-125",
"text": "This data was originally generated in (Bing et al., 2016) ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-126",
"text": "The annotated text fragments are manually chunked NPs which are the second argument values of any of the eight relations considered here, with the title drug or disease entity of the corresponding document as the relation subject."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-127",
"text": "The evaluation data contains 436 triples for the disease domain and 320 triples for the drug domain."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-128",
"text": "A system's task then is 2 https://www.csie.ntu.edu.tw/ cjlin/libsvm/ to extract all correct values of the second argument of a given relation from a test document."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-130",
"text": "**EXPERIMENTAL COMPARISONS**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-131",
"text": "The first three baselines are distant supervision (DS) systems."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-132",
"text": "They classify each testing NP mention into one of the interested relation types or \"other\", using naive matching to the Freebase seed triples as distant supervision."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-133",
"text": "Each sentence in the corpus is processed with the same preprocessing pipeline to detect NPs."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-134",
"text": "Then, these NPs are labeled with the Freebase seed triples."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-135",
"text": "The features are defined and extracted in the same way as we did for DIEJOB, and binary classifiers are trained with the same method."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-136",
"text": "We also compare against two latent variable learners."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-137",
"text": "The first is MultiR (Hoffmann et al., 2011) which models each relation mention separately and aggregates their labels using a deterministic OR."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-138",
"text": "The second one is MIML- RE (Surdeanu et al., 2012) which has a similar structure to MultiR, but uses a classifier to aggregate the mention level predictions into an entity pair prediction."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-139",
"text": "We used the publicly available code from the authors 3 for our experiments."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-140",
"text": "Since these methods do not distinguish between structured and unstructured corpora, we used the union of these corpora in our experiments, and the feature set used in the bipartite graph."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-141",
"text": "We found that the performance of these methods varies significantly with the number of negative examples used during training, and hence we tuned these and other parameters 4 directly on the evaluation data, and report their best performance."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-142",
"text": "Another distantsupervision baseline we compare to is the Mintz++ model from (Surdeanu et al., 2012) , which improves on the original model from (Mintz et al., 2009 ) by training multiple classifiers, and allowing multiple labels per entity pair."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-143",
"text": "We also compare with DIEBOLDS (Bing et al., 2016) , which uses LP on a graph containing entity mention pairs."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-144",
"text": "The graph used by DIEBOLDS is more complex than the mention-feature graph used here, in DIEJOB."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-145",
"text": "One set of vertices correspond to (title-entity, mention-entity) pairs."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-146",
"text": "The other set of vertices are identifiers for coordinate lists: a mention pair is connected with the lists from any document describing the subject, and containing the mention."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-147",
"text": "Additional edges are also introduced based on document structure and BOW context features."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-148",
"text": "DIEBOLDS performs label propagation from the mention pairs distantly labeled with Freebase relation triples."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-150",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-151",
"text": "We extracted triples of the eight relations from Freebase as distant labeling seeds."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-152",
"text": "Specifically, if the subject of a triple matches with a drug or disease name in a corpus and its object value also appears in that document, it is extracted."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-153",
"text": "For the disease domain, we get 2022, 2453, 905, 753, and 164 triples for hasTreatment, hasSymptom, riskFactor, hasCause, and prevention-Factor, respectively."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-154",
"text": "For the drug domain, we get 3112, 315, and 265 triples for usedTo-Treat, conditionsThisMayPrevent, and sideEffect, respectively."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-155",
"text": "We have two strategies to pick the top N lists for classifier learning."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-156",
"text": "One strategy picks the top N directly, without distinguishing if they come from the structured corpus or the target corpus."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-157",
"text": "It is referred to as DIEJOB Both."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-158",
"text": "The other strategy picks the top N examples only from the target corpus, and it is referred to as DIEJOB Target."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-159",
"text": "Here our concern is the difference between the feature distributions of the two corpora."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-160",
"text": "We evaluate the performance of different systems from an IR perspective: a title entity (i.e., document name) and a relation together act as a query, and the extracted NPs as retrieval results."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-162",
"text": "**RESULTS ON LABELED PAGES**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-163",
"text": "The results for precision, recall and F1 measure are given in Table 1 ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-164",
"text": "The results for DIEBOLDS are from (Bing et al., 2016) ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-165",
"text": "The systems with \"*\" are directly tuned on the evaluation data and should be considered as upper bounds on true per- (Note that for the disease domain, DIEJOB Both and DIEJOB Both* get the same results, because they use the same parameters, although they are tuned with different data.)"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-166",
"text": "DIEJOB Both outperforms all the other systems."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-167",
"text": "Compared with MultiR, Mintz++, and MIML-RE, the relative improvements under the F1 measure are 22% to 30% in the disease domain, and 72% to 110% in the drug domain."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-168",
"text": "The precision values of DIEJOB Both are much higher than previous work."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-169",
"text": "For recall, DIEBOLDS and DIEJOB Both's performance are comparable to the latent-variable systems on the disease domain and much better on the drug domain."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-170",
"text": "One reason may be that our method predicts one label for a coordinate-term list (lists are common in the drug domain), which implicitly coordinates the labels of list items, while MultiR, Mintz++, and MIML-RE break a list into individual items which are predicted separately."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-171",
"text": "The precision values of DIEBOLDS are much lower than DIEJOB, especially for the drug domain."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-172",
"text": "Unlike DIEJOB, DIEBOLDS builds an LP graph containing all singleton and coordinate lists of noun phrases in the corpus, which introduces many irrelevant examples."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-173",
"text": "DIEBOLDS achieves the highest recall values, but in practice, it is also likely to predict a testing mention as belonging to one of the eight relations, but not \"other\"."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-174",
"text": "On these tasks, the simple DS baselines' performance is competitive with MIML-RE and the other complex models."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-176",
"text": "This is For the disease domain, DIEJOB Both performs better than DIEJOB Target, no matter how they are tuned (i.e. on tuning or evaluation data)."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-177",
"text": "This shows that the mentions from R s and C s of MayoClinic corpus provide good training examples."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-178",
"text": "For the drug domain, DIEJOB Both and DIEJOB Target achieve similar results."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-179",
"text": "This may be because DIEJOB Both is more sensitive to the difference in feature distributions of structured and target corpora, since it uses examples from the structured corpus to learn classifiers as well."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-180",
"text": "Among the four corpora we use, WebMD, MayoClinic, and WikiDisease are written to be readable by a large audience, while DailyMed articles are more difficult in terms of readability: hence the difference between the structured and unstructured corpora is larger in the drug domain."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-181",
"text": "Precision-recall curves are given in Figure 4 ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-182",
"text": "For the drug domain, DIEJOB's precision is consistently better, at the same recall level, than any of the other methods."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-183",
"text": "For the disease domain, our system's precision is generally better after the recall level 0.05."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-185",
"text": "**TUNING AND VARIANT COMPARISON**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-186",
"text": "Here we examine the performance of different variants, and the effect of the parameter N ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-187",
"text": "The performance of all graph variants on a tuning dataset (containing 10 labeled pages) is given in Figure 5 ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-188",
"text": "Combined with the strategies for picking top N (i.e. DIEJOB Target and DIEJOB Both), there are 13 variants: shown in Figures 5a and 5b For the disease domain, the same variant under DIEJOB Both and DIEJOB Target performs similarly, and on average, DIEJOB Both is slightly better than DIEJOB Target."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-189",
"text": "For the drug domain, on average, DIEJOB Target is better than DIEJOB Both."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-190",
"text": "One explanation is that the two corpora in disease domain are similar in the aspect of feature distribution, so in general, mixing the examples from them are beneficial."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-191",
"text": "However, the effect of such a mixture is negative for drug domain, whose structured and target corpora are dissimilar."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-192",
"text": "In Table 1 , the reported results of the tuned DIEJOB Both and DIEJOB Target for the disease domain are from the variants R s C s and R s C s R t respectively, while for drug domain, both are from R s R t ."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-193",
"text": "One explanation could be: (1) if the structured corpus is similar to the target corpus, it is better to use DIEJOB Both, and including examples of the structured corpus (e.g., R s C s and R s C s R t , both have C s used) generally performs well with a larger N value; (2) if the structured and target corpora are dissimilar, DIEJOB Target is better and R s R t has an advantage over other variants where the main focus is distilling good training examples from R t and a smaller number of top N examples is preferred."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-194",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-195",
"text": "**RELATED WORK**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-196",
"text": "To overcome the noise in distantly-labeled examples, (Riedel et al., 2010) introduced an \"at least one\" heuristic, where instead of taking all mentions for a pair as correct examples only at least one of them is assumed to express that relation."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-197",
"text": "MultiR (Hoffmann et al., 2011) and MIML-RE (Surdeanu et al., 2012) extend this approach to support multi- ple relations expressed by different sentences in a bag."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-198",
"text": "Unlike these approaches, DIEJOB improves the quality of training data with a bootstrapping step before feeding the noisy examples into a learner, by using the confident examples from a structured corpus as seeds."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-199",
"text": "The benefit of this step is twofold."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-200",
"text": "First, it distills the distantly-labeled examples by propagating labels from good seed examples, and downweights the noisy ones."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-201",
"text": "Second, the propagation will walk to more relation examples in the concept mention set that cannot be distantly labeled with triples from knowledge bases."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-202",
"text": "Document structure was previously explored by (Bing et al., 2016) , which used the structure to enrich an LP graph by adding coupling edges between mentions in the same section of particular documents."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-203",
"text": "In this work, we explore the semantic association between section titles and relation arguments."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-204",
"text": "Furthermore, we perform a joint bootstrapping on relation and type mentions to collect training examples with better quality."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-205",
"text": "Technically, the propagation graphs used are different: DIEJOB's graph has carefully produced mention nodes (from those four sets) and their feature nodes, while DIEBOLDS's graph has triple nodes (i.e., subject-NP pairs) and all singleton and coordinate lists of noun phrases of the corpora."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-206",
"text": "Accordingly, their propagation seeds are different: DIEJOB uses confident examples as seeds (labeled from particular sections of a structured cor-pus) to propagate labels to more examples via feature similarity, while DIEBOLDS directly uses Freebase triples as seeds and propagates labels through edges built from coordinate lists and sections."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-207",
"text": "In the classic bootstrap learning scheme (Riloff and Jones, 1999; Agichtein and Gravano, 2000; Bunescu and Mooney, 2007) , a small number of seed instances are used to extract new patterns from a large corpus, which are then used to extract more instances."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-208",
"text": "Then in an iterative fashion, new instances are used to extract more patterns."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-209",
"text": "DIEJOB departs from earlier bootstrapping methods in combining label propagation with a standard classification learner, and it can improve the quality of distant examples and collect new examples simultaneously."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-210",
"text": "----------------------------------"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-211",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-212",
"text": "We proposed the DIEJOB framework to generate good examples for distantly-supervised IE."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-213",
"text": "It exploits the document structure of a small wellstructured corpus to collect seed relation examples, and it also collects concept mentions that could be the second argument values of relations."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-214",
"text": "DIEJOB then conducts label propagation to find mentions that can be confidently used as training examples to train classifiers for labeling new entity mentions."
},
{
"sent_id": "59b6eaca400342159b867d018d4042-C001-215",
"text": "The experimental results show that this approach consistently and significantly outperforms state-ofthe-art approaches."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"59b6eaca400342159b867d018d4042-C001-10",
"59b6eaca400342159b867d018d4042-C001-11"
],
[
"59b6eaca400342159b867d018d4042-C001-12"
],
[
"59b6eaca400342159b867d018d4042-C001-14"
],
[
"59b6eaca400342159b867d018d4042-C001-19"
],
[
"59b6eaca400342159b867d018d4042-C001-25"
],
[
"59b6eaca400342159b867d018d4042-C001-197"
],
[
"59b6eaca400342159b867d018d4042-C001-202"
],
[
"59b6eaca400342159b867d018d4042-C001-207"
]
],
"cite_sentences": [
"59b6eaca400342159b867d018d4042-C001-11",
"59b6eaca400342159b867d018d4042-C001-12",
"59b6eaca400342159b867d018d4042-C001-14",
"59b6eaca400342159b867d018d4042-C001-19",
"59b6eaca400342159b867d018d4042-C001-25",
"59b6eaca400342159b867d018d4042-C001-197",
"59b6eaca400342159b867d018d4042-C001-202",
"59b6eaca400342159b867d018d4042-C001-207"
]
},
"@USE@": {
"gold_contexts": [
[
"59b6eaca400342159b867d018d4042-C001-103"
],
[
"59b6eaca400342159b867d018d4042-C001-114"
],
[
"59b6eaca400342159b867d018d4042-C001-124",
"59b6eaca400342159b867d018d4042-C001-125"
],
[
"59b6eaca400342159b867d018d4042-C001-136",
"59b6eaca400342159b867d018d4042-C001-137"
],
[
"59b6eaca400342159b867d018d4042-C001-143"
],
[
"59b6eaca400342159b867d018d4042-C001-163",
"59b6eaca400342159b867d018d4042-C001-164"
]
],
"cite_sentences": [
"59b6eaca400342159b867d018d4042-C001-103",
"59b6eaca400342159b867d018d4042-C001-114",
"59b6eaca400342159b867d018d4042-C001-125",
"59b6eaca400342159b867d018d4042-C001-137",
"59b6eaca400342159b867d018d4042-C001-143",
"59b6eaca400342159b867d018d4042-C001-164"
]
}
}
},
"ABC_a0bd41c3653073dd79e19d3ddc8d14_8": {
"x": [
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-2",
"text": "A ubiquitous task in processing electronic medical data is the assignment of standardized codes representing diagnoses and/or procedures to free-text documents such as medical reports."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-21",
"text": "Moreover, the non-uniformity of distributions of diseases and procedures results in large number of sparse classes, for which very few positive training cases are available."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-22",
"text": "Data sparsity can also be a problem when code set revisions introduce new codes for which no annotated data is initially available."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-23",
"text": "Hence, there is a need for machine learning approaches which are generally more robust to data sparsity."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-24",
"text": "We propose a new end-to-end neural network model to solve the prediction of codes from medical reports as a multi-task classification problem, achieving a new state-of-the-art result on the MIMIC-III corpus, which is the largest publicly available dataset for this task."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-25",
"text": "Our model benefits from two novel contributions, namely multi-view CNN channels and label-dependent attention layers tuned to label descriptions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-26",
"text": "That is, our model exploits the description of the codes for regularizing the attention for each individual classifier, reducing the effect of data sparsity."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-27",
"text": "We also demonstrate the benefit of using all notes, in contrast to only using the discharge summaries as in previous studies."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-29",
"text": "**RELATED WORK**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-30",
"text": "There has been significant work towards the automated coding problem (Perotte et al., 2013; Kavuluru et al., 2015; Wang et al., 2016; Scheurwegs et al., 2017; Prakash et al., 2017; Rajkomar et al., 2018; Amoia et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-31",
"text": "We review some of the recent relevant work."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-32",
"text": "Perotte et al. (2013) relied on the MIMIC-II dataset and experimented with flat and hierarchical support vector machines (SVMs) on tfidf features."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-33",
"text": "For hierarchical SVMs, they exploit the knowledge of ICD-9 code hierarchies."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-34",
"text": "They demonstrated that using hierarchical SVMs increased the recall for sparse classes and achieved superior performance compared to flat SVMs."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-35",
"text": "Kavuluru et al. (2015) built classifiers for ICD-9 diagnosis codes over three datasets, the biggest of which included around 71K EMR discharge summaries and 1,231 distinct codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-36",
"text": "They performed feature selection and used a variety of methods such as SVM, na\u00efve Bayes, and logistic regression for this problem."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-37",
"text": "Using an ensemble of these classifiers, they achieved a micro F1-score of 0.479 on their biggest corpus."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-38",
"text": "Baumel et al. (2018) used the publicly available MIMIC-III and MIMIC-II datasets and proposed new deep models including a CNN model and a hierarchical GRU model with label-dependent attention layer for ICD-9 diagnosis code prediction."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-39",
"text": "Their best performance on MIMIC-III was achieved by a CNN model, obtaining a micro F1-score of 40.7% for diagnosis codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-40",
"text": "Mullenbach et al. (2018) presented a model capable of predicting full codes for both ICD-9 diagnoses and procedures composed of shared embedding and CNN layers between all codes and an individual attention layer for each code."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-41",
"text": "They also proposed adding regularization to this model using code descriptions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-42",
"text": "Their best model on MIMIC-III reached a micro F1-score of 53.9%, which was achieved by their base model without regularization."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-43",
"text": "Wang et al. (2018) proposed a model which jointly captures the words and the label embeddings and exploits the cosine similarity between them in predicting the labels."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-44",
"text": "They applied this model to the task of predicting only the most frequent 50 codes in MIMIC-III, which they accomplished with a micro F1-score of 61.9%."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"text": "Our approach is most similar to the current state-of-the-art model by Mullenbach et al. (2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-46",
"text": "As in their study, we use a CNN layer with attention modules by code and approach regularization using code descriptions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-47",
"text": "Our model has notable departures from theirs, however: firstly, we use multi-view CNN channels with max pooling across the channels, which in itself leads to improvements over their model (even before attention regularization)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-48",
"text": "Secondly, they did not demonstrate any improvements by using code descriptions in regularizing their model on MIMIC-III, when predicting full codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-49",
"text": "Whereas they regularize the last layer, we regularize the attention layer, leading to improvements over our base model."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-50",
"text": "Our use of independent attention layer for each code is also similar to the approach used by Baumel et al. (2018) , where they used shared GRU layers over the sentences across the labels and then performed label-dependent attention pooling for each class."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-51",
"text": "However, their model is RNN based and ours is CNN based, and they did not achieve superior performance for this model compared to a fully shared CNN model with max pooling when predicting all diagnosis codes in MIMIC-III."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-53",
"text": "**DATABASE**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-54",
"text": "We rely on the publicly available MIMIC-III dataset (Johnson et al., 2016) for ICD-9 code predictions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-55",
"text": "2 This dataset includes the elec-tronic medical records (EMR) of inpatient stays in a hospital critical care unit."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-56",
"text": "MIMIC-III includes raw notes for each hospital stay in different categories-discharge summary report, discharge summary addendum, radiology note, nursing notes, etc."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-57",
"text": "The number of notes varies between different hospital stays."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-58",
"text": "Also, some of the hospital stays do not have discharge summaries; following previous studies for automated coding, we only consider those that do (Perotte et al., 2013; Baumel et al., 2018; Mullenbach et al., 2018; Wang et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-59",
"text": "We have three sets for our experiments: one including only the discharge summaries, which allows us to compare our results with previous studies on this corpus (Mullenbach et al., 2018; Perotte et al., 2013) , hereon the Dis set; one on the concatenation of all patient notes, hereon, Full set; and one on another set which includes only discharge summary samples with the 50 most frequent codes, hereon, Dis-50 set, for comparison to previous studies (Mullenbach et al., 2018; Wang et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-60",
"text": "The dataset includes 8,929 unique ICD codes (2,011 procedures, 6,918 diagnoses) for the patients who have discharge summaries."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-61",
"text": "3 We follow the train, test, and development splits publicly shared by the recent study on this dataset (Mullenbach et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-62",
"text": "These splits are patient independent."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-63",
"text": "The statistical properties of all the sets are shown in Table 1 ; note that there are around three times more tokens for the each hospital admission for the Full set compared to the Dis set."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-64",
"text": "Note that Dis-50 includes far fewer training instances because any instances which do not include any of the 50 most frequent codes are discarded."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-65",
"text": "For preprocessing the text, we convert all characters to lower case and remove tokens which only include numbers."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-66",
"text": "We build the vocabulary from the training set and consider words occurring in fewer than three training samples as out of vocabulary (OOV)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-67",
"text": "This results in 51,917 unique words for the Dis and Dis-50 set and 72,891 for the Full set."
},
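The preprocessing described above can be sketched as follows. This is a minimal illustration assuming whitespace tokenization (the paper's tokenizer is not specified) over a toy three-note corpus:

```python
from collections import Counter

def preprocess(note):
    """Lower-case the note and drop tokens that consist only of digits."""
    return [t for t in note.lower().split() if not t.isdigit()]

def build_vocab(train_notes, min_docs=3):
    """Keep words occurring in at least `min_docs` training samples; the rest are OOV."""
    doc_freq = Counter()
    for note in train_notes:
        doc_freq.update(set(preprocess(note)))
    return {w for w, df in doc_freq.items() if df >= min_docs}

# Hypothetical toy notes: "123" is dropped (numeric-only); only "chest"
# appears in at least three samples.
notes = ["Chest pain 123", "chest pain resolved", "CHEST x-ray clear"]
vocab = build_vocab(notes, min_docs=3)
```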
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-69",
"text": "**METHOD**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-70",
"text": "We approach the task of predicting ICD codes from medical notes as a multi-task binary classification problem in which each code in each hos-pital admission can be present (labeled 1) or absent (labeled 0)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-71",
"text": "We build our model with an embedding layer stacked with multi-view CNNs to selectively capture the relationships between a set of n-gram embeddings and ICD codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-72",
"text": "We use max pooling across these CNN channels and rely on attention spatial pooling."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-73",
"text": "While the embedding layer and multi-view CNNs are shared between all codes, we consider individual attentions for different codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-74",
"text": "Separately modeling the attention can help in interpreting the predicted labels."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-75",
"text": "This constitutes our base model, hereon, multi-view convolution with label-dependent attention pooling (MVC-LDA)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-76",
"text": "We enhance this model by using the natural-language descriptions of ICD codes to regularize the attention layers during training, enforcing similar codes in description embedding space to have similar attentions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-77",
"text": "We call this model multi-view convolution with regularized label-dependent attention pooling (MVC-RLDA)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-78",
"text": "The architectures of these models are visualized in Figure 1 ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-80",
"text": "**EMBEDDING LAYER**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-81",
"text": "The first layer of our model maps words to their continuous embedding space."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-82",
"text": "Each word w \u2208 R dv is mapped to x \u2208 R de using the embedding weight W e \u2208 R dv\u00d7de , where d v is the vocab size and d e is the embedding dimensionality."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-83",
"text": "We consider all embedded words from one input note with length l as X = [x 0 , x 1 , ..., x l\u22121 ]"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-84",
"text": "T \u2208 R l\u00d7de ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-85",
"text": "Our pilot experiments demonstrated enhancement of the classification results when we used pre-trained embeddings compared to random initialization."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-86",
"text": "Hence, we pre-train the embedding layer on all text in the training set using the Gensim implementation of the continuous bag-of-words (CBOW) word2vec approach (Mikolov et al., 2013a,b) , with an embedding size of 100, trained over a window size of 5, with no minimum count for 5 epochs."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-88",
"text": "**MULTI-VIEW CONVOLUTIONAL LAYER**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-89",
"text": "Our goal behind using multiple channels with different kernel sizes is that the underlying informative n-gram in the input for each code can vary in length according to the word neighborhood, and using multiple different field views within a CNN has the potential to capture that."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-90",
"text": "Our multi-view convolutional layer consists of 4 convolutional channels with different kernel sizes (s, s \u2212 2, s \u2212 4 and s \u2212 6, where s is the biggest kernel size), and the same number of filters with stride of 1 (Fig. 1) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-91",
"text": "To preserve the input length, we perform zero-padding on the input to this layer, X ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-92",
"text": "We apply max pooling across the outputs of these four channels."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-93",
"text": "For the n th word of the input, and assuming an odd kernel size for the i th channel, Equation 1 calculates the output of this layer, where s i is the kernel size, and W i \u2208 R s i \u00d7de\u00d7dc is the convolution weight."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-94",
"text": "After cross-channel max pooling for the whole input, the output of this layer is C = [c 0 , c 1 , ..., c l\u22121 ] \u2208 R l\u00d7dc ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-95",
"text": "This convolutional layer is shared across classes, and therefore is assumed to capture the relevant n-grams for all of them."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-96",
"text": "Note that multi view CNNs have been used before by Kim (2014) for sentence classification."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-97",
"text": "However, Kim used spatial max pooling over the CNN channels and concatenated them, whereas we use max pooling across the channels to select the most relevant n-gram for each filter."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-98",
"text": "Therefore, our method flexibly picks the most salient channels according to the input."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-99",
"text": "(1)"
},
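The multi-view convolution with cross-channel max pooling can be sketched in plain Python as below; the tanh nonlinearity and the single-filter kernel shape are assumptions for illustration:

```python
import math

def conv1d_same(X, W):
    """'Same' 1-D convolution over a sequence of d_e-dim word vectors.
    X: l x d_e nested lists; W: one filter of shape s x d_e (odd s, as in the paper).
    Returns a length-l list of scalars, zero-padded at the borders."""
    l, s = len(X), len(W)
    half = s // 2
    out = []
    for n in range(l):
        acc = 0.0
        for k in range(s):
            j = n + k - half
            if 0 <= j < l:  # positions outside the input contribute zero
                acc += sum(wk * xk for wk, xk in zip(W[k], X[j]))
        out.append(math.tanh(acc))  # nonlinearity assumed for illustration
    return out

def multi_view(X, kernels):
    """Cross-channel max pooling: elementwise max over the channel outputs."""
    channels = [conv1d_same(X, W) for W in kernels]
    return [max(vals) for vals in zip(*channels)]

# Toy input with d_e = 1 and two channels (kernel sizes 1 and 3).
X = [[1.0], [2.0], [3.0]]
out = multi_view(X, [[[1.0]], [[1.0], [1.0], [1.0]]])
```

At each position the output keeps whichever channel responds most strongly, which is how the model "flexibly picks the most salient channels according to the input".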
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-101",
"text": "**ATTENTION LAYER**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-102",
"text": "For spatial pooling of the convolutional outputs, we rely on an attention mechanism."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-103",
"text": "We consider separate attention layers for each class of output (Fig. 1) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-104",
"text": "Since there is a large number of output classes (8, 929) , this helps the model in attending to relevant parts of input for each output separately."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-105",
"text": "For modeling the attention for each class, we use a linear layer with weight V j \u2208 R dc for the j th class."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-106",
"text": "The attention for the input C and class j is calculated by CV j ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-107",
"text": "The derived attentions are used to weight each frame from the convolutional layer output before pooling them."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-108",
"text": "Equation 2 shows how this function is performed, where P j \u2208 R dc is the pooled output (see attention pooling in Figure 1 )."
},
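A minimal sketch of the label-dependent attention pooling: scores C V_j are computed per position and used to weight the convolutional outputs before summing. The softmax normalization is an assumption (a standard choice), since Equation 2 itself is not reproduced here:

```python
import math

def attention_pool(C, V_j):
    """Label-dependent attention pooling for one code j.
    C: l x d_c conv outputs; V_j: d_c attention weights for class j.
    Returns P_j (d_c): the attention-weighted sum over positions."""
    scores = [sum(c * v for c, v in zip(row, V_j)) for row in C]  # C V_j
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    alpha = [e / Z for e in exps]  # softmax over positions (assumed)
    d_c = len(V_j)
    return [sum(alpha[n] * C[n][k] for n in range(len(C))) for k in range(d_c)]

# The first position matches V_j strongly, so P_j is dominated by it.
P = attention_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```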
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-110",
"text": "**OUTPUT LAYER**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-111",
"text": "The output for each class is a dense layer with sigmoid nonlinearity."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-112",
"text": "During testing, output values greater than 0.5 are assigned as present (1) and the rest are assigned as absent (0)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-113",
"text": "Note that since for each class the majority of the training samples have output of 0 (e.g., even the most frequent code in the training examples occurs in only 37% of training instances), the network is strongly biased towards negative predictions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-114",
"text": "In our preliminary experiments, we found that the network usually tends to under-code-i.e., the cardinality of predictions were lower than the ground truth."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-115",
"text": "We found a positive and statistically significant Pearson's correlations between the input length and the number of ground truth codes (for training samples in Dis set: \u03c1 = 0.479, p < .001, and in Full set: \u03c1 = 0.557, p < .001)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-116",
"text": "Therefore, we use the input length as an extra conditioning input to the output layer to shift the bias of the output sigmoid layer from zero accordingly, to cope with the problem of under-coding to some extent."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-117",
"text": "We embed the input length using Equation 3, where T j is the embedding function for the j th class, l is the input length for an arbitrary sample, K j \u2208 R is the layer weight, and d j \u2208 R is its bias."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-118",
"text": "Note that the under-coding may differ from one class to another due to the difference in their occurrence frequencies, hence we use separate length embedding functions for each class to capture these differences."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-119",
"text": "Moreover, using a nonlinear function such as sigmoid in the embedding function has more flexibility and helps the model to generalize better to unseen input lengths in the two extremes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-120",
"text": "The embedded length is incorporated in the output layer as shown in Equation 4, where U j \u2208 R dc is the weight and b j \u2208 R is the bias, and y j \u2208 R is the prediction for the j th class."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-121",
"text": "Note that T j (l) shifts the bias of the output sigmoid layer according to the input length (Fig. 1) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-122",
"text": "We use the binary cross entropy loss function on the output of this layer as shown in Equation 5 , where g j is the ground truth for the j th class."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-123",
"text": "For each batch, the loss function is calculated for each sample and is averaged across them."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-124",
"text": "MVC-LDA is trained to minimize this loss function."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-125",
"text": "The top blocks in Figure 1 summarize the entire MVC-LDA model."
},
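A sketch of the length-conditioned output layer described in Equations 3 and 4. The exact way T_j(l) enters the logit is an assumption based on the statement that it shifts the bias of the output sigmoid:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def length_embedding(l, K_j, d_j):
    """T_j(l): per-class sigmoid embedding of the input length (Equation 3, sketched)."""
    return sigmoid(K_j * l + d_j)

def predict(P_j, U_j, b_j, l, K_j, d_j):
    """y_j: sigmoid output whose bias is shifted by T_j(l) (Equation 4, sketched)."""
    logit = sum(u * p for u, p in zip(U_j, P_j)) + b_j + length_embedding(l, K_j, d_j)
    return sigmoid(logit)

# With K_j > 0, a longer note raises the prediction, counteracting under-coding.
y_short = predict([0.0, 0.0], [1.0, 1.0], 0.0, 0, 0.1, 0.0)
y_long = predict([0.0, 0.0], [1.0, 1.0], 0.0, 100, 0.1, 0.0)
```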
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-127",
"text": "**REGULARIZING ATTENTION BY LABEL DESCRIPTION**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-128",
"text": "As mentioned earlier, many classes are quite rare in the data, so their attention modules are trained on very few examples."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-129",
"text": "To better handle these cases of sparsity, we relied also on the label descriptions included in MIMIC-III (e.g., 518.81: 'Acute respiratory failure'; 37.22: 'Left heart cardiac catheterization')."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-130",
"text": "We hypothesized that a code's description is semantically and lexically similar to the segments of input text that contain positive evidence for that code."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-131",
"text": "Thus, we devised a means of directing attention via regularization, constraining the attention weight V j for the j th class by its description."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-132",
"text": "We map the description of labels to an embedding space using a nonlinear function f , a neural network composed of an embedding layer tied with W e , a convolutional layer with kernel size of s, and d c filters, a spatial max pooling, and a nonlinear dense output layer with sigmoid function (See the blue blocks in Figure 1) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-133",
"text": "Suppose the description of the j th label is D j \u2208 R lw ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-134",
"text": "During training, whenever the gold standard contains the j th code, we add a regularization term to the loss function in Equation 5, resulting in the loss function shown in Equation 6, where g j is the ground truth label for the j th class and \u03bb specifies the weight of the regularization in the new loss function."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-135",
"text": "Adding this extra term in the loss function constrains the training of the attention weights to avoid overfitting, particularly for classes with few training samples, by pushing the attention weights to be closer to the description embeddings for each class."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-136",
"text": "Moreover, this regularization indirectly pushes the attention for classes with similar descriptions to be closer to each other."
},
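The regularized objective (Equation 6) can be sketched as binary cross-entropy plus a description-matching penalty applied only to codes present in the gold standard; the squared-L2 form of the penalty is an assumption, since the equation body is not reproduced here:

```python
import math

def bce(y, g):
    """Binary cross-entropy for one class (Equation 5, sketched)."""
    eps = 1e-12
    return -(g * math.log(y + eps) + (1 - g) * math.log(1 - y + eps))

def regularized_loss(y, g, V, D_emb, lam):
    """BCE plus, for codes in the ground truth, a penalty pulling the attention
    weights V_j toward the description embedding f(D_j) (squared-L2 assumed)."""
    loss = 0.0
    for j in range(len(y)):
        loss += bce(y[j], g[j])
        if g[j] == 1:  # regularize only when code j is in the gold standard
            loss += lam * sum((v - d) ** 2 for v, d in zip(V[j], D_emb[j]))
    return loss

base = regularized_loss([0.5], [1], [[1.0, 0.0]], [[0.0, 0.0]], 0.0)
reg = regularized_loss([0.5], [1], [[1.0, 0.0]], [[0.0, 0.0]], 0.1)
```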
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-138",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-140",
"text": "**BASELINES**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-141",
"text": "We compare our approach with four baselines: flat and hierarchical SVMs (Perotte et al., 2013), LEAM (Wang et al., 2018) , and CAML (Mullenbach et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-142",
"text": "For flat and hierarchical SVMs, we follow the approach of Perotte et al. (2013) , considering 10,000 tf-idf unigram features, training 8,929 binary SVMs for the flat SVMs and 11,693 binary SVMs for hierarchical SVMs."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-143",
"text": "For the hierarchical SVMs, we use the ICD-9-CM hierarchy from bioportal."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-144",
"text": "4 For flat SVMs, a code is considered present if its SVM predicts a positive output."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-145",
"text": "For hierarchical SVMs, a code is considered present if the SVMs for the code and SVMs for the all parents of the code are positive."
},
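The hierarchical decision rule can be sketched as follows: a code is predicted present only if its own SVM and those of all its ancestors are positive. The code identifiers and mappings below are illustrative:

```python
def hierarchical_predict(code, parents, svm_pos):
    """Walk from `code` to the root; present (1) only if every SVM on the path
    fires positive. `parents` maps code -> parent (None at the root);
    `svm_pos` maps code -> binary SVM decision."""
    while code is not None:
        if not svm_pos.get(code, 0):
            return 0
        code = parents.get(code)
    return 1

# Hypothetical fragment of the ICD-9 hierarchy.
parents = {"518.81": "518.8", "518.8": "518", "518": None}
svm_pos = {"518.81": 1, "518.8": 1, "518": 1}
```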
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-146",
"text": "LEAM (Wang et al., 2018) learns the joint representation of labels and input embeddings and uses their cosine similarity in predicting the codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-147",
"text": "We compare our model with their results on the Dis-50 set."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-148",
"text": "CAML (Mullenbach et al., 2018) has achieved the best state-of-the-art results on MIMIC-III."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-149",
"text": "CAML is composed of a stack of an embedding layer, a CNN layer, and label-dependent attention layers."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-150",
"text": "We run their model on the Dis set using their publicly available code."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-151",
"text": "5 The only difference here is that we found slightly more unique codes: 8,929 to their 8,921."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-153",
"text": "**EVALUATION METRICS**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-154",
"text": "The most widely used metric for evaluating ICD code prediction is micro F1-score (Perotte et al., 2013; Wang et al., 2018; Mullenbach et al., 2018; Kavuluru et al., 2015) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-155",
"text": "New studies have reported results on macro F1-score, precision@n, and AUC of ROC as well (Wang et al., 2018; Mullenbach et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-156",
"text": "As evaluation metric we rely on micro F1-score (micro F1), macro F1-score (macro F1) for the top 50 codes, area under the precisionrecall curve (PR AUC) and precision@n (P@8 when evaluating the models on all codes and P@5 when evaluating the models for the top 50 codes)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-157",
"text": "For obtaining the micro F1, the micro precision and recall are calculated by collapsing all classes into one class and considering the task as a single binary classification."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-158",
"text": "Therefore, micro F1 weighs all class occurrences similarly."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-159",
"text": "On the other hand, macro precision and recall are calculated by evaluating the precision and recall for each class and then averaging those values across the classes, weighting all classes similarly."
},
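The micro vs. macro F1 distinction described above can be made concrete with a small sketch over per-sample binary label vectors:

```python
def micro_macro_f1(preds, golds):
    """preds, golds: lists of per-sample 0/1 vectors over the label set.
    Micro-F1 pools all (sample, label) decisions into one binary problem;
    macro-F1 averages the per-class F1 scores, weighting classes equally."""
    n_classes = len(golds[0])
    tp = [0] * n_classes; fp = [0] * n_classes; fn = [0] * n_classes
    for p, g in zip(preds, golds):
        for j in range(n_classes):
            tp[j] += p[j] and g[j]
            fp[j] += p[j] and not g[j]
            fn[j] += (not p[j]) and g[j]
    def f1(t, f_p, f_n):
        return 2 * t / (2 * t + f_p + f_n) if t else 0.0
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(n_classes)) / n_classes
    return micro, macro

# Class 0 is predicted perfectly, class 1 not at all: macro averages them,
# while micro pools the individual decisions.
micro, macro = micro_macro_f1([[1, 1], [1, 0]], [[1, 0], [1, 1]])
```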
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-160",
"text": "However, since the number of ground truth occurrences for 54% of the codes in Dis and Full sets are zero (see Table 1 ), the recall for those values can not be calculated."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-161",
"text": "Therefore, we do not report macro F1 when evaluating the models on all codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-162",
"text": "For the Dis-50 set, however, there is no such problem, as all testing codes occur in training."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-163",
"text": "Furthermore, depending on the application, one may tune the threshold for binary classification."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-164",
"text": "Hence, we also report PR AUC, which provides the area under the curve of micro recall versus micro precision for thresholds between 0 and 1."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-165",
"text": "We do not use ROC AUC as a metric, as this measures true negatives, which are extremely frequent for this problem and thus yield very high and uninformative scores."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-166",
"text": "P@n gives the precision of the n highest prediction scores for each sample and is motivated by assessing the performance of the auto-coder model in an AI-assist workflow, where a human coder would hypothetically be provided with the top n predictions for each note."
},
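P@n as described above can be sketched as the fraction of correct codes among the n highest prediction scores for one sample:

```python
def precision_at_n(scores, gold, n):
    """Precision of the n highest-scoring codes for one sample.
    scores: per-code prediction scores; gold: per-code 0/1 ground truth."""
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:n]
    return sum(gold[j] for j in top) / n

# Top-2 codes by score are indices 0 and 2; only index 0 is correct.
p = precision_at_n([0.9, 0.1, 0.8, 0.2], [1, 0, 0, 1], 2)
```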
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-167",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-168",
"text": "**EXPERIMENTAL DETAILS**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-169",
"text": "We used PyTorch for building and training our models."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-170",
"text": "We train our base model (MVC-LDA) and our regularized model (MVC-RLDA) for the three sets (i.e., Dis-50, Dis and Full)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-171",
"text": "As our optimizer, we rely on Adam with learning rate of 0.001 and \u03b2 1 = 0.9, \u03b2 2 = 0.999, = 1e \u2212 8."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-172",
"text": "To optimize other hyperparameters, we use Hyperband (Li et al., 2017) , an algorithm for expediting random search on hyperparameters for machine learning models, making it 5 to 30 times faster than Bayesian optimization."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-173",
"text": "Hyperband requires specifying a resource to maximally exploit to find the best parameters; we set this to 27 training epochs, as well as an additional pruning parameter \u03b7, which we set to 3."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-174",
"text": "We chose to optimize three hyperparameters with Hyperband: the number of CNN filters, multi-view CNN kernel sizes, and regularization weight for MVC-RLDA."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-175",
"text": "Table 2 shows the hyperparameters we tried for our models and the selected value for each, optimized to maximize the micro F1 on the development set."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-176",
"text": "During training, we use a batch size of 4 samples, and if the sample length is higher than 10,000, we randomly select a segment with 10,000 words from the input."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-177",
"text": "During testing, we use the whole input."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-178",
"text": "We train all models with early stopping, using micro F1 on the development set as stopping criterion with a patience of 10 epochs."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-179",
"text": "We use a dropout of 0.2 in our models for reducing the chance of overfitting, with a pseudorandom seed before starting the experiments."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-181",
"text": "**EVALUATION RESULTS**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-182",
"text": "We evaluate the baseline models on the Dis set and Dis-50 sets and provide micro F1 scores for diagnosis and procedure codes for comparability with previous studies (Mullenbach et al., 2018; Perotte et al., 2013) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-183",
"text": "Table 3 provides the evaluation of our models (MVC-LDA and MVC-RLDA) and the four baselines."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-184",
"text": "6 When the models are trained and tested on Dis-50 set, our models outperform the previous studies in terms of micro and macro F1, P@5, and PR AUC."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-185",
"text": "This is due to the architecture differences, such as the use of multi-view CNN and 6 The authors of CAML (Mullenbach et al., 2018) reported micro F1-Proc=60.9%, micro F1-Diag=52.4%, micro F1=53.9%, P@8=70.9% and P@15=56.1% on the Dis set."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-186",
"text": "However, these results are for 8,921 codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-187",
"text": "better use of description by our model compared to the two baselines."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-188",
"text": "When the models are trained on all codes (Dis or Full sets), we have provided evaluations for all codes and also the test of 50 most frequent codes (i.e., the test set of Dis-50 and their corresponding Full sets)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-189",
"text": "The results demonstrate that the neural network models trained on Dis or Full sets outperform the models trained on Dis-50 set when evaluated on the top 50 codes in terms of micro and macro F1 and PR AUC."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-190",
"text": "These improvements may have at least two explanations: one, that the Dis-50 set includes less data, as any documents with no occurrences of any of the top 50 codes are not included; and two, that the larger models jointly learn to predict all codes within a single architecture, which can be thought of as a data-dependent regularization for the top 50 codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-191",
"text": "The flat and hierarchical SVMs performance are lower than all models on the top 50 codes and on all codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-192",
"text": "They use 10K unigram tf-idf features, and hence, they are subject to the typical limitations of bag-of-words features (no phrases, syntax, locality, etc.) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-193",
"text": "Hierarchical SVMs use the hierarchy of the codes in a form of a tree and utilize the dependency between the codes during training."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-194",
"text": "For each split of the tree, the parent SVM is trained by only the data for its children."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-195",
"text": "This increases the recall for the sparse classes as demonstrated by Perotte et al. (2013) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-196",
"text": "Therefore the hierarchical SVMs outperform flat SVMs, but their performance is still lower than the CNN-based models (i.e., CAML, MVC-LDA, and MVC-RLDA)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-197",
"text": "CAML, previously the best stateof-the-art model, outperforms both SVM based models in all metrics; our base model, MVC-LDA, outperforms CAML across the board."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-198",
"text": "This shows that the added multi-view CNN and length of text to our model are helpful."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-199",
"text": "On the Dis set, MVC-RLDA outperforms MVC-LDA, achieving the best performance in terms of all metrics except micro F1 on top 50."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-200",
"text": "This may be explained by the fact that regularization by definition of codes is more helpful for sparse classes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-201",
"text": "Therefore the top frequent codes may not benefit as much from this added feature as the sparse codes (further analysis is provided in the following sections)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-202",
"text": "Models trained on the Full set outperform their counterparts on the Dis set, with the overall best performance across all metrics achieved by MVC-RLDA on the Full set, except for PR AUC."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-203",
"text": "This shows that the added notes for each patient may have information which may not be present only in the discharge summaries and therefore are useful in learning the codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-204",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-205",
"text": "**ABLATION STUDY**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-206",
"text": "We perform an ablation study on our best model, removing certain components one at a time to gauge their respective contributions."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-207",
"text": "The components we study here are regularization, multiview CNN, the use of notes other than discharge summaries and conditioning the output layer on input length embedding (T (l))."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-208",
"text": "(For most components, we simply remove them individually; for the multi-view CNN, we replace the layer with a simple CNN with a kernel size of 12, which is the maximum kernel size in MVC-RLDA.) Figure 2 shows the the reduction amount of micro F1, PR AUC, and P@8 by removing each of the four components."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-209",
"text": "The first observation is that removing each component reduces all metrics, showing their importance to our best model."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-210",
"text": "Comparing the components with each other shows that length embedding has the lowest effect."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-211",
"text": "The use of regularization and a reliance on all available notes, on the other hand, are consistently beneficial."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-212",
"text": "The situation with multi-view CNN is more complex: for micro F1, it has the highest impact, while it has somewhat less on PR AUC and virtually none on P@8."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-213",
"text": "This difference can be explained by examining precision and recall: removing multi-view CNN decreases recall from 50.17% to 43.78%, although precision increases (62.97% to 68.63%)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-214",
"text": "PR AUC measures the overall performance of the model across different prediction thresholds; with higher precision and lower recall, the F1-optimal threshold for this model is actually a little lower than 0.5, hence the relatively poor performance at 0.5."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-215",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-216",
"text": "**EFFECT OF REGULARIZATION ON DIFFERENT LABELS**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-217",
"text": "In this section we examine micro F1 across different codes in terms of their occurrence frequency in the training set."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-218",
"text": "We limit this analysis to the codes which were both in the training set and in the testing set (i.e., 3956 codes)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-219",
"text": "We divide the range of the logarithm of the number of available training instances for the codes into 10 bins."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-220",
"text": "For each bin of available training samples, we plot the difference in micro F1 between MVC-RLDA and MVC-LDA on the test samples in Figure 3(b)."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-221",
"text": "First, the differences are all positive, showing that micro F1 improves consistently across all bins."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-222",
"text": "Second, as expected, the most dramatic improvements are achieved for the codes with the fewest training samples, and the least improvement occurs in the region with the highest number of training examples."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-223",
"text": "This was the region with the highest F1 score (Fig. 3(a)) and hence the least room for improvement."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-224",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-225",
"text": "**CONCLUSION**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-226",
"text": "In this paper we introduced MVC-RLDA, a model for medical code prediction composed of a stack of embeddings, multi-view CNNs with cross-channel max pooling shared across all codes, and separate spatial attention pooling for each code."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-227",
"text": "This model has the potential to flexibly capture the relationship between different n-grams and codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-228",
"text": "We further enhance this model by using the label descriptions to regularize the attention weights, mitigating overfitting, especially for classes with few training examples."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-229",
"text": "We also demonstrate the advantage of using other notes aside from the discharge summaries."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-230",
"text": "Our model surpasses the previous state-of-the-art model on the MIMIC III dataset, providing more accurate predictions according to numerous metrics."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-231",
"text": "We also presented a detailed analysis of the results to highlight the contributions of our innovations to the achieved results."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-232",
"text": "The simplest among these was to use all available text in addition to the discharge summary, simply concatenating all relevant notes into each input."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-233",
"text": "It is worth exploring more nuanced approaches for integrating other notes into the input, as not all notes may be equally important."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-234",
"text": "Other modifications may yield further improvements."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-235",
"text": "For instance, we trained the model on all ground-truth codes equally, similarly to previous approaches (Baumel et al., 2018; Wang et al., 2018; Mullenbach et al., 2018; Perotte et al., 2013) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-236",
"text": "However, medical codes are ordered according to their importance."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-237",
"text": "It is worth exploring approaches which take the rank of labels into account."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-238",
"text": "Furthermore, devising models which incorporate the hierarchical knowledge of the codes can be helpful."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-239",
"text": "Finally, it will be important to test our model in an AI-assisted workflow to see how automated predictions can expedite human coding."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-3",
"text": "This is a difficult natural language processing task that requires parsing long, heterogeneous documents and selecting a set of appropriate codes from tens of thousands of possibilities, many of which have very few positive training samples."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-4",
"text": "We present a deep learning system that advances the state of the art for the MIMIC-III dataset, achieving a new best micro F1-measure of 55.85%, significantly outperforming the previous best result (Mullenbach et al., 2018) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-5",
"text": "We achieve this through a number of enhancements, including two major novel contributions: multi-view convolutional channels, which effectively learn to adjust kernel sizes throughout the input; and attention regularization, mediated by natural-language code descriptions, which helps overcome sparsity for thousands of uncommon codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-6",
"text": "These and other modifications are selected to address difficulties inherent to both automated coding specifically and deep learning generally."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-7",
"text": "Finally, we investigate our accuracy results in detail to individually measure the impact of these contributions and point the way towards future algorithmic improvements."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-10",
"text": "Coding medical reports is the standard method used by health care institutions for summarizing patients' diagnoses and the procedures performed on them."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-11",
"text": "Among other things, medical codes are used for billing, epidemiology assessment, cohort identification, and quality control of health care providers."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-12",
"text": "Assignment of standardized codes, though valuable, is a difficult task even for human coders."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-13",
"text": "Part of this challenge is in the sheer number of codes: the United States version of the Ninth Revision of the International Classification of Diseases (ICD-9-CM), for instance, contains approximately 18,000 procedure and diagnosis codes; the current 10th Revision (ICD-10) includes even more codes, approximately 171,000."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-14",
"text": "Additionally, the amount of data to be processed is substantial: coding inpatient charts typically involves reviewing multiple notes such as discharge summaries, progress notes, operative notes, and physician, nurse, or attendee notes; any of these notes may individually provide evidence for the specificity of one or more codes."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-15",
"text": "Code sets are often subject to annual revision, making constant re-training and feedback to coders necessary."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-16",
"text": "Compounding the task's general difficulty, there is a degree of subjectivity in coding which can result in discrepancies even between well-trained, highly accurate coders (Farkas and Szarvas, 2008) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-17",
"text": "All of these factors contribute to increased errors by human coders."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-18",
"text": "Automated coding or AI-assisted coding approaches can reduce the time and effort humans spend annotating reports and, even more importantly, potentially reduce their errors."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-19",
"text": "Given enough annotated data, machine learning algorithms trained on data annotated by multiple coders can dilute the subjectivity of individual judgments, and hence reduce subjectivity error (Farkas and Szarvas, 2008) ."
},
{
"sent_id": "a0bd41c3653073dd79e19d3ddc8d14-C001-20",
"text": "However, automated coding also shares many of the aforementioned challenges."
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-4"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"a0bd41c3653073dd79e19d3ddc8d14-C001-47"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-184",
"a0bd41c3653073dd79e19d3ddc8d14-C001-185"
]
],
"cite_sentences": [
"a0bd41c3653073dd79e19d3ddc8d14-C001-4",
"a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"a0bd41c3653073dd79e19d3ddc8d14-C001-185"
]
},
"@EXT@": {
"gold_contexts": [
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-4",
"a0bd41c3653073dd79e19d3ddc8d14-C001-5"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"a0bd41c3653073dd79e19d3ddc8d14-C001-47"
]
],
"cite_sentences": [
"a0bd41c3653073dd79e19d3ddc8d14-C001-4",
"a0bd41c3653073dd79e19d3ddc8d14-C001-45"
]
},
"@BACK@": {
"gold_contexts": [
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-40"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-148"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-154"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-155"
]
],
"cite_sentences": [
"a0bd41c3653073dd79e19d3ddc8d14-C001-40",
"a0bd41c3653073dd79e19d3ddc8d14-C001-148",
"a0bd41c3653073dd79e19d3ddc8d14-C001-154",
"a0bd41c3653073dd79e19d3ddc8d14-C001-155"
]
},
"@SIM@": {
"gold_contexts": [
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-45"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-58"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-59"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-182"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-235"
]
],
"cite_sentences": [
"a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"a0bd41c3653073dd79e19d3ddc8d14-C001-58",
"a0bd41c3653073dd79e19d3ddc8d14-C001-59",
"a0bd41c3653073dd79e19d3ddc8d14-C001-182",
"a0bd41c3653073dd79e19d3ddc8d14-C001-235"
]
},
"@USE@": {
"gold_contexts": [
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"a0bd41c3653073dd79e19d3ddc8d14-C001-46"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-61"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-141"
],
[
"a0bd41c3653073dd79e19d3ddc8d14-C001-148",
"a0bd41c3653073dd79e19d3ddc8d14-C001-150"
]
],
"cite_sentences": [
"a0bd41c3653073dd79e19d3ddc8d14-C001-45",
"a0bd41c3653073dd79e19d3ddc8d14-C001-61",
"a0bd41c3653073dd79e19d3ddc8d14-C001-141",
"a0bd41c3653073dd79e19d3ddc8d14-C001-148"
]
}
}
},
"ABC_e5ef75cd497dd94b4cf818291707df_8": {
"x": [
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-2",
"text": "We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-3",
"text": "Based on an extension of the incremental joint model for POS tagging and dependency parsing (Hatori et al., 2011) , we propose an efficient character-based decoding method that can combine features from state-of-the-art segmentation, POS tagging, and dependency parsing models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-4",
"text": "We also describe our method to align comparable states in the beam, and how we can combine features of different characteristics in our incremental framework."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-5",
"text": "In experiments using the Chinese Treebank (CTB), we show that the accuracies of the three tasks can be improved significantly over the baseline models, particularly by 0.6% for POS tagging and 2.4% for dependency parsing."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-6",
"text": "We also perform comparison experiments with the partially joint models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-9",
"text": "In processing natural languages that do not include delimiters (e.g. spaces) between words, word segmentation is the crucial first step that is necessary to perform virtually all NLP tasks."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-10",
"text": "Furthermore, the word-level information is often augmented with the POS tags, which, along with segmentation, form the basic foundation of statistical NLP."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-11",
"text": "Because the tasks of word segmentation and POS tagging have strong interactions, many studies have been devoted to the task of joint word segmentation and POS tagging for languages such as Chinese (e.g. Kruengkrai et al. (2009) )."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-12",
"text": "This is because some of the segmentation ambiguities cannot be resolved without considering the surrounding grammatical constructions encoded in a sequence of POS tags."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-13",
"text": "The joint approach to word segmentation and POS tagging has been reported to improve word segmentation and POS tagging accuracies by more than 1% in Chinese (Zhang and Clark, 2008) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-14",
"text": "In addition, some researchers recently proposed a joint approach to Chinese POS tagging and dependency parsing (Li et al., 2011; Hatori et al., 2011) ; particularly, Hatori et al. (2011) proposed an incremental approach to this joint task, and showed that the joint approach improves the accuracies of these two tasks."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-15",
"text": "In this context, it is natural to consider further a question regarding the joint framework: how strongly do the tasks of word segmentation and dependency parsing interact?"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-16",
"text": "In the following Chinese sentences:"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-17",
"text": "S\u00ca sV s \u00f8s current peace-prize and peace operation related"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-18",
"text": "The current peace prize and peace operations are related."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-19",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-20",
"text": "**S\u00ca S V S \u00d8S \u00c2S CURRENT PEACE AWARD PEACE OPERATION RELATED GROUP**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-21",
"text": "The current peace is awarded to peace-operation-related groups."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-22",
"text": "the only difference is the existence of the last word \u00e2S; however, whether or not this word exists changes the whole syntactic structure and segmentation of the sentence."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-23",
"text": "This is an example in which word segmentation cannot be handled properly without considering long-range syntactic information."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-24",
"text": "Syntactic information is also considered beneficial for improving the segmentation of out-of-vocabulary (OOV) words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-25",
"text": "Unlike languages such as Japanese that use a distinct character set (i.e. katakana) for foreign words, the transliterated words in Chinese, many of which are OOV words, frequently include characters that are also used as common or function words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-26",
"text": "In the current systems, the existence of these characters causes numerous oversegmentation errors for OOV words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-27",
"text": "Based on these observations, we aim at building a joint model that simultaneously processes word segmentation, POS tagging, and dependency parsing, trying to capture global interaction among these three tasks."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-28",
"text": "To handle the increased computational complexity, we adopt the incremental parsing framework with dynamic programming (Huang and Sagae, 2010) , and propose an efficient method of character-based decoding over candidate structures."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-29",
"text": "Two major challenges exist in formalizing the joint segmentation and dependency parsing task in the character-based incremental framework."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-30",
"text": "First, we must address the problem of how to align comparable states effectively in the beam."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-91",
"text": "**FEATURES**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-31",
"text": "Because the number of dependency arcs varies depending on how words are segmented, we devise a step alignment scheme using the number of character-based arcs, which enables effective joint decoding for the three tasks."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-32",
"text": "Second, although the feature set is fundamentally a combination of those used in previous works (Zhang and Clark, 2010; Huang and Sagae, 2010) , integrating them into a single incremental framework is not straightforward."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-33",
"text": "Because we must perform decisions of three kinds (segmentation, tagging, and parsing) in an incremental framework, we must adjust which features are to be activated when, and how they are combined with which action labels."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-34",
"text": "We have also found that we must balance the learning rate between features for segmentation and tagging decisions, and those for dependency parsing."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-35",
"text": "We perform experiments using the Chinese Treebank (CTB) corpora, demonstrating that the accuracies of the three tasks can be improved significantly over the pipeline combination of the state-of-the-art joint segmentation and POS tagging model, and the dependency parser."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-36",
"text": "We also perform comparison experiments with partially joint models, and investigate the tradeoff between the running speed and the model performance."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-38",
"text": "**RELATED WORKS**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-39",
"text": "In Chinese, Luo (2003) proposed a joint constituency parser that performs segmentation, POS tagging, and parsing within a single character-based framework."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-40",
"text": "They reported that the POS tags contribute to segmentation accuracies by more than 1%, but the syntactic information has no substantial effect on the segmentation accuracies."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-41",
"text": "In contrast, we built a joint model based on a dependency-based framework, with a rich set of structural features."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-42",
"text": "Using it, we show the first positive result in Chinese that the segmentation accuracies can be improved using the syntactic information."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-43",
"text": "Another line of work exists on lattice-based parsing for Semitic languages (Cohen and Smith, 2007; Goldberg and Tsarfaty, 2008) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-44",
"text": "These methods first convert an input sentence into a lattice encoding the morphological ambiguities, and then conduct joint morphological segmentation and PCFG parsing."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-45",
"text": "However, the segmentation possibilities considered in those studies are limited to those output by an existing morphological analyzer."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-46",
"text": "In addition, the lattice does not include word segmentation ambiguities crossing boundaries of space-delimited tokens."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-47",
"text": "In contrast, because the Chinese language does not have spaces between words, we fundamentally need to consider the lattice structure of the whole sentence."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-48",
"text": "Therefore, we place no restriction on the segmentation possibilities to consider, and we assess the full potential of the joint segmentation and dependency parsing model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-49",
"text": "Among the many recent works on joint segmentation and POS tagging for Chinese, the linear-time incremental models by Zhang and Clark (2008) and Zhang and Clark (2010) largely inspired our model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-50",
"text": "Zhang and Clark (2008) proposed an incremental joint segmentation and POS tagging model, with an effective feature set for Chinese."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-51",
"text": "However, it requires computationally expensive multiple beams to compare words of different lengths during beam search."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-52",
"text": "More recently, Zhang and Clark (2010) proposed an efficient character-based decoder for their word-based model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-53",
"text": "In their new model, a single beam suffices for decoding; hence, they reported that their model is practically ten times as fast as their original model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-54",
"text": "To incorporate the word-level features into the character-based decoder, the features are decomposed into substring-level features, which enable incomplete words to have scores comparable to complete words in the beam."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-55",
"text": "Because we found that even an incremental approach with beam search is intractable if we perform word-based decoding, we take a character-based approach to build our joint model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-56",
"text": "The incremental framework of our model is based on the joint POS tagging and dependency parsing model for Chinese (Hatori et al., 2011) , which is an extension of the shift-reduce dependency parser with dynamic programming (Huang and Sagae, 2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-57",
"text": "They specifically modified the shift action so that it assigns the POS tag when a word is shifted onto the stack."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-58",
"text": "However, because they regarded word segmentation as given, their model did not consider the interaction between segmentation and POS tagging."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-59",
"text": "Based on the joint POS tagging and dependency parsing model by Hatori et al. (2011) , we build our joint model to solve word segmentation, POS tagging, and dependency parsing within a single framework."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-60",
"text": "Particularly, we change the role of the shift action and additionally use the append action, inspired by the character-based actions used in the joint segmentation and POS tagging model by Zhang and Clark (2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-61",
"text": "The list of actions used is the following:"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-62",
"text": "\u2022 A: append the first character in the queue to the word on top of the stack."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-63",
"text": "\u2022 SH(t): shift the first character in the input queue as a new word onto the stack, with POS tag t."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-64",
"text": "\u2022 RL/RR: reduce the top two trees on the stack into a subtree, adding a left/right dependency arc between their roots."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-65",
"text": "Although SH(t) is similar to the one used in Hatori et al. (2011) , now it shifts the first character in the queue as a new word, instead of shifting a word."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-66",
"text": "Following Zhang and Clark (2010) , the POS tag is assigned to the word when its first character is shifted, and the word-tag pairs observed in the training data and the closed-set tags (Xia, 2000) are used to prune unlikely derivations."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-67",
"text": "Because 33 tags are defined in the CTB tag set (Xia, 2000) , our model exploits a total of 36 actions."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-68",
"text": "To train the model, we use the averaged perceptron with the early update (Collins and Roark, 2004) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-69",
"text": "In our joint model, the early update is invoked by mistakes in any of word segmentation, POS tagging, or dependency parsing."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-71",
"text": "**ALIGNMENT OF STATES**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-72",
"text": "When dependency parsing is integrated into the task of joint word segmentation and POS tagging, it is not straightforward to define a scheme to align (synchronize) the states in the beam."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-73",
"text": "In beam search, we use the step index associated with each state: the parser states in progress are aligned according to the index, and beam search pruning is applied to those states with the same index."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-74",
"text": "Consequently, for the beam search to function effectively, all states with the same index must be comparable, and all terminal states should have the same step index."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-75",
"text": "We can first think of using the number of shifted characters as the step index, as Zhang and Clark (2010) does."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-76",
"text": "However, because RL/RR actions can be performed without incrementing the step index, the decoder tends to prefer states with more dependency arcs, more likely resulting in premature 'reduce' actions or oversegmentation of words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-77",
"text": "Alternatively, we can consider using the number of actions that have been applied as the step index, as Hatori et al. (2011) does."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-78",
"text": "However, this results in inconsistent numbers of actions to reach the terminal states: some states that segment words into larger chunks reach a terminal state earlier than other states with smaller chunks."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-79",
"text": "For these reasons, we have found that both approaches yield poor models that are not at all competitive with the baseline (pipeline) models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-80",
"text": "To address this issue, we propose an indexing scheme using the number of character-based arcs."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-81",
"text": "We presume that in addition to the word-to-word dependency arcs, each word of length M implicitly has M \u2212 1 inter-character arcs connecting its consecutive characters."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-82",
"text": "Then we can define the step index as the sum of the number of shifted characters and the total number of (inter-word and intra-word) dependency arcs, which thereby meets all the following conditions:"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-83",
"text": "(1) All subtrees spanning M consecutive characters have the same index 2M \u2212 1. Note that the number of shifted characters is also necessary to meet condition (3)."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-84",
"text": "Otherwise, it allows an unlimited number of SH(t) actions without incrementing the step index."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-85",
"text": "Figure 1 portrays how the states are aligned using the proposed scheme, where a subtree is denoted as a rectangle with its partial index shown inside it."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-86",
"text": "In our framework, because an action increases the step index by 1 (for SH(t) or RL/RR) or 2 (for A), we need to use two beams to store new states at each step."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-87",
"text": "The computational complexity of the entire process is O(B(T + 3) \u00b7 2N ), where B is the beam size, T is the number of POS tags (= 33), and N is the number of characters in the sentence."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-88",
"text": "Theoretically, the computational time is greater than that with the character-based joint segmentation and tagging model by Zhang and Clark (2010) by a factor of"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-89",
"text": "2.1, when the same beam size is used."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-92",
"text": "The feature set of our model is fundamentally a combination of the features used in the state-of-the-art joint segmentation and POS tagging model (Zhang and Clark, 2010) and dependency parser (Huang and Sagae, 2010) , both of which are used as baseline models in our experiment."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-93",
"text": "However, because we intend to perform three tasks in a single incremental framework, we must carefully adjust which features are to be activated and when, and how they are combined with which action labels, depending on the type of each feature."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-94",
"text": "The list of the features used in our joint model is presented in Table 1 , where S01-S05, W01-W21, and T01-T05 are taken from Zhang and Clark (2010), and P01-P28 are taken from Huang and Sagae (2010)."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-95",
"text": "Note that not all features are always considered: each feature is only considered if the action to be performed is included in the list of actions in the \"When to apply\" column."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-96",
"text": "Because S01-S05 are used to represent the likelihood score of substring sequences, they are only used for A and SH(t) without being combined with any action label."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-97",
"text": "Because T01-T05 are used to determine the POS tag of the word being shifted, they are only applied for SH(t)."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-98",
"text": "Because W01-W21 are used to determine whether to segment at the current position or not, they are only used for those actions involved in boundary determination decisions (A, SH(t), RL0, and RR0)."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-99",
"text": "The action labels RL0/RR0 are used to denote the 'reduce' actions that determine the word boundary, whereas RL1/RR1 denote those 'reduce' actions that are applied when the word boundary has already been fixed."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-100",
"text": "In addition, to capture the shared nature of boundary determination actions (SH(t), RL0/RR0), we use a generalized action label SH' to represent any of them when combined with W01-W21."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-101",
"text": "We also propose to use the features U01-U03, which we found effective for adjusting the character-level and substring-level scores."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-102",
"text": "Regarding the parsing features P01-P28, because we found that P01-P17 are also useful for segmentation decisions, these features are applied to all actions including A, with an explicit distinction of the action labels RL0/RR0 from RL1/RR1."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-103",
"text": "On the other hand, P18-P28 are only used when one of the parser actions (SH(t), RL, or RR) is applied."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-104",
"text": "Note that P07-P09 and P18-P21 (look-ahead features) require the look-ahead information of the next word form and POS tags, which cannot be incorporated straightforwardly in an incremental framework."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-105",
"text": "Although we have found that these features can be incorporated using the delayed features proposed by Hatori et al. (2011), we did not use them in our current model because they result in a significant increase in computational time."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-107",
"text": "**DICTIONARY FEATURES**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-108",
"text": "Because segmentation using a dictionary alone can serve as a strong baseline in Chinese word segmentation (Sproat et al., 1996), the use of dictionaries is expected to make our joint model more robust and to enable us to investigate the contribution of the syntactic dependency in a more realistic setting."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-109",
"text": "Therefore, we optionally use four features D01-D04 associated with external dictionaries."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-132",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-110",
"text": "These features distinguish each dictionary source, reflecting the fact that different dictionaries have different characteristics."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-111",
"text": "These features will also be used in our reimplementation of the model by Zhang and Clark (2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-113",
"text": "**ADJUSTING THE LEARNING RATE OF FEATURES**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-114",
"text": "In formulating the three tasks in the incremental framework, we found that adjusting the update rate depending on the type of the features (segmentation/tagging vs. parsing) crucially impacts the final performance of the model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-115",
"text": "To investigate this point, we define the feature vector \u03c6 and score \u03a6 of the * q\u22121 and q\u22122 respectively denote the last-shifted word and the word shifted before q\u22121."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-116",
"text": "q.w and q.t respectively denote the (root) word form and POS tag of a subtree (word) q, and q.b and q.e the beginning and ending characters of q.w. c0 and c1 are the first and second characters in the queue."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-117",
"text": "q.w\\e denotes the set of characters excluding the ending character of q.w."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-118",
"text": "len(\u00b7) denotes the length of the word, capped at 16 if longer."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-119",
"text": "cat(\u00b7) denotes the category of the character, which is the set of POS tags observed in the training data."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-120",
"text": "Di is a dictionary, a set of words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-121",
"text": "The action label \u03c6 means that the feature is not combined with any label; \"as-is\" denotes the use of the default action set \"A, SH(t), and RR/RL\" as is."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-122",
"text": "action a being applied to the state \u03c8 as"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-123",
"text": "where \u03c6 st corresponds to the segmentation and tagging features (those starting with 'U', 'S', 'T', or 'D'), and \u03c6 p is the set of the parsing features (starting with 'P')."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-124",
"text": "Then, if we set \u03c3 p to a number smaller than 1, perceptron updates for the parsing features will be kept small at the early stage of training because the update is proportional to the values of the feature vector."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-125",
"text": "However, even if \u03c3 p is initially small, the global weights for the parsing features will increase as needed and compensate for the small \u03c3 p as the training proceeds."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-126",
"text": "In this way, we can control the contribution of syntactic dependencies at the early stage of training."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-127",
"text": "Section 4.3 shows that the best setting we found is \u03c3 p = 0.5: this result suggests that we should probably resolve remaining errors by preferentially using the local n-gram-based features at the early stage of training."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-128",
"text": "Otherwise, the premature incorporation of the non-local syntactic dependencies might engender overfitting to the training data."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-130",
"text": "**EXPERIMENT**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-133",
"text": "We use the Chinese Penn Treebank ver."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-134",
"text": "5.1, 6.0, and 7.0 (hereinafter CTB-5, CTB-6, and CTB-7) for evaluation."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-135",
"text": "These corpora are split into training, development, and test sets, following previous work."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-136",
"text": "For CTB-5, we refer to the split by Duan et al. (2007) as CTB-5d, and to the split by Jiang et al. (2008) as CTB-5j."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-137",
"text": "We also prepare a dataset for cross validation: the dataset CTB-5c consists of sentences from CTB-5 excluding the development and test sets of CTB-5d and CTB-5j."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-138",
"text": "We split CTB-5c into five sets (CTB-5c-n), and in turn use four of these as the training set and the remaining one as the test set."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-139",
"text": "CTB-6 is split according to the official split described in the documentation, and CTB-7 is split according to Wang et al. (2011) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-140",
"text": "The statistics of these splits are shown in Table 2 ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-141",
"text": "As external dictionaries, we use the HowNet Word List, consisting of 91,015 words, and page names from the Chinese Wikipedia as of Oct 26, 2011, consisting of 709,352 words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-142",
"text": "These dictionaries only consist of word forms with no frequency or POS information."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-143",
"text": "We use the standard measures of word-level precision, recall, and F1 score to evaluate each task."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-144",
"text": "An output dependency cannot be correct unless both the syntactic head and the dependent of the dependency relation are segmented correctly."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-145",
"text": "Following the standard setting in dependency parsing work, we evaluate the task of dependency parsing with unlabeled attachment scores, excluding punctuation."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-146",
"text": "Statistical significance is tested by McNemar's test (\u2020: p < 0.05, \u2021: p < 0.01)."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-148",
"text": "**BASELINE AND PROPOSED MODELS**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-149",
"text": "We use the following baseline and proposed models for evaluation."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-150",
"text": "\u2022 SegTag: our reimplementation of the joint segmentation and POS tagging model by Zhang and Clark (2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-151",
"text": "Table 5 shows that this reimplementation almost reproduces the accuracy of their implementation."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-152",
"text": "We used a beam size of 16, which they reported to achieve the best accuracies."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-153",
"text": "\u2022 Dep': the state-of-the-art dependency parser by Huang and Sagae (2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-154",
"text": "We used our reimplementation, which is used in Hatori et al. (2011) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-155",
"text": "\u2022 Dep: Dep' without look-ahead features."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-156",
"text": "\u2022 TagDep: the joint POS tagging and dependency parsing model (Hatori et al., 2011) , where the look-ahead features are omitted."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-157",
"text": "tagging (Zhang and Clark, 2008; Zhang and Clark, 2010) and dependency parsing (Huang and Sagae, 2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-158",
"text": "Therefore, we can investigate the contribution of the joint approach through comparison with the pipeline and joint models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-160",
"text": "**DEVELOPMENT RESULTS**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-161",
"text": "We have several parameters to tune: the parsing feature weight \u03c3 p, the beam size, and the number of training epochs."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-162",
"text": "All these parameters are set based on experiments on CTB-5c."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-163",
"text": "For experiments on CTB-5j, CTB-6, and CTB-7, the number of training epochs is set using the development set."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-164",
"text": "Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-5c-1 with respect to the training epoch and different parsing feature weights, where \"Seg\", \"Tag\", and \"Dep\" respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-165",
"text": "In this experiment, the external dictionaries are not used, and a beam size of 32 is used."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-166",
"text": "Interestingly, if we simply set \u03c3 p to 1, the accuracies seem to converge at lower levels."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-167",
"text": "The \u03c3 p = 0.2 setting seems to reach segmentation and tagging accuracies almost identical to those of the best setting \u03c3 p = 0.5, but converges more slowly."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-168",
"text": "Based on this experiment, we set \u03c3 p to 0.5 throughout the experiments in this paper."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-169",
"text": "Table 3 shows the performance and speed of the full joint model (with no dictionaries) on CTB-5c-1 with respect to the beam size."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-170",
"text": "Although even a beam size of 32 results in competitive accuracies for word segmentation and POS tagging, the dependency accuracy is most affected by the increase in beam size."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-171",
"text": "Based on this experiment, we set the beam size of SegTagDep to 64 throughout the experiments in this paper, unless otherwise noted."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-173",
"text": "**MAIN RESULTS**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-174",
"text": "In this section, we present experimental results for the proposed and baseline models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-175",
"text": "Table 4 shows the segmentation, POS tagging, and dependency parsing F1 scores of these models on CTB-5c."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-176",
"text": "Irrespective of the use of the dictionary features, the joint model SegTagDep substantially improves the POS tagging and dependency parsing accuracies (by 0.56-0.63% and 2.34-2.44%, respectively); the improvements in parsing accuracies remain significant even compared with SegTag+Dep' (the pipeline model with the look-ahead features)."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-177",
"text": "However, when the external dictionaries are not used (\"wo/dict\"), no substantial improvement in segmentation accuracy is observed."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-178",
"text": "In contrast, when the dictionaries are used (\"w/dict\"), the segmentation accuracies are consistently improved (on every trial) over the baseline model SegTag."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-179",
"text": "Although the overall improvement in segmentation is only around 0.1%, an improvement of more than 1% is observed if we specifically examine OOV words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-180",
"text": "The difference between the \"wo/dict\" and \"w/dict\" results suggests that the syntactic dependencies might act as noise when the segmentation model is insufficiently stable, but that once the model is stable, it does improve, without negative effects from the syntactic dependencies."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-181",
"text": "The partially joint model SegTag+TagDep is shown to perform reasonably well in dependency parsing: with dictionaries, it achieved a 2.02% improvement over SegTag+Dep, which is only 0.32% lower than SegTagDep."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-182",
"text": "However, whereas SegTag+TagDep showed no substantial improvement in tagging accuracy over SegTag (when the dictionaries are used), SegTagDep achieved consistent improvements of 0.46% and 0.58% (without/with dictionaries); these differences can be attributed to the combination of reduced error propagation and the incorporation of the syntactic dependencies."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-183",
"text": "In addition, SegTag+TagDep has OOV tagging accuracies consistently lower than SegTag, suggesting that the syntactic dependency has a negative effect on the POS tagging accuracy of OOV words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-184",
"text": "In contrast, this negative effect is not observed for SegTagDep: both the overall tagging accuracy and the OOV accuracy are improved, demonstrating the effectiveness of the proposed model."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-185",
"text": "Figure 3 shows the performance and processing time comparison of various models and their combinations."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-186",
"text": "Although SegTagDep takes several times longer to achieve accuracies comparable to those of SegTag+Dep/TagDep, it seems to have potential for greater improvement, especially in tagging and parsing accuracies, when a larger beam can be used."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-187",
"text": "Table 5 and Table 6 show a comparison of the segmentation and POS tagging accuracies with other state-of-the-art models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-188",
"text": "\"Kruengkrai+ '09\" is a lattice-based model by Kruengkrai et al. (2009) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-189",
"text": "\"Zhang '10\" is the incremental model by Zhang and Clark (2010) ."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-190",
"text": "These two systems use no external resources other than the CTB corpora."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-191",
"text": "\"Sun+ '11\" is a CRF-based model (Sun, 2011) that uses a combination of several models, with a dictionary of idioms. \"Wang+ '11\" is a semi-supervised model by Wang et al. (2011), which additionally uses the Chinese Gigaword Corpus."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-192",
"text": "Our models with dictionaries (those marked with '(d)') achieve accuracies competitive with other state-of-the-art systems, and SegTagDep(d) achieved the best reported segmentation and POS tagging accuracies, using no additional corpora other than the dictionaries."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-193",
"text": "Particularly, the POS tagging accuracy is more than 0.4% higher than that of the previous best system, thanks to the contribution of syntactic dependencies."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-194",
"text": "These results also suggest that the use of readily available dictionaries can be more effective than semi-supervised approaches."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-196",
"text": "**COMPARISON WITH OTHER SYSTEMS**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-197",
"text": "----------------------------------"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-198",
"text": "**CONCLUSION**"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-199",
"text": "In this paper, we proposed the first joint model for word segmentation, POS tagging, and dependency parsing in Chinese."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-200",
"text": "The model demonstrated substantial improvements on the three tasks over the pipeline combination of the state-of-the-art joint segmentation and POS tagging model, and dependency parser."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-201",
"text": "Particularly, results showed that the accuracies of POS tagging and dependency parsing were remarkably improved by 0.6% and 2.4%, respectively, corresponding to 8.3% and 10.2% error reductions. (Table 6: Final results on CTB-6 and CTB-7.)"
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-202",
"text": "For word segmentation, although the overall improvement was only around 0.1%, improvements of greater than 1% were observed for OOV words."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-203",
"text": "We conducted some comparison experiments of the partially joint and full joint models."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-204",
"text": "Compared to SegTagDep, SegTag+TagDep performs reasonably well in terms of dependency parsing accuracy, whereas its POS tagging accuracies are more than 0.5% lower."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-205",
"text": "In future work, probabilistic pruning techniques, such as one based on a maximum entropy model, are expected to further improve the efficiency of the joint model, because the accuracies apparently still improve when a larger beam can be used."
},
{
"sent_id": "e5ef75cd497dd94b4cf818291707df-C001-206",
"text": "More efficient decoding would also allow the use of the look-ahead features (Hatori et al., 2011) and richer parsing features (Zhang and Nivre, 2011) ."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"e5ef75cd497dd94b4cf818291707df-C001-32"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-49"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-60"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-66"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-75"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-92"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-94"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-111"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-149",
"e5ef75cd497dd94b4cf818291707df-C001-150",
"e5ef75cd497dd94b4cf818291707df-C001-157"
],
[
"e5ef75cd497dd94b4cf818291707df-C001-187",
"e5ef75cd497dd94b4cf818291707df-C001-189"
]
],
"cite_sentences": [
"e5ef75cd497dd94b4cf818291707df-C001-32",
"e5ef75cd497dd94b4cf818291707df-C001-49",
"e5ef75cd497dd94b4cf818291707df-C001-60",
"e5ef75cd497dd94b4cf818291707df-C001-66",
"e5ef75cd497dd94b4cf818291707df-C001-75",
"e5ef75cd497dd94b4cf818291707df-C001-92",
"e5ef75cd497dd94b4cf818291707df-C001-94",
"e5ef75cd497dd94b4cf818291707df-C001-111",
"e5ef75cd497dd94b4cf818291707df-C001-150",
"e5ef75cd497dd94b4cf818291707df-C001-157",
"e5ef75cd497dd94b4cf818291707df-C001-189"
]
},
"@SIM@": {
"gold_contexts": [
[
"e5ef75cd497dd94b4cf818291707df-C001-49"
]
],
"cite_sentences": [
"e5ef75cd497dd94b4cf818291707df-C001-49"
]
},
"@BACK@": {
"gold_contexts": [
[
"e5ef75cd497dd94b4cf818291707df-C001-52"
]
],
"cite_sentences": [
"e5ef75cd497dd94b4cf818291707df-C001-52"
]
},
"@DIF@": {
"gold_contexts": [
[
"e5ef75cd497dd94b4cf818291707df-C001-86",
"e5ef75cd497dd94b4cf818291707df-C001-88",
"e5ef75cd497dd94b4cf818291707df-C001-89"
]
],
"cite_sentences": [
"e5ef75cd497dd94b4cf818291707df-C001-88"
]
}
}
},
"ABC_a0730efd9575800ba779516af1f440_8": {
"x": [
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-2",
"text": "We propose a simple unsupervised approach to detecting non-compositional components in multiword expressions based on Wiktionary."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-3",
"text": "The approach makes use of the definitions, synonyms and translations in Wiktionary, and is applicable to any type of MWE in any language, assuming the MWE is contained in Wiktionary."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-4",
"text": "Our experiments show that the proposed approach achieves higher F-score than state-of-the-art methods."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-7",
"text": "A multiword expression (MWE) is a combination of words with lexical, syntactic or semantic idiosyncrasy (Sag et al., 2002; Baldwin and Kim, 2009 )."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-8",
"text": "An MWE is considered (semantically) \"non-compositional\" when its meaning is not predictable from the meaning of its components."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-9",
"text": "Conversely, compositional MWEs are those whose meaning is predictable from the meaning of the components."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-10",
"text": "Based on this definition, a component is compositional within an MWE, if its meaning is reflected in the meaning of the MWE, and it is non-compositional otherwise."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-11",
"text": "Understanding which components are noncompositional within an MWE is important in NLP applications in which semantic information is required."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-12",
"text": "For example, when searching for spelling bee, we may also be interested in documents about spelling, but not those which contain only bee."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-13",
"text": "For research project, on the other hand, we are likely to be interested in documents which contain either research or project in isolation, and for swan song, we are only going to be interested in documents which contain the phrase swan song, and not just swan or song."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-14",
"text": "In this paper, we propose an unsupervised approach based on Wikitionary for predicting which components of a given MWE have a compositional usage."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-15",
"text": "Experiments over two widely-used datasets show that our approach outperforms stateof-the-art methods."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-17",
"text": "**RELATED WORK**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-18",
"text": "Previous studies which have considered MWE compositionality have focused on either the identification of non-compositional MWE token instances (Kim and Baldwin, 2007; Fazly et al., 2009; Forthergill and Baldwin, 2011; Muzny and Zettlemoyer, 2013) , or the prediction of the compositionality of MWE types (Reddy et al., 2011; Salehi and Cook, 2013; Salehi et al., 2014) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-19",
"text": "The identification of non-compositional MWE tokens is an important task when a word combination such as kick the bucket or saw logs is ambiguous between a compositional (generally non-MWE) and non-compositional MWE usage."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-20",
"text": "Approaches have ranged from the unsupervised learning of type-level preferences (Fazly et al., 2009 ) to supervised methods specific to particular MWE constructions (Kim and Baldwin, 2007) or applicable across multiple constructions using features similar to those used in all-words word sense disambiguation (Forthergill and Baldwin, 2011; Muzny and Zettlemoyer, 2013) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-21",
"text": "The prediction of the compositionality of MWE types has traditionally been couched as a binary classification task (compositional or non-compositional: Baldwin et al. (2003) , Bannard (2006) ), but more recent work has moved towards a regression setup, where the degree of the compositionality is predicted on a continuous scale (Reddy et al., 2011; Salehi and Cook, 2013; Salehi et al., 2014) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-22",
"text": "In either case, the modelling has been done either over the whole MWE (Reddy et al., 2011; Salehi and Cook, 2013) , or relative to each component within the MWE (Baldwin et al., 2003; Bannard, 2006) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-23",
"text": "In this paper, we focus on the binary classification of MWE types relative to each component of the"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-24",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-25",
"text": "**MWE.**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-26",
"text": "The work that is perhaps most closely related to this paper is that of Salehi and Cook (2013) and Salehi et al. (2014) , who use translation data to predict the compositionality of a given MWE relative to each of its components, and then combine those scores to derive an overall compositionality score."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-27",
"text": "In both cases, translations of the MWE and its components are sourced from PanLex (Baldwin et al., 2010; Kamholz et al., 2014) , and if there is greater similarity between the translated components and MWE in a range of languages, the MWE is predicted to be more compositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-28",
"text": "The basis of the similarity calculation is unsupervised, using either string similarity (Salehi and Cook, 2013) or distributional similarity (Salehi et al., 2014) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-29",
"text": "However, the overall method is supervised, as training data is used to select the languages to aggregate scores across for a given MWE construction."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-30",
"text": "To benchmark our method, we use two of the same datasets as these two papers, and repurpose the best-performing methods of Salehi and Cook (2013) and Salehi et al. (2014) for classification of the compositionality of each MWE component."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-32",
"text": "**METHODOLOGY**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-33",
"text": "Our basic method relies on analysis of lexical overlap between the component words and the definitions of the MWE in Wiktionary, in the manner of Lesk (1986) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-34",
"text": "That is, if a given component can be found in the definition, then it is inferred that the MWE carries the meaning of that component."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-35",
"text": "For example, the Wiktionary definition of swimming pool is \"An artificially constructed pool of water used for swimming\", suggesting that the MWE is compositional relative to both swimming and pool."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-36",
"text": "If the MWE is not found in Wiktionary, we use Wikipedia as a backoff, and use the first paragraph of the (top-ranked) Wikipedia article as a proxy for the definition."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-37",
"text": "As detailed below, we further extend the basic method to incorporate three types of information found in Wiktionary: (1) definitions of each word in the definitions, (2) synonyms of the words in the definitions, and (3) translations of the MWEs and components."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-39",
"text": "**DEFINITION-BASED SIMILARITY**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-40",
"text": "The basic method uses Boolean lexical overlap between the target component of the MWE and a definition."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-41",
"text": "A given MWE will often have multiple definitions, however, begging the question of how to combine across them, for which we propose the following three methods."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-43",
"text": "**FIRST DEFINITION (FIRSTDEF):**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-44",
"text": "Use only the first-listed Wiktionary definition for the MWE, based on the assumption that this is the predominant sense."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-45",
"text": "All Definitions (ALLDEFS): In the case that there are multiple definitions for the MWE, calculate the lexical overlap for each independently and take a majority vote; in the case of a tie, label the component as non-compositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-46",
"text": "Idiom Tag (ITAG): In Wiktionary, there is facility for users to tag definitions as idiomatic."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-47",
"text": "1 If, for a given MWE, there are definitions tagged as idiomatic, use only those definitions; if there are no such definitions, use the full set of definitions."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-49",
"text": "**SYNONYM-BASED DEFINITION EXPANSION**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-50",
"text": "In some cases, a component is not explicitly mentioned in a definition, but a synonym does occur, indicating that the definition is compositional in that component."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-51",
"text": "In order to capture synonymbased matches, we optionally look for synonyms of the component word in the definition, 2 and expand our notion of lexical overlap to include these synonyms."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-52",
"text": "For example, for the MWE china clay, the definition is kaolin, which includes neither of the components."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-53",
"text": "However, we find the component word clay in the definition for kaolin, as shown below."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-54",
"text": "A fine clay, rich in kaolinite, used in ceramics, paper-making, etc."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-55",
"text": "This method is compatible with the three definition-based similarity methods described above, and indicated by the +SYN suffix (e.g. FIRSTDEF+SYN is FIRSTDEF with synonymbased expansion)."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-57",
"text": "**TRANSLATIONS**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-58",
"text": "A third information source in Wiktionary that can be used to predict compositionality is sense-level translation data."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-59",
"text": "Due to the user-generated nature of Wiktionary, the set of languages for which 1 Although the recall of these tags is low (Muzny and Zettlemoyer, 2013 translations are provided varies greatly across lexical entries."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-60",
"text": "Our approach is to take whatever translations happen to exist in Wiktionary for a given MWE, and where there are translations in that language for the component of interest, use the LCSbased method of Salehi and Cook (2013) to measure the string similarity between the translation of the MWE and the translation of the components."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-61",
"text": "Unlike Salehi and Cook (2013) , however, we do not use development data to select the optimal set of languages in a supervised manner, and instead simply take the average of the string similarity scores across the available languages."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-62",
"text": "In the case of more than one translation in a given language, we use the maximum string similarity for each pairing of MWE and component translation."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-63",
"text": "Unlike the definition and synonym-based approach, the translation-based approach will produce real rather than binary values."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-64",
"text": "To combine the two approaches, we discretise the scores given by the translation approach."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-65",
"text": "In the case of disagreement between the two approaches, we label the given MWE as non-compositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-66",
"text": "This results in higher recall and lower precision for the task of detecting compositionality."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-68",
"text": "**AN ANALYSIS OF WIKTIONARY COVERAGE**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-69",
"text": "A dictionary-based method is only as good as the dictionary it is applied to."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-70",
"text": "In the case of MWE compositionality analysis, our primary concern is lexical coverage in Wiktionary, i.e., what proportion of a representative set of MWEs is contained in Wiktionary."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-71",
"text": "We measure lexical coverage relative to the two datasets used in this research (described in detail in Section 4), namely 90 English noun compounds (ENCs) and 160 English verb particle constructions (EVPCs)."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-72",
"text": "In each case, we calculated the proportion of the dataset that is found in Wiktionary, Wiktionary+Wikipedia (where we back off to a Wikipedia document in the case that a MWE is not found in Wiktionary) and WordNet (Fellbaum, 1998) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-73",
"text": "The results are found in Table 1 , and indicate perfect coverage in Wiktionary+Wikipedia for the ENCs, and very high coverage for the EVPCs."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-74",
"text": "In both cases, the coverage of WordNet is substantially lower, although still respectable, at around 90%."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-76",
"text": "**DATASETS**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-77",
"text": "As mentioned above, we evaluate our method over the same two datasets as Salehi and Cook (2013) (which were later used, in addition to a third dataset of German noun compounds, in Salehi et al. (2014) ): (1) 90 binary English noun compounds (ENCs, e.g. spelling bee or swimming pool); and (2) 160 English verb particle constructions (EVPCs, e.g. stand up and give away)."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-78",
"text": "Our results are not directly comparable with those of Salehi and Cook (2013) and Salehi et al. (2014) , however, who evaluated in terms of a regression task, modelling the overall compositionality of the MWE."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-79",
"text": "In our case, the task setup is a binary classification task relative to each of the two components of the MWE."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-80",
"text": "The ENC dataset was originally constructed by Reddy et al. (2011) , and annotated on a continuous [0, 5] scale for both overall compositionality and the component-wise compositionality of each of the modifier and head noun."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-81",
"text": "The sampling was random in an attempt to make the dataset balanced, with 48% of compositional English noun compounds, of which 51% are compositional in the first component and 60% are compositional in the second component."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-82",
"text": "We generate discrete labels by discretising the component-wise compositionality scores based on the partitions [0, 2.5] and (2.5, 5]."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-83",
"text": "On average, each NC in this dataset has 1.4 senses (definitions) in Wiktionary."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-84",
"text": "The EVPC dataset was constructed by Bannard (2006) , and manually annotated for compositionality on a binary scale for each of the head verb and particle."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-85",
"text": "For the 160 EVPCs, 76% are verb-compositional and 48% are particlecompositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-86",
"text": "On average, each EVPC in this dataset has 3.0 senses (definitions) in Wiktionary."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-88",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-89",
"text": "The baseline for each dataset takes the form of looking for a user-annotated idiom tag in the Wiktionary lexical entry for the MWE: if there is an idiomatic tag, both components are considered to be non-compositional; otherwise, both components are considered to be compositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-90",
"text": "We expect this method to suffer from low precision for two reasons: first, the guidelines given to the annotators of our datasets might be different from what Wiktionary contributors assume to be an idiom."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-91",
"text": "Second, the baseline method assumes that for any non-compositional MWE, all components must be equally non-compositional, despite the wealth of MWEs where one or more components are compositional (e.g. from the Wiktionary guidelines for idiom inclusion, 3 computer chess, basketball player, telephone box)."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-92",
"text": "We also compare our method with: (1) \"LCS\", the string similarity-based method of Salehi and Cook (2013) , in which 54 languages are used; (2) \"DS\", the monolingual distributional similarity method of Salehi et al. (2014) ; (3) \"DS+DSL2\", the multilingual distributional similarity method of Salehi et al. (2014) , including supervised language selection for a given dataset, based on crossvalidation; and (4) \"LCS+DS+DSL2\", whereby the first three methods are combined using a supervised support vector regression model."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-93",
"text": "In each case, the continuous output of the model is equal-width discretised to generate a binary classification."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-94",
"text": "We additionally present results for the combination of each of the six methods proposed in this paper with LCS, DS and DSL2, using a linear-kernel support vector machine (represented with the suffix \" COMB(LCS+DS+DSL2) \" for a given method)."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-95",
"text": "The results are based on cross-3 http://en.wiktionary.org/wiki/ Wiktionary:Idioms_that_survived_RFD validation, and for direct comparability, the partitions are exactly the same as Salehi et al. (2014) ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-96",
"text": "Tables 2 and 3 provide the results when our proposed method for detecting non-compositionality is applied to the ENC and EVPC datasets, respectively."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-97",
"text": "The inclusion of translation data was found to improve all of precision, recall and F-score across the board for all of the proposed methods."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-98",
"text": "For reasons of space, results without translation data are therefore omitted from the paper."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-99",
"text": "Overall, the simple unsupervised methods proposed in this paper are comparable with the unsupervised and supervised state-of-the-art methods of Salehi and Cook (2013) and Salehi et al. (2014) , with ITAG achieving the highest F-score for the ENC dataset and for the verb components of the EVPC dataset."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-100",
"text": "The inclusion of synonyms boosts results in most cases."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-101",
"text": "When we combine each of our proposed methods with the string and distributional similarity methods of Salehi and Cook (2013) and Salehi et al. (2014) , we see substantial improvements over the comparable combined method of \"LCS+DS+DSL2\" in most cases, demonstrating both the robustness of the proposed methods and their complementarity with the earlier methods."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-102",
"text": "It is important to reinforce that the proposed methods make no language-specific assumptions and are therefore applicable to any type of MWE and any language, with the only requirement being that the MWE of interest be listed in the Wiktionary for"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-104",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-105",
"text": "We analysed all items in each dataset where the system score differed from that of the human annotators."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-106",
"text": "For both datasets, the majority of incorrectly-labelled items were compositional but predicted to be non-compositional by our system, as can be seen in the relatively low precision scores in Tables 2 and 3 ."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-107",
"text": "In many of these cases, the prediction based on definitions and synonyms was compositional but the prediction based on translations was non-compositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-108",
"text": "In such cases, we arbitrarily break the tie by labelling the instance as non-compositional, and in doing so favour recall over precision."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-109",
"text": "Some of the incorrectly-labelled ENCs have a gold-standard annotation of around 2.5, or in other words are semi-compositional."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-110",
"text": "For example, the compositionality score for game in game plan is 2.82/5, but our system labels it as noncompositional; a similar thing happens with figure and the EVPC figure out."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-111",
"text": "Such cases demonstrate the limitation of approaches to MWE compositionality that treat the problem as a binary classification task."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-112",
"text": "On average, the EVPCs have three senses, which is roughly twice the number for ENCs."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-113",
"text": "This makes the prediction of compositionality harder, as there is more information to combine across (an effect that is compounded with the addition of synonyms and translations)."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-114",
"text": "In future work, we hope to address this problem by first finding the sense which matches best with the sentences given to the annotators."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-116",
"text": "**CONCLUSION**"
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-117",
"text": "We have proposed an unsupervised approach for predicting the compositionality of an MWE relative to each of its components, based on lexical overlap using Wiktionary, optionally incorporating synonym and translation data."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-118",
"text": "Our experiments showed that the various instantiations of our approach are superior to previous state-of-the-art supervised methods."
},
{
"sent_id": "a0730efd9575800ba779516af1f440-C001-119",
"text": "All code to replicate the results in this paper has been made publicly available at https://github.com/bsalehi/ wiktionary_MWE_compositionality."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a0730efd9575800ba779516af1f440-C001-18"
],
[
"a0730efd9575800ba779516af1f440-C001-21"
],
[
"a0730efd9575800ba779516af1f440-C001-22"
]
],
"cite_sentences": [
"a0730efd9575800ba779516af1f440-C001-18",
"a0730efd9575800ba779516af1f440-C001-21",
"a0730efd9575800ba779516af1f440-C001-22"
]
},
"@SIM@": {
"gold_contexts": [
[
"a0730efd9575800ba779516af1f440-C001-26"
],
[
"a0730efd9575800ba779516af1f440-C001-77"
],
[
"a0730efd9575800ba779516af1f440-C001-78"
]
],
"cite_sentences": [
"a0730efd9575800ba779516af1f440-C001-26",
"a0730efd9575800ba779516af1f440-C001-77",
"a0730efd9575800ba779516af1f440-C001-78"
]
},
"@USE@": {
"gold_contexts": [
[
"a0730efd9575800ba779516af1f440-C001-28"
],
[
"a0730efd9575800ba779516af1f440-C001-30"
],
[
"a0730efd9575800ba779516af1f440-C001-60"
],
[
"a0730efd9575800ba779516af1f440-C001-92"
],
[
"a0730efd9575800ba779516af1f440-C001-99"
],
[
"a0730efd9575800ba779516af1f440-C001-101"
]
],
"cite_sentences": [
"a0730efd9575800ba779516af1f440-C001-28",
"a0730efd9575800ba779516af1f440-C001-30",
"a0730efd9575800ba779516af1f440-C001-60",
"a0730efd9575800ba779516af1f440-C001-92",
"a0730efd9575800ba779516af1f440-C001-99",
"a0730efd9575800ba779516af1f440-C001-101"
]
},
"@DIF@": {
"gold_contexts": [
[
"a0730efd9575800ba779516af1f440-C001-61"
],
[
"a0730efd9575800ba779516af1f440-C001-78"
],
[
"a0730efd9575800ba779516af1f440-C001-99"
]
],
"cite_sentences": [
"a0730efd9575800ba779516af1f440-C001-61",
"a0730efd9575800ba779516af1f440-C001-78",
"a0730efd9575800ba779516af1f440-C001-99"
]
}
}
},
"ABC_155920441b8e81dff4e2b8e110383d_8": {
"x": [
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-2",
"text": "In this paper, we investigate the effects of using subword information in representation learning."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-3",
"text": "We argue that using syntactic subword units effects the quality of the word representations positively."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-4",
"text": "We introduce a morpheme-based model and compare it against to word-based, characterbased, and character n-gram level models."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-5",
"text": "Our model takes a list of candidate segmentations of a word and learns the representation of the word based on different segmentations that are weighted by an attention mechanism."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-6",
"text": "We performed experiments on Turkish as a morphologically rich language and English with a comparably poorer morphology."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-7",
"text": "The results show that morpheme-based models are better at learning word representations of morphologically complex languages compared to character-based and character ngram level models since the morphemes help to incorporate more syntactic knowledge in learning, that makes morphemebased models better at syntactic tasks."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-10",
"text": "The distributional hypothesis of Harris (1954) has been used to motivate work on vector space models to learn word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-11",
"text": "Deep learning models learn another kind of vector space model for building word representations, which shows superior performance in representing words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-12",
"text": "Although deep neural networks have been very successful in representing words via such vectors, those models have not been very successful at estimating the representations of rare words since they do not appear often enough to allow us to collect reliable statistics about their context."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-13",
"text": "Morphologically complex words are also rare by definition."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-14",
"text": "Cao and Rei (2016) state that a word like unbelievableness does not exist in the first 17 million words of Wikipedia."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-15",
"text": "Some methods have been proposed to deal with the sparsity issue in learning word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-16",
"text": "One approach is to utilize the subword information such as characters, character n-grams, or morphemes rather than learning distinct word representations without considering the inner structure of words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-17",
"text": "Character-based models usually learn better word representations compared to word-based models since they capture the regularities inside the words so that it mitigates the sparsity in representation learning."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-18",
"text": "However, those models learn the representations through the characters that do not correspond to a syntactic or semantic unit."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-19",
"text": "In Turkish, two words can have similar word representations under a character-based model just because of their common suffixes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-20",
"text": "For example, character-based models such as (Bojanowski et al., 2017) generate similar word representations for words that have common character n-grams such as kitaplardan (from the books) and kasaplardan (from the butchers) (where lar and dan are suffixes, kitap and kasap are the roots) although the two words are semantically not related at all."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-21",
"text": "Another problem we observed for the characterbased models is that such models estimate distant representations for words that are semantically related but involve different forms of the same morpheme so called allomorphs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-22",
"text": "This is one of the consequences of vowel harmony in some languages like Turkish."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-23",
"text": "We observed this through several semantic similarity tasks performed on semantically similar but orthographically different words by using the word representations obtained from character n-gram level models such as fasttext (Bojanowski et al., 2017) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-24",
"text": "For example, Turkish words such as mavililerinki (of the ones with the blue color) and sar\u0131l\u0131lar\u0131nki (of the ones with the yellow color) with allomorphs li and l\u0131; ler and lar; in and \u0131n are asserted to be distant from each other in regard to their word representations under a character n-gram level model such as fasttext (Bojanowski et al., 2017) , although the two words are semantically similar and both referring to colors."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-25",
"text": "In this paper, we argue that learning word representations through morphemes rather than characters lead to more accurate word vectors especially in morphologically complex languages."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-26",
"text": "Such character-based models are strongly affected by the orthographic commonness of words, that governs orthographically similar words to have similar word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-27",
"text": "We introduce a model to learn morpheme and word representations especially for morphologically very complex words without using an external supervised morphological segmentation system."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-28",
"text": "Instead, we use an unsupervised segmentation model to initialize our model with a list of candidate morphological segmentations of each word in the training data."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-29",
"text": "We do not provide a single segmentation per word like others (Botha and Blunsom, 2014; Qiu et al., 2014) , but instead we provide a list of potential segmentations of each word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-30",
"text": "Therefore, our model relaxes the requirement of an external segmentation system in morpheme-based representation learning."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-31",
"text": "To our knowledge, this will be the first attempt in co-learning of morpheme representations and word representations in an unsupervised framework without assuming a single morphological segmentation per word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-32",
"text": "Our model is mostly similar to that of Lazaridou et al. (2013) and Botha and Blunsom (2014) since we also aim to learn morpheme and word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-33",
"text": "Our model is akin to that of Pinter et al. (2017) from the training perspective since they infer the out-of-vocabulary word embeddings from pre-trained word embeddings."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-34",
"text": "Here, we also try to mimic the word2vec (Mikolov et al., 2013) embeddings (i.e. that are the expected outputs of the model) to learn the rare word representations with a complex morphology."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-35",
"text": "Our model shows some architectural similarities to that of Cao and Rei (2016) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-36",
"text": "Both models use the attention mechanism to up-weight the correct morphological segmentation of a word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-37",
"text": "However, their model is character-based and our model is morpheme-based where different segmentations of each word contribute to the resulting vector."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-38",
"text": "It should be noted that our main concern is to investigate what character-based models cannot learn that the morpheme-based models learn."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-39",
"text": "As for the experimental setting, we have chosen Turkish language that has a complex morphology and severe allomorphy."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-40",
"text": "The results show that a morpheme-based model is better at estimating word representations of morphologically complex words (with at least 2-3 suffixes) compared to other word-based and character-based models."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-41",
"text": "We present experimental results on Turkish as an agglutinative language and English as a morphologically poor language."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-43",
"text": "**RELATED WORK**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-44",
"text": "Classical word representation models such as word2vec (Mikolov et al., 2013) have been successful in learning word representations for frequent words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-45",
"text": "Since these classical models are based on collecting contextual information in a very large corpus, they estimate deficient word representations for rare words due to insufficient contextual information."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-46",
"text": "This has a negative consequence in some natural language processing tasks that make use of the word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-47",
"text": "One approach to overcome this deficiency in estimating rare word representations is to apply compositional methods."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-48",
"text": "Each word comprises of different subword units, such as characters, character n-grams, or morphemes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-49",
"text": "Lazaridou et al. (2013) apply compositional methods by having the stem and affix representations in order to estimate the distributional representation of morphologically complex words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-50",
"text": "Bojanowski et al. (2017) introduce an extension to word2vec (Mikolov et al., 2013) by representing each word in terms of the vector representations of its n-grams, which was earlier applied by Sch\u00fctze (1993) that learns the representations of fourgrams by applying singular value decomposition (SVD)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-51",
"text": "Analogously, Alexandrescu and Kirchhoff (2006) represent each character n-gram with a vector representation and words are estimated by the summation of the subword representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-52",
"text": "Their results show that compositional methods that are originally proposed for estimating the meaning of phrases can also be used for estimating the meaning of a word by combining the information coming from different subword units."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-53",
"text": "Botha and Blunsom (2014) introduce Compositional models that use character-level features show that the representations of rare words can be estimated more accurately (in both semantic and syntactic tasks) than the word-based models since the character-level models share more features across different words that helps to mitigate sparsity."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-54",
"text": "Cotterell and Sch\u00fctze (2015) encode morphological tags within word embeddings by using a log-bilinear model, thereby leading morphologically similar words to have closer word representations in the embedding space."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-55",
"text": "Luong et al. (2013) learn word representations based on morphemes that are obtained from an external morphological segmentation system."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-56",
"text": "Collobert et al. (2011) enhance word vectors with some character-level features such as capitalization."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-57",
"text": "Bhatia et al. (2016) incorporate morphological information as a prior distribution to improve word embeddings."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-58",
"text": "They use Morfessor (Creutz and Lagus, 2002) as an external morphological segmentation system to extract the inner structure of words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-60",
"text": "**THE MORPH2VEC MODEL**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-61",
"text": "In our morpheme-based model, a word is encoded by a sequence of morphemes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-62",
"text": "Each word w s i with a particular morphological segmentation s i is represented by a list of morphemes m = {m 0 , m 1 , ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-63",
"text": ". . , m n } as follows:"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-64",
"text": "We assume that the correct morphological segmentation of a word is not known a priori by assuming a completely unsupervised learning model."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-65",
"text": "We use an unsupervised neural segmentation algorithm (\u00dcst\u00fcn and Can, 2016 ) that generates a list of candidate segmentations for a given word (see Section 4 for the details)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-66",
"text": "Each distinct morpheme is defined by a column vector in a morpheme embedding matrix W m \u2208 R d morph \u00d7|M | where d morph is the vector dimension for the morphemes and M is the set of all pseudo morphemes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-67",
"text": "Word representations are coupled with a particular morphological segmentation of each word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-68",
"text": "In other words, each segmentation of a single word has its own representation."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-69",
"text": "Word representation for each particular segmentation is learned by a sequential function f that takes a sequence of morphemes and generates the word representation with a dimension of d word ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-70",
"text": "The word embedding that is to be estimated compositionally via its morphemes that belong to segmentation s i is denoted by v s i and estimated by a function f as follows:"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-71",
"text": "where v m 0 denotes the vector of m 0 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-72",
"text": "We use bidirectional LSTMs (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) to estimate a trainable function f in our neural network architecture that is illustrated in Figure 1 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-73",
"text": "In the forward LSTMs, morphemes from the beginning till the end of the word are given sequentially, whereas in the backward LSTMs, morphemes from the end till the beginning of the word are given in the reverse order."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-74",
"text": "Each output of Bi-LSTM which is the concatenation of the outputs of the forward and backward LSTMs represents a particular segmentation of a given word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-75",
"text": "Therefore, we train the model with a list of potential segmentations of each word in training data."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-76",
"text": "Since a word is represented by different morpheme sequences that refer to different segmentations of the same word, we use an attention model over these sequences that are learned by the BiLSTMs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-77",
"text": "Attention model learns a weight \u03b1 i for each segmentation, such that weighted sum of the embeddings of all candidate segmentations:"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-78",
"text": "where v s i is the vector for segmentation s i that is the output of a Bi-LSTM."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-79",
"text": "The weight \u03b1 i is estimated as follows (Bahdanau et al., 2014) :"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-80",
"text": "Here, a feed-forward layer is used with a softmax function that is applied over the outputs of Bi-LSTMs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-81",
"text": "W \u00b7v s i denotes the corresponding column in the weight matrix of the feed-forward layer in the attention."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-82",
"text": "For training, we use the pre-trained word2vec (Mikolov et al., 2013) vectors in order to minimize the cost between the learned and pre-trained vectors with the following objective function:"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-83",
"text": "where h(w k ) is the cost for the kth word w k in a training set of size N with a L2 regularization term on the model parameters \u03b8."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-84",
"text": "We use the cosine proximity loss between the learned and the pretrained vector."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-86",
"text": "**NEURAL MORPHOLOGICAL SEGMENTATION**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-87",
"text": "Although it is possible to train the model by providing all the potential segmentations of each word, we utilize an unsupervised segmentation algorithm to make the model computationally more efficient by reducing the search space."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-88",
"text": "The segmentation algorithm is based on the neural model by\u00dcst\u00fcn and Can (2016) that uses the semantic similarity between substrings of a word to detect the potential morpheme boundaries."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-89",
"text": "This algorithm is based on the idea that the meaning of a word is preserved especially through inflection s 0 araba -lar\u0131m -\u0131n s 1 arabalar -\u0131 -m\u0131n s 2 arabalar -\u0131m -\u0131n s 3 araba -lar -\u0131 -m\u0131n s 4 araba -lar -\u0131m -\u0131n Table 2 : The cosine similarities between the substrings (parent-child) of the Turkish word arabalar\u0131m\u0131n (of my cars)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-90",
"text": "0.25 is assigned for the cosine similarity threshold and only the splits above the threshold are listed."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-91",
"text": "and it benefits from the word representations to utilize this preservation."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-92",
"text": "The parent-child relations such as (respect,respectful) are defined similar to that of Narasimhan et al. (2015) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-93",
"text": "The algorithm begins by generating all possible segmentations where there are at most K segments 1 (see Table 1 )."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-94",
"text": "Then, the algorithm checks the semantic similarity at each split point (between the parent and its child) whether it is greater than a threshold 2 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-95",
"text": "If the condition is satisfied for all split points in a segmentation, the segmentation is added to the segmentations list that will be passed to a Bi-LSTM."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-96",
"text": "Figure 2 illustrates an example for the segmentation algorithm on the Turkish word arabalar\u0131m\u0131n (of my cars)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-97",
"text": "# denotes a function that takes two words and returns true if the cosine similarity between two substrings is above the threshold value."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-98",
"text": "The cosine similarities between the substrings of the word arabalar\u0131m\u0131n are given in Table 2 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-99",
"text": "Here we use the segmentation algorithm for mainly training purposes because the accuracy of the algorithm has a strong impact on generating word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-100",
"text": "Since all possible morphemes are not generated in training, if the segmentation algorithm generates an unknown morpheme in testing, the representation for that word involving the unknown suffix cannot be generated."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-101",
"text": "In order to ensure that all morphemes have a rep-resentation, we use an external supervised segmentation system for only testing purposes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-102",
"text": "Another reason is that due to incorrect segmentations suggested by the unsupervised segmentation algorithm, two words (semantically related) involving the same set of suffixes cannot benefit from the syntactic similarity and therefore the representations of those words might diverge in testing."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-104",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-105",
"text": "We performed several experiments to assess the quality of our morpheme and word embeddings."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-106",
"text": "We did experiments on Turkish as a highly agglutinative language with a very complex morphology and English with a comparably poor morphology."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-108",
"text": "**EXPERIMENTAL SETTING**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-109",
"text": "In all experiments, morpheme vectors have a dimension of d morph = 75, while the forward and backward LSTMs have a dimension of d LST M = 300."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-110",
"text": "Since the output of the Bi-LSTMs is the concatenation of the forward and backward LSTMs, the Bi-LSTM output has a dimensionality of d biLST M = 600."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-111",
"text": "The output of the Bi-LSTMs is reduced to half after feeding the output through a feed-forward layer that results with a word vector dimension of d word = 300."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-112",
"text": "Our model is implemented in Keras, and publicly available 3 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-113",
"text": "For the pre-trained word vectors, we used the word vectors of dimension 300 that were obtained by training word2vec (Mikolov et al., 2013) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-114",
"text": "For Turkish, we trained word2vec on Boun corpus (Sak et al., 2008 ) that contains 361 million word tokens."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-115",
"text": "For English, we used the Google's pre-trained word2vec model 4 that was trained on 100 billion words with a vocabulary size of 3M."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-116",
"text": "For training of our model, we used the most frequent 200K words from the pre-trained vocabularies to filter out the noise for both languages."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-117",
"text": "In order to compare the quality of our embeddings against the embeddings obtained from character n-gram level model fasttext (Bojanowski et al., 2017) , we used the pre-trained word vectors trained on Wikipedia (Bojanowski et al., 2017) and we used the Google's pre-trained word vectors 5 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-118",
"text": "In order to compare our model with the character-based model by Cao and Rei (2016) , we used Text8 corpus 6 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-119",
"text": "Model en tr word2vec (Mikolov et al., 2013) Only for testing reasons, we used PC-KIMMO (Koskenniemi, 1984) for English and the two-level Turkish morphology (Ak\u0131n and Ak\u0131n, 2007) for Turkish in order to segment test sets to obtain the actual morphemes for generating word representations from the morpheme vectors that are learned in a fully unsupervised setting."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-120",
"text": "Unsupervised segmentation system also could be used for the evaluation step, but we wanted to minimize the effect of incorrect segmentations to be able to evaluate the embeddings properly."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-121",
"text": "Yet, we discuss the effect of the supervised vs unsupervised segmentations in Section 5.5."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-122",
"text": "We did only intrinsic evaluation with a set of experiments that assess the quality of the word and morpheme representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-124",
"text": "**EVALUATION OF WORD REPRESENTATIONS: WORD SIMILARITY RESULTS**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-125",
"text": "In order to evaluate the quality of the word vectors, we did experiments on a list of word pairs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-126",
"text": "We computed the cosine similarity between the learned vectors of each word pair and compared the similarity scores against to human judgments."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-127",
"text": "We used the Set 2 in WordSim353 dataset (Finkelstein et al., 2001 ) for the semantic similarity experiments that already involves the human judgment scores from 1 to 10 for 200 English word pairs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-128",
"text": "Since there is no available word-pair list for Turkish, we prepared WordSimTr 7 that involves 138 word pairs and asked 15 human annotators to judge how similar two words are on a fixed scale from 1 to 10 where 1 shows a poor semantic similarity between the two words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-129",
"text": "Our Turkish word pair list involves two groups of words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-130",
"text": "The first group involves 81 semantically similar words that have at least two suffixes (possibly allomorphs)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-131",
"text": "An example pair is televizyonlarda (on the televi-"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-133",
"text": "**MODEL**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-134",
"text": "WordSim353 RW char2vec (Cao and Rei, 2016) 0.345 0.284 morph2vec 0.386 0.297 sions) and radyolarda (on the radios) that have lar (for the plural) and da (for the locative case) with a semantically similar stem pair."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-135",
"text": "The second group involves 57 semantically unrelated word pairs that are orthographically similar through their suffixes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-136",
"text": "An example word pair in this group is kitaplardan (from the books) and kasaplardan (from the butchers) with two suffixes lar (for the plural) and dan (for the ablative case) with semantically unrelated two stems kitap (the book) and kasap (the butcher)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-137",
"text": "Some other example word pairs in the Turkish word pair list is given in Table 5 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-138",
"text": "As seen on the table, our morpheme-based model is better at learning word representations with multiple suffixes."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-139",
"text": "The results are given in Table 3 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-140",
"text": "English words mostly do not involve any suffixes, which hinders our model's performance."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-141",
"text": "However, our model performs better than both fasttext (Bojanowski et al., 2017) and word2vec (Mikolov et al., 2013) on Turkish despite the highly agglutinative morphological structure of the language."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-142",
"text": "It shows that our model learns better word representations for morphologically complex words, whereas words with no suffixes are not estimated as good as the complex ones."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-143",
"text": "We also compared our model against the character-based model char2vec (Cao and Rei, 2016) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-144",
"text": "For this purpose, we trained our model on the same dataset and parameters as char2vec to be able to compare with their reported results."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-145",
"text": "The dataset is called Text8 corpus and consists of the first 100mb of a cleaned-up dumb of Wikipedia in 2006."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-146",
"text": "For the evaluation, we tested our word embeddings on Rare Words (RW) (Luong et al., 2013) and Wordsim353 (Finkelstein et al., 2001) datasets."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-147",
"text": "The results are given in Table 4 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-148",
"text": "Our results outperform char2vec (Cao and Rei, 2016) on both word similarity test sets."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-149",
"text": "This shows that our model learns better word embeddings for both in-vocabulary and rare words compared to char2vec (Cao and Rei, 2016) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-151",
"text": "**EVALUATION OF WORD REPRESENTATIONS: ANALOGY RESULTS**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-152",
"text": "We performed experiments for the analogy task in order to test whether the suffixes make a linear numerical change on the word vectors in the embedding space."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-153",
"text": "The analogy experiments are usually performed for a triple of words such that A is to B so C is to ?, where A-B+C is expected to be equal to the questioned word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-154",
"text": "The analogy can be semantical such as cat is to meow, so dog is to bark, or syntactic such as go is to gone, so have is to had."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-155",
"text": "Here, we tested only the syntactic analogy on a list of word tuples since our focus is especially morphologically complex languages."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-156",
"text": "For English, we used the syntactic relations section provided in the Google analogy dataset (Mikolov et al., 2013) that involves 10675 questions."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-157",
"text": "Since there is no analogy dataset for Turkish, we prepared a Turkish analogy set SynAnalogyTr 8 with 206 syntactic questions that involves inflected word forms."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-158",
"text": "The syntactic word tuples are judged by 40 human annotators in a scale from 1 to 10, where 1 shows a weak word analogy."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-159",
"text": "Most words involve more than one suffix to test the morphological regularity in the analogy task."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-160",
"text": "The results are given in Table 6 and Table 7 for English and Turkish."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-161",
"text": "The results show that our model outperforms both word2vec (Mikolov et al., 2013) and fasttext (Bojanowski et al., 2017) on both Turkish and English languages."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-162",
"text": "Additionally, some examples to analogy results are given in Table 9 and the nearest neighbors of the Turkish word kitap-lar-dan-m\u0131\u015f (it was from the books) are given in Table 8 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-164",
"text": "**EVALUATION OF MORPHEME REPRESENTATIONS: ALLOMORPH RESULTS**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-165",
"text": "In addition to the evaluation of the word vectors, we also evaluated the morpheme vectors that are the input embeddings to the neural network to be estimated during training."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-166",
"text": "In order to evaluate how well our morpheme vectors represent the morphemes, we used the allomorphs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-167",
"text": "Allomorphs can be considered as true synonyms as they convey the same meaning with each other but with a different orthography."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-168",
"text": "Word Pair Human Score morph2vec word2vec fasttext kitap-lar-dan / kasap-lar-dan 0.12 0.07 0.19 0.55 (from the books) / (from the butchers) kag\u0131t-ta-ki-ler / bardak-ta-ki-ler 0.22 0.12 OOW 0.64 (the ones on the paper) / (the ones in the glass) \u015firket-ler-de / firma-lar-da 0.87 0.82 0.46 0.76 (in the companies) / (in the firms) kazanan-lar-d\u0131 / yenilen-ler-di 0.64 0.60 OOW 0.43 (they were the winners) / (they were the defeated ones) Table 5 : Example Turkish word pairs and their similarities based on human judgements, morph2vec, word2vec (Mikolov et al., 2013) , and fasttext (Bojanowski et al., 2017) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-169",
"text": "-denotes the morpheme boundaries."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-171",
"text": "**MODEL**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-172",
"text": "Accuracy (%) word2vec (Mikolov et al., 2013) 74.0 fasttext (Bojanowski et al., 2017) 74.9 morph2vec 80.5 Table 6 : Analogy results on English Google syntactic analogy dataset."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-173",
"text": "Model Accuracy (%) word2vec (Mikolov et al., 2013) 16.0 fasttext (Bojanowski et al., 2017) 65.5 morph2vec 71.3 Table 7 : Analogy results on Turkish syntactic analogy dataset SynAnalogyTr."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-174",
"text": "In Turkish, there is a common use of allomorphs due to the vowel and consonant harmony in the language."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-175",
"text": "For example, the morpheme d\u0131 has got 8 allomorphs one of which is chosen depending on the last vowel and the consonant in the word, that are di, du, d\u00fc, ti, tu, t\u00fc, and t\u0131."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-176",
"text": "For example, -ti (the past tense of the third person singular) is chosen, for the verb git-(mek) (to go), whereas du is chosen for the verb solu-(mak) (to breathe)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-177",
"text": "We prepared a Turkish dataset that involves 108 morphemes 9 that are allomorphs of 33 unique morpheme types including tense and case markers as well as derivations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-178",
"text": "For the evaluation of allomorphs, we used the MAP metric that is often used in information retrieval tasks."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-179",
"text": "For each allomorph set in the data, we calculated the MAP@k where k is the number of allomorphs for the given morpheme."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-180",
"text": "If the allomorph of a morpheme ex-9 http://nlp.cs.hacettepe.edu.tr/projects/morph2vec/ kitap-lar-dan-m\u0131\u015f Cosine sim. (it was from the books) yaz\u0131lm\u0131\u015f (it was written) 0.669 hikaye-ler-den (from the stories) 0.667 kitap-lar (the books) 0.661 kitap-lar-dan (from the books) 0.639 kitap-lar-da (in the books) 0.635 roman-lar-dan (from the novels) 0.625 Table 8 : Nearest neighbors of the word kitaplar-dan-m\u0131\u015f and the cosine similarity between the word and the neighbor that are obtained from morph2vec word vectors."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-181",
"text": "ists in the k nearest neighbours, then it is regarded as correct, otherwise it is incorrect."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-182",
"text": "We averaged the MAP@k scores for all allomorph sets."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-183",
"text": "The results are given in Table 10 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-184",
"text": "Our model can learn the morpheme representations better than fasttext (Bojanowski et al., 2017) since allomorphs in our model are closer to each other in the embedding space compared to fasttext."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-185",
"text": "Some of the allomorphs obtained from our model and fasttext (Bojanowski et al., 2017) are given in Table 11."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-186",
"text": "As seen on the table, our model can capture the allomorphs better than fasttext (Bojanowski et al., 2017) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-187",
"text": "Additionally, all Turkish allomorphs learned by our model are given in Figure 3 ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-188",
"text": "As can be seen from the figure, the allomorphs fall into similar regions in the vector space."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-189",
"text": "Apart from some infrequent morphemes, the rest has similar representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-190",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-191",
"text": "**THE EFFECT OF SUPERVISION**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-192",
"text": "In our experiments, the model training is performed in a fully unsupervised setting in terms of (to be opened) (to cover) (to cover himself) Table 9 : Example Turkish analogy questions and the cosine similarities between the expected words and the learned word representations obtained from morph2vec, word2vec and fasttext."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-193",
"text": "Model MAP fasttext (Bojanowski et al., 2017) 0.504 morph2vec 0.618 Table 10 : MAP scores for the allomorph coverage in fasttext (Bojanowski et al., 2017) and the morph2vec."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-194",
"text": "morph morph2vec fasttext iyor \u0131yor yor uyor\u00fcyor \u0131yor yor uyor\u00fcyor m\u0131 mu mi m\u00fc mi mu \u0131yor d\u0131 t\u0131 di du tu d\u00fc ti duk d\u0131r t\u0131 d\u00fc di \u0131n t\u0131r \u0131 m\u0131\u015f m\u00fc\u015f mu\u015f m\u00fc\u015f yor d\u0131k Table 11 : Some allomorphs of the given morpheme on the left that are found by our model and fasttext (Bojanowski et al., 2017) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-195",
"text": "The bold font indicates the non-allomorphs for the given morpheme type."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-196",
"text": "the segmentation algorithm."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-197",
"text": "However, we used supervised methods for segmenting test sets in evaluation to generate word representations from the actual morphemes that are learned in the training."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-198",
"text": "We conducted another set of experiments with both supervised and unsupervised segmentation algorithms to show the effect of the segmentation algorithm used to generate the word embeddings in the word similarity test sets."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-199",
"text": "Table 12 demonstrates the effect of the segmentation algorithm."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-200",
"text": "Here, we employed the neural unsupervised model by\u00dcst\u00fcn and Can (2016) and the supervised segmentation system Zemberek (Ak\u0131n and Ak\u0131n, 2007) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-201",
"text": "The results show that we can generate better word embeddings when the morphemes are extracted by a supervised segmentation algorithm beforehand although the difference is not significant."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-202",
"text": "Therefore the supervised segmentation al-"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-203",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-204",
"text": "**MODEL**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-205",
"text": "Spearman Unsupervised (\u00dcst\u00fcn and Can, 2016) 0.517 Supervised (Ak\u0131n and Ak\u0131n, 2007) 0.529 gorithm used in testing can be replaced with an unsupervised segmentation algorithm."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-206",
"text": "However, it should be noted that when an unsupervised algorithm used, the possibility to come across with a segment that is not present in our model increases, hence we cannot generate the word embeddings for such words."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-207",
"text": "Therefore, the results for the unsupervised setting is limited to only in-vocabulary words (i.e the words for which we can create a word embedding)."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-208",
"text": "----------------------------------"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-209",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-210",
"text": "Recent work shows that character level models learn more representative word embeddings for rare words (including morphologically complex words) compared to word level models, which is a sign that incorporating subword information improves the word representations."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-211",
"text": "However, in this paper, we argued that morpheme-based representation models can learn better word embeddings (especially for the syntactic tasks) since they incorporate the syntactic and semantic information through the morphemes better compared to character level models."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-212",
"text": "We pointed to the poor representation of allomorphs in complex words where the character-level models estimate a low word similarity between semantically similar words with different forms of the same morpheme, i.e. allomorphs."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-213",
"text": "Moreover, we pointed to the character level models that assign a high word similar- ity to the words that are orthographically similar but semantically unrelated."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-214",
"text": "We introduce a morpheme-based representation model that learns word embeddings through the morphemes that are obtained from a list of morphological segmentations for each word."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-215",
"text": "Therefore, our work introduces the idea of releasing the need for using an external morphological segmentation system in such representation learning models that are based on subword information."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-216",
"text": "Our morpheme-based model morph2vec learns better word representations for morphologically complex words compared to the word-based model word2vec (Mikolov et al., 2013) , character-based model char2vec (Cao and Rei, 2016) , and the character n-gram level model fasttext (Bojanowski et al., 2017) ."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-217",
"text": "Our results are also competitive for the English language."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-218",
"text": "We leave other languages and experiments such as morphological segmentation task for the future work."
},
{
"sent_id": "155920441b8e81dff4e2b8e110383d-C001-219",
"text": "Another goal is to perform extrinsic evaluation on a different task such as part-of-speech tagging using the learned word embeddings."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"155920441b8e81dff4e2b8e110383d-C001-34"
],
[
"155920441b8e81dff4e2b8e110383d-C001-82"
],
[
"155920441b8e81dff4e2b8e110383d-C001-113"
],
[
"155920441b8e81dff4e2b8e110383d-C001-156"
]
],
"cite_sentences": [
"155920441b8e81dff4e2b8e110383d-C001-34",
"155920441b8e81dff4e2b8e110383d-C001-82",
"155920441b8e81dff4e2b8e110383d-C001-113",
"155920441b8e81dff4e2b8e110383d-C001-156"
]
},
"@BACK@": {
"gold_contexts": [
[
"155920441b8e81dff4e2b8e110383d-C001-44"
],
[
"155920441b8e81dff4e2b8e110383d-C001-50"
]
],
"cite_sentences": [
"155920441b8e81dff4e2b8e110383d-C001-44",
"155920441b8e81dff4e2b8e110383d-C001-50"
]
},
"@DIF@": {
"gold_contexts": [
[
"155920441b8e81dff4e2b8e110383d-C001-141"
],
[
"155920441b8e81dff4e2b8e110383d-C001-161"
],
[
"155920441b8e81dff4e2b8e110383d-C001-216"
]
],
"cite_sentences": [
"155920441b8e81dff4e2b8e110383d-C001-141",
"155920441b8e81dff4e2b8e110383d-C001-161",
"155920441b8e81dff4e2b8e110383d-C001-216"
]
}
}
},
"ABC_bb5e6e32d7e507bc6d943719c02902_8": {
"x": [
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-78",
"text": "We think this is the main reason why gate sparsification does not help here."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-101",
"text": "2 Except for the first few epochs because ct is initialized with 0 value."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-102",
"text": "[17] Zhang, X., Zhao, J., and LeCun, Y. (2015)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-2",
"text": "Recently, a lot of techniques were developed to sparsify the weights of neural networks and to remove networks' structure units, e. g. neurons."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-3",
"text": "We adjust the existing sparsification approaches to the gated recurrent architectures."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-4",
"text": "Specifically, in addition to the sparsification of weights and neurons, we propose sparsifying the preactivations of gates."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-5",
"text": "This makes some gates constant and simplifies LSTM structure."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-6",
"text": "We test our approach on the text classification and language modeling tasks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-7",
"text": "We observe that the resulting structure of gate sparsity depends on the task and connect the learned structure to the specifics of the particular tasks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-8",
"text": "Our method also improves neuron-wise compression of the model in most of the tasks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-9",
"text": "* Equal contribution 33rd NeurIPS"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-12",
"text": "Recurrent neural networks (RNNs) yield high-quality results in many applications but often are memory-and time-consuming due to a large number of parameters."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-13",
"text": "A popular approach for RNN compression is sparsification (setting a lot of weights to zero), it may compress RNN orders of times with only a slight quality drop or even with quality improvement due to the regularization effect [11] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-14",
"text": "Sparsification of the RNN is usually performed either at the level of individual weights (unstructured sparsification) [13, 11, 1] or at the level of neurons [14] (structured sparsification -removing weights by groups corresponding to neurons)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-15",
"text": "The latter additionally accelerates the testing stage."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-16",
"text": "However, most of the modern recurrent architectures (e. g. LSTM [3] or GRU [2] ) have a gated structure."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-17",
"text": "We propose to add an intermediate level of sparsification between individual weights [1] and neurons [14] -gates (see fig. 1 , left)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-18",
"text": "Precisely, we remove weights by groups corresponding to gates, which makes some gates constant, independent of the inputs, and equal to the activation function of the bias."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-19",
"text": "As a result, the LSTM/GRU structure is simplified."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-20",
"text": "With this intermediate level introduced, we obtain a three-level sparsification hierarchy: sparsification of individual weights helps to sparsify gates (make them constant), and sparsification of gates helps to sparsify neurons (remove them from the model)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-21",
"text": "The described idea can be implemented for any gated architecture in any sparsification framework."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-22",
"text": "We implement the idea for LSTM in two frameworks: pruning [14] and Bayesian sparsification [1] and observe that resulting gate structures (which gates are constant and which are not) vary for different NLP tasks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-23",
"text": "We analyze these gate structures and connect them to the specifics of the particular tasks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-24",
"text": "The proposed method also improves neuron-wise compression of the RNN in most cases."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-25",
"text": "In this section, we describe the three-level sparsification approach for LSTM."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-26",
"text": "LSTM cell is composed of input, forget and output gates (i, f , o) and information flow g (which we also call gate for brevity)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-27",
"text": "All four gates are computed in a similar way, for example, for the input gate:"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-28",
"text": "To make a gate constant, we need to zero out a corresponding row of the LSTM weight matrix W (see dotted horizontal lines in fig. 2 )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-29",
"text": "We do not sparsify biases because they do not take up much memory compared to the weight matrices."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-30",
"text": "For example, if we set the k-th row of matrices W x i and W h i to zero, there are no ingoing connections to the corresponding gate, so the k-th input gate becomes constant, independent of x t and h t\u22121 and equal to sigm(b i,k )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-31",
"text": "As a result, we do not need to compute the k-th input gate on a forward pass and can use a precomputed value."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-32",
"text": "We can construct the mask (whether the gate is constant or not) and use it to insert constant values into gate vectors i, f, g, o. This lessens the amount of computations on the forward pass."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-33",
"text": "To remove a neuron, we need to zero out a corresponding column of the LSTM weight matrix W and of the next layer matrix (see solid vertical lines in fig. 2 )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-34",
"text": "This ensures that there are no outgoing connections from the neuron, and the neuron does not affect the network output."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-35",
"text": "To sum up, our three-level hierarchy of gated RNN sparsification works as follows."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-36",
"text": "Ideally, our goal is to remove a hidden neuron, this leads to the most effective compression and acceleration."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-37",
"text": "If we don't remove the hidden neuron, some of its four gates may become constant; this also saves computation and memory."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-38",
"text": "If some gate is still non-constant, some of its weights may become zero; this reduces the size of the model."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-39",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-40",
"text": "**IMPLEMENTATION OF THE IDEA**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-41",
"text": "We incorporate the proposed intermediate level of sparsification into two sparsification frameworks (more details are given in Appendices A and B):"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-42",
"text": "Pruning."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-43",
"text": "We apply Lasso to individual weights and group Lasso [15] to five groups of the LSTM weights (four gate groups and one neuron group, see fig. 2 )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-44",
"text": "We use the same pruning algorithm as in Intrinsic Sparse Structure (ISS) [14] , a structured pruning approach developed specifically for LSTM."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-45",
"text": "In contrast to our approach, they do not sparsify gates, and remove a neuron if all its ingoing and outgoing connections are set to zero."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-46",
"text": "Bayesian sparsification."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-47",
"text": "We rely on Sparse Variational Dropout [10, 1] to sparsify individual weights."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-48",
"text": "Following [5] , for each neuron, we introduce a group weight which is multiplied by the output of this neuron in the computational graph (setting to zero this group weight entails removing the neuron)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-49",
"text": "To sparsify gates, for each gate we introduce a separate group weight which is multiplied by the preactivation of the gate before adding a bias (setting to zero this group weight makes the gate constant)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-50",
"text": "[1] with additional group weights for neurons [5] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-52",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-53",
"text": "In the pruning framework, we perform experiments on word-level language modeling (LM) on a PTB dataset [7] following ISS [14] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-54",
"text": "We use a standard model of Zaremba et al. [16] of two sizes (small and large) with an embedding layer, two LSTM layers, and a fully-connected output layer (Emb + 2 LSTM + FC)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-55",
"text": "Here regularization is applied only to LSTM layers following [14] , and its strength is selected using grid search so that qualities of ISS and our model are approximately equal."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-56",
"text": "In the Bayesian framework, we perform an evaluation on the text classification (datasets IMDb [6] and AGNews [17]) and language modeling (dataset PTB, character and word level tasks) following [1] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-57",
"text": "The architecture for the character-level LM is LSTM + FC, for the text classification is Emb + LSTM + FC on the last hidden state, for the word level LM is the same as in pruning."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-58",
"text": "Here we regularize and sparsify all layers following [1] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-59",
"text": "Sizes of LSTM layers may be found in tab. 1."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-60",
"text": "Embedding layers have 300/200/1500 neurons for classification tasks/small/large word level LM."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-61",
"text": "More experimental details are given in Appendix C."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-63",
"text": "**QUANTITATIVE RESULTS**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-64",
"text": "We compare our three-level sparsification approach (W+G+N) with the original dense model and a two-level sparsification (weights and neurons, W+N) in tab."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-65",
"text": "1. We do not compare two frameworks between each other; our goal is to show that the proposed idea improves results in both frameworks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-66",
"text": "In most experiments, our method improves gate-wise and neuron-wise compression of the model without a quality drop."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-67",
"text": "The only exception is the character-level LM, which we discuss later."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-68",
"text": "The numbers for compression are not comparable between two frameworks because in pruning only LSTM layers are sparsified while in the Bayesian framework all layers in the network are sparsified."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-70",
"text": "**QUALITATIVE RESULTS**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-71",
"text": "Below we analyze the resulting gate structure for different tasks, models and sparsification approaches."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-72",
"text": "Gate structure depends on the task."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-73",
"text": "Figure 1 , right shows the typical examples of the gate structures of the remaining hidden neurons obtained using the Bayesian approach."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-74",
"text": "We observe that the gate structure varies for different tasks."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-75",
"text": "For the word-level LM task, output gates are very important because models need both store all the information about the input in the memory and output only the current prediction at each timestep."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-76",
"text": "On the contrary, for text classification tasks, models need to output the answer only once at the end of the sequence, hence they rarely use output gates."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-77",
"text": "The character-level LM task is more challenging than the word level one: the model uses the whole gate mechanism to solve it."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-79",
"text": "As can be seen in fig. 1 , right, in the second LSTM layer of the small word-level language model, a lot of neurons have only one nonconstant gate -output gate."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-80",
"text": "We investigate the described effect and find that the neurons with only non-constant output gate learn short-term dependencies while neurons with all non-constant gates usually learn long-term dependencies."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-81",
"text": "To show that, we compute the gradients of each hidden neuron of the second LSTM layer w. r. t. the input of this layer at different lag t and average the norm of this gradient over the validation set (see fig. 3 )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-82",
"text": "The neurons with only non-constant output gate are \"short\": the gradient is large only for the latest timesteps and small for old timesteps."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-83",
"text": "On the contrary, neurons with all non-constant gates are mostly \"long\": the gradient is non-zero even for old timesteps."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-84",
"text": "In other words, changing input 20-100 steps ago does not affect \"short\" neurons too much, which is not true for the \"long' neurons."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-85",
"text": "The presence of such \"short\" neurons is expectable for the language model: neurons without memory quickly adapt to the latest changes in the input sequence and produce relevant output."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-86",
"text": "In fact, for the neurons with only non-constant output gate, the memory cell c t is either monotonically increasing or monotonically decreasing depending on the sign of constant information flow g so tanh(c t ) always equals either to \u22121 or +1 2 and h t = o t or \u2212o t ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-87",
"text": "This means these neurons are simplified to vanilla recurrent units."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-88",
"text": "For classification tasks, memorizing information about the whole input sequence until the last timestep is important, therefore information flow g is non-constant and saves information from the input to the memory."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-89",
"text": "In other words, long dependencies are highly important for the classification."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-90",
"text": "Gradient plots ( fig. 3 ) confirm this claim: the values of the neurons are strongly influenced by both old and latest inputs."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-91",
"text": "Gradients are bigger for the short lag only for one neuron because this neuron focuses not only on the previous hidden states but also on reading the current inputs."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-92",
"text": "Gate structure intrinsically exists in LSTM."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-93",
"text": "As discussed above, the most visible gate structures are obtained for IMDB classification (a lot of constant output gates and non-constant information flow) and for the second LSTM layer of the small word-level LM task (a lot of neurons with only non-constant output gates)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-94",
"text": "In our experiments, for these tasks, the same gate structures are detected even with unstructured sparsification, but with lower overall compression and less number of constant gates, see Appendix D. This shows that the gate structure intrinsically exists in LSTM and depends on the task."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-95",
"text": "The proposed method utilizes this structure to achieve better compression."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-96",
"text": "We obtain a similar effect when we compare gate structures for the small word-level LM obtained using two different sparsification techniques: Bayes W+G+N ( fig. 1, right) and Pruning W+G+N ( fig. 4, left) ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-97",
"text": "The same gates become constant in these models."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-98",
"text": "For the large language model ( fig. 4, right) , the structure is slightly different than for the small model."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-99",
"text": "It is expected because there is a significant quality gap between these two models, so their intrinsic structure may be different."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-100",
"text": "Pruning -Large LSTM -Layer 2 Figure 4 : Gate structure for word-level LM for Pruning W+G+N for different model sizes."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-103",
"text": "Character-level convolutional networks for text classification."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-104",
"text": "In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-106",
"text": "**A TECHNICAL DETAILS ON THE IMPLEMENTATION OF THE IDEA IN PRUNING**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-107",
"text": "Consider a dataset of N sequences (x i , y i ) and a model p(y|x, W, b) defined by a recurrent neural network with weights W and biases b."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-108",
"text": "To implement our idea about three levels of sparsification, for each neuron \u03b7, we define five (intersecting) sets of weights w \u03b7,i , w \u03b7,f , w \u03b7,g , w \u03b7,o , w \u03b7,h ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-109",
"text": "The first four sets of weights correspond to four gates (dotted horizontal lines in fig. 2) , and the last set corresponds to the neuron (solid vertical lines in fig. 2 )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-110",
"text": "We apply group Lasso regularization [15] to these groups."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-111",
"text": "We also apply Lasso regularization to the individual weights."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-112",
"text": "Following [14] , we set to zero all the individual weights with absolute value less than the threshold."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-113",
"text": "If for some \u03b7 all the weights in w \u03b7,h are set to zero, we remove the corresponding neuron as it does not affect the network's output."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-114",
"text": "If for some gate (for example, f ) all the weights in w \u03b7,f are set to zero, we mark this gate as constant."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-115",
"text": "In contrast to our approach, in [14] , group Lasso is applied to larger groups w \u03b7 :"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-116",
"text": "They eliminate a neuron \u03b7 if all the weights in w \u03b7 are zero."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-117",
"text": "This approach does not lead to the sparse gate structure."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-118",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-119",
"text": "**B TECHNICAL DETAILS ON THE IMPLEMENTATION OF THE IDEA IN BAYESIAN FRAMEWORK**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-120",
"text": "Sparse variational dropout."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-121",
"text": "Our approach relies on Sparse variational dropout [10] (SparseVD)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-122",
"text": "This model treats the weights of the neural network as random variables and comprises a log-uniform prior over the weights: p(|w ij |) \u221d 1 |wij | and a fully factorized normal approximate posterior over the weights: q(w ij ) = N (w ij |m ij , \u03c3 2 ij )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-123",
"text": "Biases are treated as deterministic parameters."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-124",
"text": "To find the parameters of the approximate posterior distribution and biases, the evidence lower bound (ELBO) is optimized:"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-125",
"text": "Because of the log-uniform prior, for the majority of weights, the signal-to-noise ratio m 2 ij /\u03c3 2 ij \u2192 0 and these weights do not affect the network's output."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-126",
"text": "In [1] , SparseVD is adapted to the RNNs."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-127",
"text": "Training our model."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-152",
"text": "For the two-level sparsification (W+N), we use Lasso regularization with \u03bb = 1e \u2212 5 and group Lasso regularization with \u03bb = 0.0015."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-128",
"text": "We work with the group weights z in the same way as with the weights W : we approximate the posterior with the fully factorized normal distribution given the fully factorized log-uniform prior distribution."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-129",
"text": "To estimate the expectation in (1), we sample weights from the approximate posterior distribution in the same way as in [1] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-130",
"text": "With the integral estimated with one Monte-Carlo sample, the first term in (1) becomes the usual loss function (for example, cross-entropy in language modeling)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-131",
"text": "The second term is a regularizer depending on the parameters \u00b5 and \u03c3 (for the exact formula, see [10] )."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-132",
"text": "After learning, we zero out all the weights and the group weights with the signal-to-noise ratio less than 0.05."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-133",
"text": "At the testing stage, we use the mean values of all the weights and the group weights."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-135",
"text": "**C EXPERIMENTAL SETUP**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-136",
"text": "Datasets."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-137",
"text": "To evaluate our approach on the text classification task, we use two standard datasets: IMDb dataset [6] for binary classification and AGNews dataset [17] for four-class classification."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-138",
"text": "We set aside 15% and 5% of the training data for validation purposes respectively."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-139",
"text": "For both datasets, we use a vocabulary of 20,000 most frequent words."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-140",
"text": "To evaluate our approach on the language modeling task, we use the Penn Treebank corpus [7] with the train/valid/test partition from [8] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-141",
"text": "The dataset has a vocabulary of 50 characters or 10,000 words."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-142",
"text": "Pruning."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-143",
"text": "All the small models including baseline are trained without dropout as in standard TensorFlow implementation."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-144",
"text": "We train them from scratch for 20 epochs with SGD with a decaying learning rate schedule: an initial learning rate is equal to 1, the learning rate starts to decay after the 4-th epoch, the learning rate decay is equal to 0.6."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-145",
"text": "For the two-level sparsification (W+N), we use Lasso regularization with \u03bb = 1e \u2212 5 and group Lasso regularization with \u03bb = 0.002."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-146",
"text": "For the three-level sparsification (W+G+N), we use Lasso regularization with \u03bb = 1e \u2212 5 and group Lasso regularization with \u03bb = 0.0017."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-147",
"text": "We use the threshold 1e \u2212 4 to prune the weights in both models during training."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-148",
"text": "All the large models including baseline are trained in the same setting as in [14] except for the group Lasso regularization because we change the weight groups."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-149",
"text": "We use the code provided by the authors."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-150",
"text": "Particularly, we use binary dropout [16] with the same dropout rates."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-151",
"text": "We train the models from scratch for 55 epochs with SGD with a decaying learning rate schedule: an initial learning rate is equal to 1, the learning rate decreases two times during training (after epochs 18 and 36), the learning rate decay is equal to 0.2 and 0.1 for two-and three-level sparsification correspondingly."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-153",
"text": "For the three-level sparsification (W+G+N), we use Lasso regularization with \u03bb = 1.5e \u2212 05 and group Lasso regularization with \u03bb = 0.00125."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-154",
"text": "We use the same threshold 1e \u2212 4 as in the small models."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-155",
"text": "Bayesian sparsification."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-156",
"text": "In all the Bayesian models, we sparsify the weight matrices of all layers."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-157",
"text": "Since in text classification tasks, usually only a small number of input words are important, we use additional multiplicative weights to sparsify the input vocabulary following Chirkova et al. [1] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-158",
"text": "For the networks with the embedding layer, in configurations W+N and W+G+N, we also sparsify the embedding components (by introducing group weights z x multiplied by x t .)"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-159",
"text": "We train our networks using Adam [4] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-160",
"text": "Baseline networks overfit for all our tasks, therefore, we present results for them with early stopping."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-161",
"text": "Models for the text classification and the character-level LM are trained in the same setting as in [1] (we used the code provided by the authors)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-162",
"text": "For the text classification tasks, we use a learning rate equal to 0.0005 and train Bayesian models for 800 / 150 epochs on IMDb / AGNews."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-163",
"text": "The embedding layer for IMDb / AGNews is initialized with word2vec [9] / GloVe [12] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-164",
"text": "For the language modeling tasks, we train Bayesian models for 250 / 50 epochs on character-level / word-level tasks using a learning rate of 0.002."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-165",
"text": "For all the weights that we sparsify, we initialize log \u03c3 with -3."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-166",
"text": "We eliminate weights with the signalto-noise ratio less than \u03c4 = 0.05."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-167",
"text": "To compute the number of the remaining neurons or non-constant gates, we use the corresponding rows/columns of W and the corresponding weights z if applicable."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-168",
"text": "[1] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-169",
"text": "For language modeling, we evaluate quality on validation and test sets."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-170",
"text": "Compression is equal to |W |/|W = 0|."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-171",
"text": "In the last columns, the numbers of the remaining hidden neurons and non-constant gates in the LSTM layers are reported."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-172",
"text": "----------------------------------"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-173",
"text": "**D EXPERIMENTS WITH UNSTRUCTURED BAYESIAN SPARSIFICATION**"
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-174",
"text": "In this section, we present experimental results for the unstructured Bayesian sparsification (configuration Bayes W)."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-175",
"text": "This configuration corresponds to a model of Chirkova et al. [1] ."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-176",
"text": "Table 2 shows quantitative results, and figure 5 shows the resulting gate structures for the IMDB classification task and the second LSTM layer of the word-level language modeling task."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-177",
"text": "Since Bayes W model does not comprise any group weights, the overall compression of the RNN is lower than for Bayes W+G+N (tab. 1), so there are more non-constant gates."
},
{
"sent_id": "bb5e6e32d7e507bc6d943719c02902-C001-178",
"text": "However, the patterns in gate structures are the same as in Bayes W+G+N gate structures ( fig. 1 ): for the IMDB classification, the model has a lot of constant output gates and non-constant information flow, for language modeling, the model has neurons with only non-constant output gates."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"bb5e6e32d7e507bc6d943719c02902-C001-14"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-126"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-175"
]
],
"cite_sentences": [
"bb5e6e32d7e507bc6d943719c02902-C001-14",
"bb5e6e32d7e507bc6d943719c02902-C001-126",
"bb5e6e32d7e507bc6d943719c02902-C001-175"
]
},
"@USE@": {
"gold_contexts": [
[
"bb5e6e32d7e507bc6d943719c02902-C001-17"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-22"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-47"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-56"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-58"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-129"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-157"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-161"
],
[
"bb5e6e32d7e507bc6d943719c02902-C001-175"
]
],
"cite_sentences": [
"bb5e6e32d7e507bc6d943719c02902-C001-17",
"bb5e6e32d7e507bc6d943719c02902-C001-22",
"bb5e6e32d7e507bc6d943719c02902-C001-47",
"bb5e6e32d7e507bc6d943719c02902-C001-56",
"bb5e6e32d7e507bc6d943719c02902-C001-58",
"bb5e6e32d7e507bc6d943719c02902-C001-129",
"bb5e6e32d7e507bc6d943719c02902-C001-157",
"bb5e6e32d7e507bc6d943719c02902-C001-161",
"bb5e6e32d7e507bc6d943719c02902-C001-175"
]
},
"@EXT@": {
"gold_contexts": [
[
"bb5e6e32d7e507bc6d943719c02902-C001-17"
]
],
"cite_sentences": [
"bb5e6e32d7e507bc6d943719c02902-C001-17"
]
}
}
},
"ABC_22253d7b7cd43697b99909e09e7ebb_8": {
"x": [
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-2",
"text": "Abstract."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-3",
"text": "We assign binary and ternary error-correcting codes to the data of syntactic structures of world languages and we study the distribution of code points in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-4",
"text": "We show that, while most codes populate the lower region approximating a superposition of Thomae functions, there is a substantial presence of codes above the Gilbert-Varshamov bound and even above the asymptotic bound and the Plotkin bound."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-5",
"text": "We investigate the dynamics induced on the space of code parameters by spin glass models of language change, and show that, in the presence of entailment relations between syntactic parameters the dynamics can sometimes improve the code."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-6",
"text": "For large sets of languages and syntactic data, one can gain information on the spin glass dynamics from the induced dynamics in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-52",
"text": "We refer to these as \"bilingual and trilingual syntactic codes\"."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-9",
"text": "This is a companion paper to [13] , where techniques from coding theory were proposed as a way to address quantitatively the distribution of syntactic features across a set (or family) of languages."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-10",
"text": "In this paper we perform a computational analysis, based on the syntactic structures of world languages recorded in the SSWL database, and on data of syntactic parameters collected by Longobardi and collaborators."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-11",
"text": "We first analyze all codes obtained from pairs and triples of languages in the SSWL database and we show that their code points tend to populate the lower region of the space of code parameters, approximating two scaled copies of the Thomae function."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-12",
"text": "Points that fall in that region behave essentially like random codes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-13",
"text": "We then analyze arbitrary subsets of languages and syntactic data from the SSWL database and we compute the density of the distribution of their code points, showing that, while most of them populate the lower region, there is a significant presence of codes above the Gilbert-Varshamov bound and even above the asymptotic bound and the Plotkin bound."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-14",
"text": "We consider spin glass models of language change and the induced dynamics on the space of code parameters, and show that, in the presence of entailment relations the dynamics can enter the region above the Gilbert-Varshamov bound."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-15",
"text": "We also show how the induced dynamics in the space of code parameters can be helpful in gaining information on the behavior of the spin glass model on large datasets of languages and parameters, where convergence becomes very slow and a direct analysis of the dynamics becomes computationally difficult."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-16",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-17",
"text": "**CODES AND CODE PARAMETERS FROM SYNTACTIC STRUCTURES**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-18",
"text": "In [13] a new coding theory approach to measuring entropy and complexity of a set of natural languages was proposed."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-19",
"text": "The idea is to associate to each language in the set a vector of binary variables that describe syntactic properties of the language."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-20",
"text": "The notion of encoding syntactic structures through a set of binary syntactic parameters is a crucial part of the Principles and Parameters program of Linguistics developed by Chomsky, [3] , [4] , see also [1] for an expository introduction."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-21",
"text": "Thus, given a set of languages L = { 1 , . . . , N } 1 and a set of n binary syntactic variables, whose values are known for all the N languages, one obtains a code C L in F n 2 consisting of N code words."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-22",
"text": "As argued in [13] , one can use the properties of the resulting codes C L , as an error correcting code, and its position in the space of code parameters to measure how syntactic features are distributed across the languages in the set."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-23",
"text": "Moreover, the position of the code C L in the space of code parameters, with respect to curves such as the Gilbert-Varshamov bound and the asymptotic bound provide a measure of entropy and of complexity of the set of languages L, which differs from measures of entropy/complexity for an individual language."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-24",
"text": "2.1."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-25",
"text": "Code parameters and bounds."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-26",
"text": "In the theory of error-correcting codes (see for instance [18] ), to a given code C \u2282 F n q , one assigns two code parameters: the transmission rate, or relative rate of the code, which measures how good the encoding procedure is, and which is given by the ratio"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-27",
"text": "where k = log 2 (#C) is the absolute rate of C, and the relative minimum distance of the code, which measures how good the decoding is, and which is given by the ratio"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-28",
"text": "where d H ( 1 , 2 ) denotes the Hamming distance between the binary strings that constitute the code words of 1 and 2 ,"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-29",
"text": ", with x i , y i \u2208 {0, 1}. Codes C that have both R(C) and \u03b4(C) as large as possible are optimal for error-correction, as they have simpler encoding and less error-prone decoding."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-53",
"text": "The computation grows rapidly much heavier for larger sets of languages."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-30",
"text": "In general, it is not possible to arbitrarily improve both parameters, hence the quality of a code C is estimated by the position of its code point (\u03b4(C), R(C)) in the space of code parameters of coordinates (\u03b4, R) inside the square [0, 1] \u00d7 [0, 1]."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-31",
"text": "Various bounds on code parameters have been studied, [8] , [18] , [20] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-32",
"text": "As discussed in [13] there are two bounds, that is, two curves in the space of code parameters, that have an especially interesting meaning: the Gilbert-Varshamov curve, which is related to the statistical behavior of random codes (see [2] , [5] ), and the asymptotic bound, whose existence was proved in [8] , further studied in [9] , [10] , [11] , [12] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-33",
"text": "The asymptotic bound separates the region where code points are dense and have infinite multiplicity from the region where they are sparse and with finite multiplicity, [9] , [12] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-34",
"text": "The Gilbert-Varshamov curve for a q-ary code has a simple form,"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-35",
"text": "the q-ary Shannon entropy."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-36",
"text": "With the asymptotic bound, however, the situation is much more complicated, as one does not have an explicit expression."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-37",
"text": "Indeed, the question of the computability of the asymptotic bound was posed in [9] , and addressed in [12] in terms of a relation to Kolmogorov complexity (which is not a computable function)."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-38",
"text": "Namely, it is shown in [12] that the asymptotic bound becomes computable, given an oracle that can order codes by increasing Kolmogorov complexity."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-39",
"text": "Even though one does not have an explicit expression for the asymptotic bound, several estimates on its location in the space of code paramaters are described in [18] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-40",
"text": "Thus, in practical cases, it will be possible to obtain sufficient conditions to check if a code point violates the asymptotic bound by using some of these estimates."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-41",
"text": "In particular, there is a relation between the asymptotic curve R = \u03b1 q (\u03b4) and the Gilbert-Varshamov bound,"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-42",
"text": "and a relation between the asymptotic bound and the Plotkin bound R \u2264 1 \u2212 \u03b4 q ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-43",
"text": "The Plotkin line lies above the Gilbert-Varshamov curve and the asymptotic bound satisfies \u03b1 q (\u03b4) \u2264 1 \u2212 \u03b4 q , with \u03b1 q (\u03b4) = 0 for (q \u2212 1)/q < \u03b4 \u2264 1."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-44",
"text": "We will be considering only binary codes, hence we have everywhere q = 2."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-45",
"text": "2.2. Bilingual and trilingual syntactic codes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-46",
"text": "In this companion paper to [13] , we carry out some analysis, based on the binary syntactic variables recorded in the SSWL database \"Syntactic Structures of World Languages\", [21] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-47",
"text": "We will not refer to these variables as \"syntactic parameters\", because it is understood among linguists that the binary variables recorded in the SSWL database do not correspond to \"syntactic parameters\" the sense of the Principles and Parameters program, for example because of conflation of deep and surface structure."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-48",
"text": "However, this database is useful because it contains a fairly large number of world languages (253) and of syntactic variables (115)."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-49",
"text": "From a computational perspective, one of the main problems in using SSWL data lies in the fact that not all 115 variables are mapped for all the 253 languages: indeed the languages are very non-uniformly mapped, with some (mostly Indo-European) languages mapped with 100% of the variables and others with only very few entries."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-50",
"text": "Thus, for the purpose of the computations in this paper, when comparing a set of different languages, we have only used those variables that are fully mapped, in the SSWL database, for all the languages in the given set."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-51",
"text": "We consider here all the pairs { 1 , 2 } of languages in the SSWL database, and all triples { 1 , 2 , 3 }, and the resulting codes C 1 , 2 and C 1 , 2 , 3 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-54",
"text": "However, these cases are enough to see some interesting results on how the corresponding code parameters are distributed in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-55",
"text": "As explained above, the length of the code words for these codes is not fixed: it is the largest number of syntactic binary variables in the SSWL list that are completely mapped for the languages in the set."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-56",
"text": "Thus, for a set"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-57",
"text": ". This is not a problem, since points in the space of code parameters correspond to codes of any arbitrary length."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-58",
"text": "For each code C 1 , 2 and C 1 , 2 , 3 we compute the corresponding code parameters"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-59",
"text": "and we plot them in the plane of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-60",
"text": "We then compare their position to different bounds in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-61",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-62",
"text": "**CODE PARAMETERS OF SYNTACTIC CODES**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-63",
"text": "The Python code is available at www.its.caltech.edu/\u223cmatilde/SSWLcodes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-64",
"text": "The code params.py script has utilities that will compute the code parameters for a given subset of the languages."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-65",
"text": "The plot for all the codes C 1 , 2 and C 1 , 2 , 3 of pairs and triples of languages in the SSWL database was generated with fixed sized subset.py."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-66",
"text": "The resulting plot is given in Figure 1 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-67",
"text": "Note that for all the codes in this set the absolute rate of the code is always either k(C 1 , 2 ) = 1 or k(C 1 , 2 , 3 ) = log 2 (3)."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-68",
"text": "3.1."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-69",
"text": "Thomae function and random parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-70",
"text": "The fractal pattern that one sees appearing in Figure 1 may seem an first surprising, but in fact it has a very simple explanation."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-71",
"text": "Recall that the Thomae function defined as"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-72",
"text": "A plot of the undergraph of the Thomae function is shown in Figure 2 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-73",
"text": "One can clearly see the similarity with Figure 1 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-74",
"text": "Indeed, the reason why the code parameters in Figure 1 approximate the Thomae function depends on the fact that we are fixing the absolute rate of the codes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-75",
"text": "We have code points"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-76",
"text": "Thus, since k is fixed to be either 1 or log 2 (3), we obtain two copies, scaled by the respective values of k, of the graph of which, when (d, n) = 1, agrees with the Thomae function, restricted to those values of d and n that occur in our set of codes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-77",
"text": "We see that we obtain several additional points in Figure 1 that lie in the undergraph of the Thomae function, which come from the cases with (d, n) = 1, where for d = ur and n = vr, in addition to the plot point ( We compare this behavior with the case of code parameters produced by randomly generated points."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-78",
"text": "A plot of 3-tuples of a set of randomly generated parameters is shown in Figure 3 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-79",
"text": "This was generated with random parameters.py."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-80",
"text": "We see that the lower region of Figure 1 behaves similarly to the case of randomly generated parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-81",
"text": "Indeed, the region in the space of code parameters that lies below the Gilbert-Varshamov line is typically the one that is populated by random codes, [2] , [5] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-83",
"text": "**GILBERT-VARSHAMOV AND PLOTKIN BOUND.**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-84",
"text": "We want to analyze the position, with respect to various bounds, of the code parameters of codes C L , for sets L of languages and their SSWL syntactic data."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-85",
"text": "In order to do that, since systematic computations for larger sets of languages become lengthy, we check the position of code points on sets of randomly selected languages and syntactic parameters in the SSWL database."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-86",
"text": "The script random subset.py takes the list of binary syntactic variables recorded in the SSWL database and selects a random choice of a subset of these binary variables."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-87",
"text": "Then it randomly selects a subset of languages, among those for which the selected set of parameters is completely mapped in the database."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-88",
"text": "In a typical plot obtained with this method, like the one shown in Figure 4 , we see curves that correspond to a superposition of several scaled Thomae functions, for varying values of k. A large part of the code points tends to cluster in the region below the GilbertVarshamov curve, similarly to what one would expect for random codes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-89",
"text": "However, there is typically a significant portion of the code points obtained in this way that lies above the Gilbert-Varshamov, populating the area between the Gilbert-Varshamov curve and the Plotkin line."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-90",
"text": "Since these points tend to cover the entire width of the region between these two curves, which contains the asymptotic bound, certainly part of them will lie above the asymptotic bound, in the region of the sporadic codes."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-91",
"text": "One usually sees an even smaller number of code points that lie above the Plotkin bound (hence certainly above the asymptotic bound)."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-92",
"text": "This confirms the observation made in [13] regarding code points of syntactic codes and their position in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-93",
"text": "Using randomized sets of SSWL parameters and corresponding sets of languages, we can also plot the density of code points in the various regions of the space of code parameters, as shown in Figure 5 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-95",
"text": "**DYNAMICS IN THE SPACE OF CODE PARAMETERS**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-96",
"text": "In [17] a dynamical model of language change was proposed, based on a spin glass model for syntactic parameters and language interactions."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-97",
"text": "It is a simple model based on a graph with languages at the vertices, represented by a vector of their syntactic parameters interpreted as spin variables, and strengths of interaction between languages along the edges, measured using data proportional to the amount of bilingualism."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-98",
"text": "In the case of syntactic parameters behaving as independent variables, in the low temperature regime (see [17] for a discussion of the interpretation of the temperature parameter in this model) the dynamics converges rapidly towards an equilibrium state where all the spin variables corresponding to a given syntactic feature for the various languages align to the value most prevalent in the initial configuration."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-99",
"text": "The SSWL database does not record relations between parameters, although it can be shown by other approaches that interesting relations are present, see [14] , [15] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-100",
"text": "Using syntactic data from [6] , [7] , which record explicit entailment relation between different parameter, it was shown in [17] , for small graph examples, that in the presence of relations the dynamics settles on equilibrium states that are not necessarily given by completely aligned spins."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-101",
"text": "4.1. Spin glass models for syntactic parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-102",
"text": "When we interpret the dynamics of the model considered in [17] in terms of codes and the space of code parameters, the initial datum of the set of languages L at the vertices of the graph, with its given list of syntactic binary variables, determines a code C L ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-103",
"text": "The absolute rate k = k(C L ) = log 2 (#L) and the number of syntactic features considered n = n(C L ) remain fixed along the dynamics, hence the dynamics moves the code points along the horizontal lines with fixed R-coordinate."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-104",
"text": "In the case of independent syntactic binary variables, the dynamics follows a gradient descent for an energy functional that is simply given by the Hamiltonian"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-105",
"text": "for a given syntactic variable x i , i = 1, . . . , n, where J , is the strength of the interaction along the edge connecting the vertices and and S x i is the \u00b11 values spin variable associated to the binary variable x i ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-106",
"text": "The minimum of the energy H x i for the single variable x i is achieved when S x i ( ) S x i ( ) = 1, that is, when |x i ( ) \u2212 x i ( )| = 0."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-107",
"text": "Thus, in this case where each syntactic variables runs as an independent Ising model, the minimum is achieved where"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-108",
"text": ", that is, when all the spins align."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-109",
"text": "In the presence of entailment relations between different syntactic variables, it was shown in [17] that the Hamiltonian should be modified by a term that introduces the relations as a Lagrange multiplier."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-110",
"text": "This alters the dynamics and the equilibrium state, depending on a parameter that measures how strongly enforced the relations are."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-111",
"text": "From the point of view of coding theory discussed here, it seems more reasonable to modify this dynamical system, so that it can be better described as a dynamics in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-112",
"text": "It is natural therefore to consider a similar setting, where we assign a given set L of languages to the vertices of a complete graph G L , with assigned energies J e = J , at the edges e \u2208 E(G L ) with \u2202e = { , }."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-113",
"text": "We denote by x( ) = (x j ( )) n j=1 the vector of binary variables that lists the n syntactic features of the language ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-114",
"text": "We consider these as maps x : L \u2192 {0, 1} n , or equivalently as points x \u2208 {0, 1} n L ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-115",
"text": "Consider an energy functional of the form"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-116",
"text": "for J , = J e > 0, where d H (x( ), x( )) is the Hamming distance,"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-117",
"text": "The corresponding partition function is given by"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-118",
"text": "At low temperature (large \u03b2), the partition function is concentrated around the minimum of H(x), that is, were all d H (x( ), x( )) = 0, hence where all the vectors x( ) \u2208 {0, 1} n agree."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-119",
"text": "Given an initial condition x 0 \u2208 {0, 1} n L and the datum (J e ) e\u2208E(G L ) of the strengths of the interaction energies along the edges, the same method used in [17] , based on the standard Metropolis-Hastings algorithm, can be used to study the dynamics in this setting, with a similar behavior."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-120",
"text": "In the space of code parameters, given the code point (\u03b4 0 , R 0 ) = (\u03b4(C(x 0 )), R(C(x 0 ))) associated to the initial condition x 0 , the dynamics moves the code point along the line with constant R = R 0 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-121",
"text": "As the dynamics approaches the minimum of the action, the code point enters the region below the Gilbert-Varshamov bound, as it moves towards smaller values of \u03b4."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-123",
"text": "**4.2.**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-124",
"text": "Dynamics in the presence of entailment relations."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-125",
"text": "It is more interesting to see what happens in the case of where the syntactic variables are not independent but involve entailment relations between syntactic parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-126",
"text": "To this purpose we need to use syntactic data from [6] , [7] , where relations between syntactic parameters are explicitly recorded."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-127",
"text": "A typical form of the relations described in [6] , [7] consists of two syntactic parameters x i and x j , where x i is unconstrained at has binary values x i \u2208 {0, 1}, while the values of x j are constrained by the value of x i , so that if x i = 1 x j can take any of two binary values, while if x i = 0 then x j becomes undefined."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-128",
"text": "We express this by considering x j as a ternary valued variables, x j \u2208 {\u22121, 0, +1} F 3 , where x j = \u00b11 stand for the ordinary binary values and x j = 0 signifies undefined."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-129",
"text": "We can then write the relation in the form of a function"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-130",
"text": "where the solutions to R ij (x) = 0 are precisely (x i = 1, x j = \u00b11) and (x i = 0, x j = 0)."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-131",
"text": "We introduce a parameter E ij \u2265 0 that measures how strongly the relation R ij is enforced."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-132",
"text": "The modified energy functional that accounts for the presence of a relation between the i-the and the j-th parameter is then of the form"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-133",
"text": "where we allow here for the possibility that the relation may be differently (more strongly or weakly) enforces for different \u2208 L. In practice, it will be convenient to assume that E ij ( ) = E ij is independent of \u2208 L. When considering all the possible relations between different parameters in the list, that are of the form R ij as above, we separate out the set {1, . . . , n} of all the syntactic parameters in the list in two sets, {1, . ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-134",
"text": ". , n} = B \u222a T , where B is the set of independent binary variables and T is the set of entailed ternary variables, and we write the energy functional as"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-135",
"text": "where"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-136",
"text": "and where E ij = 0 if there is no direct dependence of x j upon x i ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-137",
"text": "One can consider additional terms R i 1 ,...,ir.j (x) with i a \u2208 B, j \u2208 T and j \u2208 T of a similar form, when the ternary parameter x j is entailed by more than one binary parameter x i , and add them to the energy functional in a similar way, with entailment energies E i 1 ,...,ir,j \u2265 0."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-138",
"text": "4.3. Dynamics in the space of code parameters."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-139",
"text": "In these models, where part of the syntactic variables x i \u2208 B are seen as binary variables and part x j \u2208 T as ternary variables, for the purpose of coding theory, we consider the whole x = (x i ) n i=1 as a vector in F n 3 , in order to compute the code parameters of the resulting code C(L) \u2282 F n 3 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-140",
"text": "One can see already in a very simple example, and using the dynamical system in the form described in [17] , that the dynamics in the space of code parameters now does not need to move towards the \u03b4 = 0 line."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-141",
"text": "Consider the very small example, with just two entailed syntactic variables and four languages, discussed in [17] , where the chosen languages are L = { 1 , 2 , 3 , 4 } = {English, Welsh, Russian, Bulgarian} and the two syntactic parameters are {x 1 , x 2 } = {StrongDeixis, StrongAnaphoricity}. Since we have an entailment relation, the possible values of the variables x i are now ternary, x i ( ) \u2208 {0, \u22121, +1}, that is, we consider here codes C \u2282 F n 3 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-142",
"text": "In this example n = 2."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-143",
"text": "The initial condition x 0 is given by"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-144",
"text": "Note that, since we have two identical code words x 0 ( 1 ) = x 0 ( 3 ) in this initial condition, the parameter d(C L ) = 0, so the code point (\u03b4(C L ), R(C L )) = (0, log 3 (2))) already lies on the vertical line \u03b4 = 0."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-145",
"text": "We consider in this case the same dynamical system used in [17] to model the case with entailment, which is a modification of the Ising model to a coupling of an Ising and a Potts model with q = 3 at the vertices of the graph."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-146",
"text": "This dynamics, which depends on the temperature parameter T = 1/\u03b2 an on an auxiliary parameter E, the \"entailment energy\", that measures how strongly the entailment relation is enforced."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-147",
"text": "In the cases with high temperature and either high or low entailment energy, it is shown in [17] that one can have equilibrium states like"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-148",
"text": "for the high entailment energy case, or"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-149",
"text": "for the low entailment energy case."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-150",
"text": "In both of these cases, the minimum distance d = min = d H (x( ), x( )) = 1, hence \u03b4 = 1/2."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-151",
"text": "Thus, along the dynamics, the code point in the space of code parameters has moved away from the line \u03b4 = 0, along the line with constant R. The final code point with \u03b4 = 1/2 and R = log 3 (2) lies above the GV-curve R = 1 \u2212 H 3 (\u03b4)."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-152",
"text": "Thus, in this very simple example we have seen that the dynamics in the Figure 6 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-153",
"text": "Average magnetization for the spin glass model of [17] computed for the languages and parameters of [6] , in the cases with T = 10 and E = 0; T = 10 and E = 9000; T = 910 and E = 0."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-154",
"text": "case where syntactic parameters are not independent variables can in fact move the code toward a better code, passing from below to above the GV-bound."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-156",
"text": "**4.4.**"
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-157",
"text": "Simulations."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-158",
"text": "The example mentioned above is too simple and artificial to be significant, but we can analyze a more general situation, where we consider the full syntactic data of [6] , [7] , with all the entailment relations taken into account, and the same interaction energies along the edges as in [17] , taken from the data of [16] , which can be regarded as roughly proportional to a measure of the amount of bilingualism."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-159",
"text": "When we work with the full set of data from [6] , [7] , involving 63 parameters for 28 languages (from which we exclude those that do not occur in the [16] data), we see that the large size of the graph and the presence of many entailment relations render the dynamics Figure 8 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-160",
"text": "Dynamics in the space of code parameters: average distance."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-161",
"text": "a lot more complicated than the simple examples discussed in [17] ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-162",
"text": "Indeed for such a large graph the convergence of the dynamics becomes extremely slow, even in the low temperature case and even when entailment relations are switched off, as shown in the graph of the average magnetization in Figure 6 ."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-163",
"text": "Such a large system becomes computationally too heavy, and it is difficult to handle a sufficiently large iterations to get to see any convergence effect."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-164",
"text": "However, when one considers codes obtained by extracting arbitrary subsets of three languages from this set and follows them along the dynamics, computing the corresponding position in the space of code parameters, one sees that, in the case without entailment (E = 0) the average distance drops notably after enough iteration, as shown in Figure 8 indicating that the simulation might in fact converge, even though at the same state in the number of iteration the average magnetization is not settling yet."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-165",
"text": "In the case with entailment relations one should expect the convergence process to be even slower."
},
{
"sent_id": "22253d7b7cd43697b99909e09e7ebb-C001-166",
"text": "Moreover, as in the small example discussed above, the \u03b4 parameter may settle on a limit value different than zero, so the data of the simulation are less informative."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"22253d7b7cd43697b99909e09e7ebb-C001-96"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-98"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-100"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-109"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-119"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-141"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-140"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-147",
"22253d7b7cd43697b99909e09e7ebb-C001-148",
"22253d7b7cd43697b99909e09e7ebb-C001-149"
]
],
"cite_sentences": [
"22253d7b7cd43697b99909e09e7ebb-C001-96",
"22253d7b7cd43697b99909e09e7ebb-C001-98",
"22253d7b7cd43697b99909e09e7ebb-C001-100",
"22253d7b7cd43697b99909e09e7ebb-C001-109",
"22253d7b7cd43697b99909e09e7ebb-C001-119",
"22253d7b7cd43697b99909e09e7ebb-C001-141",
"22253d7b7cd43697b99909e09e7ebb-C001-140",
"22253d7b7cd43697b99909e09e7ebb-C001-147"
]
},
"@USE@": {
"gold_contexts": [
[
"22253d7b7cd43697b99909e09e7ebb-C001-102"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-119"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-145"
],
[
"22253d7b7cd43697b99909e09e7ebb-C001-158"
]
],
"cite_sentences": [
"22253d7b7cd43697b99909e09e7ebb-C001-102",
"22253d7b7cd43697b99909e09e7ebb-C001-119",
"22253d7b7cd43697b99909e09e7ebb-C001-145",
"22253d7b7cd43697b99909e09e7ebb-C001-158"
]
},
"@DIF@": {
"gold_contexts": [
[
"22253d7b7cd43697b99909e09e7ebb-C001-160",
"22253d7b7cd43697b99909e09e7ebb-C001-161"
]
],
"cite_sentences": [
"22253d7b7cd43697b99909e09e7ebb-C001-161"
]
}
}
},
"ABC_7adc4bb66b9173ccee2adc4b64c945_8": {
"x": [
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-2",
"text": "Negation words, such as no and not, play a fundamental role in modifying sentiment of textual expressions."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-3",
"text": "We will refer to a negation word as the negator and the text span within the scope of the negator as the argument."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-4",
"text": "Commonly used heuristics to estimate the sentiment of negated expressions rely simply on the sentiment of argument (and not on the negator or the argument itself)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-5",
"text": "We use a sentiment treebank to show that these existing heuristics are poor estimators of sentiment."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-6",
"text": "We then modify these heuristics to be dependent on the negators and show that this improves prediction."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-7",
"text": "Next, we evaluate a recently proposed composition model (Socher et al., 2013 ) that relies on both the negator and the argument."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-8",
"text": "This model learns the syntax and semantics of the negator's argument with a recursive neural network."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-9",
"text": "We show that this approach performs better than those mentioned above."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-10",
"text": "In addition, we explicitly incorporate the prior sentiment of the argument and observe that this information can help reduce fitting errors."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-13",
"text": "define negation to be \"a grammatical category that allows the changing of the truth value of a proposition\"."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-14",
"text": "Negation is often expressed through the use of negative signals or negators-words like isn't and never, and it can significantly affect the sentiment of its scope."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-15",
"text": "Understanding the impact of negation on sentiment is essential in automatic analysis of sentiment."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-16",
"text": "The literature contains interesting research attempting to model and understand the behavior (reviewed in Section 2)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-17",
"text": "For example, Figure 1 : Effect of a list of common negators in modifying sentiment values in Stanford Sentiment Treebank."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-18",
"text": "The x-axis is s( w), and y-axis is s(w n , w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-19",
"text": "Each dot in the figure corresponds to a text span being modified by (composed with) a negator in the treebank."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-20",
"text": "The red diagonal line corresponds to the sentiment-reversing hypothesis that simply reverses the sign of sentiment values."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-21",
"text": "a simple yet influential hypothesis posits that a negator reverses the sign of the sentiment value of the modified text (Polanyi and Zaenen, 2004; Kennedy and Inkpen, 2006) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-22",
"text": "The shifting hypothesis (Taboada et al., 2011) , however, assumes that negators change sentiment values by a constant amount."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-23",
"text": "In this paper, we refer to a negation word as the negator (e.g., isn't), a text span being modified by and composed with a negator as the argument (e.g., very good), and entire phrase (e.g., isn't very good) as the negated phrase."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-24",
"text": "The recently available Stanford Sentiment Treebank (Socher et al., 2013) renders manually annotated, real-valued sentiment scores for all phrases in parse trees."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-25",
"text": "This corpus provides us with the data to further understand the quantitative behavior of negators, as the effect of negators can now be studied with arguments of rich syntactic and semantic variety."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-26",
"text": "Figure 1 illustrates the effect of a common list of negators on sentiment as observed on the Stanford Sentiment Treebank."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-27",
"text": "1 Each dot in the figure corresponds to a negated phrase in the treebank."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-28",
"text": "The x-axis is the sentiment score of its argument s( w) and y-axis the sentiment score of the entire negated phrase s(w n , w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-29",
"text": "We can see that the reversing assumption (the red diagonal line) does capture some regularity of human perception, but rather roughly."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-30",
"text": "Moreover, the figure shows that same or similar s( w) scores (x-axis) can correspond to very different s(w n , w) scores (y-axis), which, to some degree, suggests the potentially complicated behavior of negators."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-31",
"text": "2 This paper describes a quantitative study of the effect of a list of frequent negators on sentiment."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-32",
"text": "We regard the negators' behavior as an underlying function embedded in annotated data; we aim to model this function from different aspects."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-33",
"text": "By examining sentiment compositions of negators and arguments, we model the quantitative behavior of negators in changing sentiment."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-34",
"text": "That is, given a negated phrase (e.g., isn't very good) and the sentiment score of its argument (e.g., s(\"very good \u2032\u2032 ) = 0.5), we focus on understanding the negator's quantitative behavior in yielding the sentiment score of the negated phrase s(\"isn \u2032 t very good \u2032\u2032 )."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-35",
"text": "We first evaluate the modeling capabilities of two influential heuristics and show that they capture only very limited regularity of negators' effect."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-36",
"text": "We then extend the models to be dependent on the negators and demonstrate that such a simple extension can significantly improve the performance of fitting to the human annotated data."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-37",
"text": "Next, we evaluate a recently proposed composition model (Socher, 2013) that relies on both the negator and the argument."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-38",
"text": "This model learns the syntax and semantics of the negator's argument with a recursive neural network."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-39",
"text": "This approach performs significantly better than those mentioned above."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-40",
"text": "In addition, we explicitly incorporate the prior sentiment of the argument and observe that this information helps reduce fitting errors."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-41",
"text": "1 The sentiment values have been linearly rescaled from the original range [0, 1] to [-0.5, 0.5] ; in the figure a negative or positive value corresponds to a negative or a positive sentiment respectively; zero means neutral."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-42",
"text": "The negator list will be discussed later in the paper."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-43",
"text": "2 Similar distribution is observed in other data such as Tweets (Kiritchenko et al., 2014) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-45",
"text": "**RELATED WORK**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-46",
"text": "Automatic sentiment analysis The expression of sentiment is an integral component of human language."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-47",
"text": "In written text, sentiment is conveyed with word senses and their composition, and in speech also via prosody such as pitch (Mairesse et al., 2012) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-48",
"text": "Early work on automatic sentiment analysis includes the widely cited work of (Hatzivassiloglou and McKeown, 1997; Pang et al., 2002; Turney, 2002 ), among others."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-49",
"text": "Since then, there has been an explosion of research addressing various aspects of the problem, including detecting subjectivity, rating and classifying sentiment, labeling sentiment-related semantic roles (e.g., target of sentiment), and visualizing sentiment (see surveys by Pang and Lee (2008) and Liu and Zhang (2012) )."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-50",
"text": "Negation modeling Negation is a general grammatical category pertaining to the changing of the truth values of propositions; negation modeling is not limited to sentiment."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-51",
"text": "For example, paraphrase and contradiction detection systems rely on detecting negated expressions and opposites (Harabagiu et al., 2006) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-52",
"text": "In general, a negated expression and the opposite of the expression may or may not convey the same meaning."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-53",
"text": "For example, not alive has the same meaning as dead, however, not tall does not always mean short."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-54",
"text": "Some automatic methods to detect opposites were proposed by Hatzivassiloglou and McKeown (1997) and Mohammad et al. (2013) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-55",
"text": "Negation modeling for sentiment An early yet influential reversing assumption conjectures that a negator reverses the sign of the sentiment value of the modified text (Polanyi and Zaenen, 2004; Kennedy and Inkpen, 2006) , e.g., from +0.5 to -0.5, or vice versa."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-56",
"text": "A different hypothesis, called the shifting hypothesis in this paper, assumes that negators change the sentiment values by a constant amount (Taboada et al., 2011; Liu and Seneff, 2009 )."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-57",
"text": "Other approaches to negation modeling have been discussed in (Jia et al., 2009; Wiegand et al., 2010; Lapponi et al., 2012; Benamara et al., 2012) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-58",
"text": "In the process of semantic composition, the effect of negators could depend on the syntax and semantics of the text spans they modify."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-59",
"text": "The approaches of modeling this include bag-of-wordbased models."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-60",
"text": "For example, in the work of (Kennedy and Inkpen, 2006 ), a feature not good will be created if the word good is encountered within a predefined range after a negator."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-61",
"text": "There exist different ways of incorporating more complicated syntactic and semantic information."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-62",
"text": "Much recent work considers sentiment analysis from a semantic-composition perspective (Moilanen and Pulman, 2007; Choi and Cardie, 2008; Socher et al., 2012; Socher et al., 2013) , which achieved the state-of-the-art performance."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-63",
"text": "Moilanen and Pulman (2007) used a collection of hand-written compositional rules to assign sentiment values to different granularities of text spans."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-64",
"text": "Choi and Cardie (2008) proposed a learning-based framework."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-65",
"text": "The more recent work of (Socher et al., 2012; Socher et al., 2013) proposed models based on recursive neural networks that do not rely on any heuristic rules."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-66",
"text": "Such models work in a bottom-up fashion over the parse tree of a sentence to infer the sentiment label of the sentence as a composition of the sentiment expressed by its constituting parts."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-67",
"text": "The approach leverages a principled method, the forward and backward propagation, to learn a vector representation to optimize the system performance."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-68",
"text": "In principle neural network is able to fit very complicated functions (Mitchell, 1997) , and in this paper, we adapt the state-of-the-art approach described in (Socher et al., 2013) to help understand the behavior of negators specifically."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-70",
"text": "**NEGATION MODELS BASED ON HEURISTICS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-71",
"text": "We begin with previously proposed methods that leverage heuristics to model the behavior of negators."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-72",
"text": "We then propose to extend them to consider lexical information of the negators themselves."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-74",
"text": "**NON-LEXICALIZED ASSUMPTIONS AND MODELING**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-75",
"text": "In previous research, some influential, widely adopted assumptions posit the effect of negators to be independent of both the specific negators and the semantics and syntax of the arguments."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-76",
"text": "In this paper, we call a model based on such assumptions a non-lexicalized model."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-77",
"text": "In general, we can simply define this category of models in Equation 1."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-78",
"text": "That is, the model parameters are only based on the sentiment value of the arguments."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-80",
"text": "**REVERSING HYPOTHESIS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-81",
"text": "A typical model falling into this category is the reversing hypothesis discussed in Section 2, where a negator simply reverses the sentiment score s( w) to be \u2212s( w); i.e., f (s( w)) = \u2212s( w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-83",
"text": "**SHIFTING HYPOTHESIS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-84",
"text": "Basic shifting Similarly, a shifting based model depends on s( w) only, which can be written as:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-85",
"text": "where sign(.) is the standard sign function which determines if the constant C should be added to or deducted from s(w n ): the constant is added to a negative s( w) but deducted from a positive one."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-86",
"text": "Polarity-based shifting As will be shown in our experiments, negators can have different shifting power when modifying a positive or a negative phrase."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-87",
"text": "Thus, we explore the use of two different constants for these two situations, i.e., f (s( w)) = s( w)\u2212sign(s( w)) * C(sign(s( w)))."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-88",
"text": "The constant C now can take one of two possible values."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-89",
"text": "We will show that this simple modification improves the fitting performance statistically significantly."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-90",
"text": "Note also that instead of determining these constants by human intuition, we use the training data to find the constants in all shifting-based models as well as for the parameters in other models."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-92",
"text": "**SIMPLE LEXICALIZED ASSUMPTIONS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-93",
"text": "The above negation hypotheses rely on s( w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-94",
"text": "As intuitively shown in Figure 1 , the capability of the non-lexicalized heuristics might be limited."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-95",
"text": "Further semantic or syntactic information from either the negators or the phrases they modify could be helpful."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-96",
"text": "The most straightforward way of expanding the non-lexicalized heuristics is probably to make the models to be dependent on the negators."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-97",
"text": "Negator-based shifting We can simply extend the basic shifting model above to consider the lexical information of negators: f (s( w)) = s( w) \u2212 sign(s( w)) * C(w n )."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-98",
"text": "That is, each negator has its own C. We call this model negator-based shifting."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-99",
"text": "We will show that this model also statistically significantly outperforms the basic shifting without overfitting, although the number of parameters have increased."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-100",
"text": "Combined shifting We further combine the negator-based shifting and polarity-based shift-"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-101",
"text": "This shifting model is based on negators and the polarity of the text they modify: constants can be different for each negator-polarity pair."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-102",
"text": "The number of parameters in this model is the multiplication of number of negators by two (the number of sentiment polarities)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-103",
"text": "This model further improves the fitting performance on the test data."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-105",
"text": "**SEMANTICS-ENRICHED MODELING**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-106",
"text": "Negators can interact with arguments in complex ways."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-107",
"text": "Figure 1 shows the distribution of the effect of negators on sentiment without considering further semantics of the arguments."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-108",
"text": "The question then is that whether and how much incorporating further syntax and semantic information can help better fit or predict the negation effect."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-109",
"text": "Above, we have considered the semantics of the negators."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-110",
"text": "Below, we further make the models to be dependent on the arguments."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-111",
"text": "This can be written as:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-112",
"text": "In the formula, r( w) is a certain type of representation for the argument w and it models the semantics or/and syntax of the argument."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-113",
"text": "There exist different ways of implementing r( w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-114",
"text": "We consider two models in this study: one drops s( w) in Equation 4 and directly models f (w n , r( w))."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-115",
"text": "That is, the non-uniform information shown in Figure 1 is not directly modeled."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-116",
"text": "The other takes into account s( w) too."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-117",
"text": "For the former, we adopt the recursive neural tensor network (RNTN) proposed recently by Socher et al. (2013) , which has showed to achieve the state-of-the-art performance in sentiment analysis."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-118",
"text": "For the latter, we propose a prior sentimentenriched tensor network (PSTN) to take into account the prior sentiment of the argument s( w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-120",
"text": "**RNTN: RECURSIVE NEURAL TENSOR NETWORK**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-121",
"text": "A recursive neural tensor network (RNTN) is a specific form of feed-forward neural network based on syntactic (phrasal-structure) parse tree to conduct compositional sentiment analysis."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-122",
"text": "For completeness, we briefly review it here."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-123",
"text": "More details can be found in (Socher et al., 2013) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-124",
"text": "As shown in the black portion of Figure 2 , each instance of RNTN corresponds to a binary parse tree of a given sentence."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-125",
"text": "Each node of the parse tree is a fixed-length vector that encodes compositional semantics and syntax, which can be used to predict the sentiment of this node."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-126",
"text": "The vector of a node, say p 2 in Figure 2 , is computed from the ddimensional vectors of its two children, namely a and p 1 (a, p 1 \u2208 R d\u00d71 ), with a non-linear function:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-127",
"text": "where, W \u2208 R d\u00d7(d+d) and V \u2208 R (d+d)\u00d7(d+d)\u00d7d are the matrix and tensor for the composition function."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-128",
"text": "A major difference of RNTN from the conventional recursive neural network (RRN) (Socher et al., 2012) is the use of the tensor V in order to directly capture the multiplicative interaction of two input vectors, although the matrix W implicitly captures the nonlinear interaction between the input vectors."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-129",
"text": "The training of RNTN uses conventional forward-backward propagation."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-131",
"text": "**PSTN: PRIOR SENTIMENT-ENRICHED TENSOR NETWORK**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-132",
"text": "The non-uniform distribution in Figure 1 has showed certain correlations between the sentiment values of s(w n , w) and s( w), and such information has been leveraged in the models discussed in Section 3."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-133",
"text": "We intend to devise a model that implements Equation 4."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-134",
"text": "It bridges the models discussed above, which use either s( w) or r( w)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-135",
"text": "We extend RNTN to directly consider the sentiment information of arguments."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-136",
"text": "Consider the node p 2 in Figure 2 ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-137",
"text": "When calculating its vector, we aim to directly engage the sentiment information of its right child, i.e., the argument."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-138",
"text": "To this end, we make use of the sentiment class information of p 1 , noted as p sen 1 ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-139",
"text": "As a result, the vector of p 2 is calculated as follows:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-140",
"text": "As shown in Equation 6, for the node vector p 1 \u2208 R d\u00d71 , we employ a matrix W sen \u2208 R d\u00d7(d+m) and a tensor V sen \u2208 R (d+m)\u00d7(d+m)\u00d7d , aiming to explicitly capture the interplay between the sentiment class of p 1 , denoted as p sen 1 (\u2208 R m\u00d71 ), and the negator a. Here, we assume the sentiment task has m classes."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-141",
"text": "Following the idea of Wilson et al. (2005) , we regard the sentiment of p 1 as a prior sentiment as it has not been affected by the specific context (negators), so we denote our method as prior sentiment-enriched tensor network (PSTN)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-142",
"text": "In Figure 2 , the red portion shows the added components of PSTN."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-143",
"text": "Note that depending on different purposes, p sen 1 can take the value of the automatically predicted sentiment distribution obtained in forward propagation, the gold sentiment annotation of node p 1 , or even other normalized prior sentiment value or confidence score from external sources (e.g., sentiment lexicons or external training data)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-144",
"text": "This is actually an interesting place to extend the current recursive neural network to consider extrinsic knowledge."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-145",
"text": "However, in our current study, we focus on exploring the behavior of negators."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-146",
"text": "As discussed above, we will use the human-annotated sentiment for the arguments, the same as in the models discussed in Section 3."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-147",
"text": "With the new matrix and tensor, we then have \u03b8 = (V, V sen , W, W sen , W label , L) as the PSTN model's parameters."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-148",
"text": "Here, L denotes the vector representations of the word dictionary."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-150",
"text": "**INFERENCE AND LEARNING**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-151",
"text": "Inference and learning in PSTN follow a forward-backward propagation process similar to that in (Socher et al., 2013) ; for completeness, we describe the details as follows."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-152",
"text": "To train the model, one first needs to calculate the predicted sentiment distribution for each node:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-153",
"text": "and then compute the posterior probability over the m labels:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-154",
"text": "During learning, following the method used by the RNTN model in (Socher et al., 2013) , PSTN also aims to minimize the cross-entropy error between the predicted distribution y i \u2208 R m\u00d71 at node i and the target distribution t i \u2208 R m\u00d71 at that node."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-155",
"text": "That is, the error for a sentence is calculated as:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-156",
"text": "where, \u03bb represents the regularization hyperparameters, and j \u2208 m denotes the j-th element of the multinomial target distribution."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-157",
"text": "To minimize E(\u03b8), the gradient of the objective function with respect to each of the parameters in \u03b8 is calculated efficiently via backpropagation through structure, as proposed by Goller and K\u00fcchler (1996) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-158",
"text": "Specifically, we first compute the prediction errors in all tree nodes bottom-up."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-159",
"text": "After this forward process, we then calculate the derivatives of the softmax classifiers at each node in the tree in a top-down fashion."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-160",
"text": "We will discuss the gradient computation for the V sen and W sen in detail next."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-161",
"text": "Note that the gradient calculations for the V, W, W label , L are the same as that of presented in (Socher et al., 2013) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-162",
"text": "In the backpropagation process of training, each node (except the root node) in the tree carries two kinds of errors: the local softmax error and the error passed down from its parent node."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-163",
"text": "During the derivative computation, the two errors are summed up as the complete incoming error for the node."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-164",
"text": "We denote the complete incoming error and the softmax error vector for node i as \u03b4 i,com \u2208 R d\u00d71 and \u03b4 i,s \u2208 R d\u00d71 , respectively."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-165",
"text": "With this notation, the error for the root node p 2 can be formulated as follows."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-166",
"text": "where \u2297 is the Hadamard product between the two vectors and f \u2032 is the element-wise derivative of f = tanh."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-167",
"text": "With the results from Equation 8, we then can calculate the derivatives for the W sen at node p 2 using the following equation:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-168",
"text": "Similarly, for the derivative of each slice k(k = 1, . . . , d) of the V sen tensor, we have the following:"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-169",
"text": "Next, we form the equations for computing the errors for the two children of node p 2 ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-170",
"text": "The difference between the error at p 2 and that at its two children is that the latter must also include the error message passed down from p 2 ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-171",
"text": "We denote the error passed down as \u03b4 p 2 ,down , where the left child and the right child of p 2 take the first and second halves of \u03b4 p 2 ,down , namely \u03b4 p 2 ,down [1 : d] and \u03b4 p 2 ,down [d + 1 : 2d], respectively."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-172",
"text": "Following this notation, we have the error message for the two children of p 2 , provided that we have the \u03b4 p 2 ,down :"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-173",
"text": "The incoming error message of node a can be calculated similarly."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-174",
"text": "Finally, we can finish the above equations with the following formula for computing \u03b4 p 2 ,down :"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-175",
"text": "After the models are trained, they are applied to predict the sentiment of the test data."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-176",
"text": "The original RNTN and the PSTN predict 5-class sentiment for each negated phrase; we map the output to real-valued scores based on the scale that Socher et al. (2013) used to map real-valued sentiment scores to sentiment categories."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-177",
"text": "Specifically, we conduct the mapping with the formula: p real i = y i \u00b7 [0.1 0.3 0.5 0.7 0.9]; i.e., we calculate the dot product of the posterior probability y i and the scaling vector."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-178",
"text": "For example, if y i = [0.5 0.5 0 0 0], meaning this phrase has a 0.5 probability to be in the first category (strong negative) and 0.5 for the second category (weak negative), the resulting p real i will be 0.2 (0.5*0.1+0.5*0.3)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-179",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-180",
"text": "**EXPERIMENT SET-UP**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-181",
"text": "Data As described earlier, the Stanford Sentiment Treebank (Socher et al., 2013) has manually annotated, real-valued sentiment values for all phrases in parse trees."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-182",
"text": "This provides us with the training and evaluation data to study the effect of negators with syntax and semantics of different complexity in a natural setting."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-183",
"text": "The data contain around 11,800 sentences from movie reviews that were originally collected by Pang and Lee (2005) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-184",
"text": "The sentences were parsed with the Stanford parser (Klein and Manning, 2003) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-185",
"text": "The phrases at all tree nodes were manually annotated with one of 25 sentiment values that uniformly span between the positive and negative poles."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-186",
"text": "The values are normalized to the range of [0, 1] ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-187",
"text": "In this paper, we use a list of most frequent negators that include the words not, no, never, and their combinations with auxiliaries (e.g., didn't)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-188",
"text": "We search these negators in the Stanford Sentiment Treebank and normalize the same negators to a single form; e.g., \"is n't\", \"isn't\", and \"is not\" are all normalized to \"is not\"."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-189",
"text": "Each occurrence of a negator and the phrase it is directly composed with in the treebank, i.e., w n , w , is considered a data point in our study."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-190",
"text": "In total, we collected 2,261 pairs, including 1,845 training and 416 test cases."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-191",
"text": "The split of training and test data is same as specified in (Socher et al., 2013) ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-192",
"text": "Evaluation metrics We use the mean absolute error (MAE) to evaluate the models, which measures the average absolute offset between the predicted sentiment values and the gold standard."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-193",
"text": "More specifically, MAE is calculated as: MAE = (1/N) \u2211 wn, w |\u015d(w n , w) \u2212 s(w n , w)|, where \u015d(w n , w) denotes the gold sentiment value and s(w n , w) the predicted one for the pair w n , w , and N is the total number of test instances."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-194",
"text": "Note that the mean square error (MSE) is another widely used measure for regression, but it is less intuitive for our task here."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-196",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-197",
"text": "Overall regression performance Table 1 shows the overall fitting performance of all models."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-198",
"text": "The first row of the table is a random baseline, which simply guesses the sentiment value for each test case randomly in the range [0, 1] ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-199",
"text": "The table shows that the basic reversing and shifting heuristics do capture negators' behavior to some degree, as their MAE scores are lower than that of the baseline."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-200",
"text": "Making the basic shifting model dependent on the negators (model 4) significantly reduces the prediction error compared with basic shifting (model 3)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-201",
"text": "The same is true for the polarity-based shifting (model 5), reflecting that the roles of negators are different when modifying positive and negative phrases."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-202",
"text": "Merging these two models yields additional improvement (model 6)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-203",
"text": "The model marked with a dagger \u2020 is significantly better than model (6), and the model marked with a double dagger \u2020\u2020 is significantly better than model (7)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-204",
"text": "One-tailed paired t-test with a 95% significance level is used here."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-205",
"text": "Furthermore, modeling the syntax and semantics with state-of-the-art recursive neural networks (models 7 and 8) dramatically improves the performance over model 6."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-206",
"text": "The PSTN model, which takes into account the human-annotated prior sentiment of arguments, performs the best."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-207",
"text": "This suggests that additional external knowledge, e.g., knowledge from human-built resources or knowledge automatically learned from other data (as in (Kiritchenko et al., 2014) ), including sentiment that cannot be inferred from constituent expressions, might be incorporated into current neural-network-based models as prior knowledge."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-208",
"text": "Note that the two neural network based models incorporate the syntax and semantics by representing each node with a vector."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-209",
"text": "One may consider that a straightforward way of considering the semantics of the modified phrases is simply memorizing them."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-210",
"text": "For example, if a phrase very good modified by a negator not appears in the training and test data, the system can simply memorize the sentiment score of not very good in training and use this score at testing."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-211",
"text": "When incorporating this memorizing strategy into model (6), we observed an MAE score of 0.1222."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-212",
"text": "It is not surprising that memorizing the phrases has some benefit, but such matching relies on exact reoccurrences of phrases."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-213",
"text": "Note that this is a special case of what the neural network based models can model."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-214",
"text": "Table 1 has demonstrated the benefit of discriminating negators."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-215",
"text": "To understand this further, we plot in Figure 3 the behavior of different negators: the x-axis is a subset of our negators and the y-axis denotes absolute shifting in sentiment values."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-216",
"text": "For example, we can see that the negator \"is never\" on average shifts the sentiment of the arguments by 0.26, which is a significant change considering the range of sentiment value is [0, 1]."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-217",
"text": "For each negator, a 95% confidence interval is shown by the boxes in the figure, which is calculated with the bootstrapping resampling method."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-218",
"text": "We can observe statistically significant differences of shifting abilities between many negator pairs such as that between \"is never\" and \"do not\" as well as between \"does not\" and \"can not\"."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-219",
"text": "In general, we argue that one should always consider modeling negators individually in a sentiment analysis system."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-220",
"text": "Alternatively, if the modeling has to be done in groups, one should consider clustering valence shifters by their shifting abilities in training or external data."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-221",
"text": "Figure 4 shows the shifting capacity of negators when they modify positive (blue boxes) or negative phrases (red boxes)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-222",
"text": "The figure includes the five most frequently used negators found in the sentiment treebank."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-223",
"text": "Four of them have significantly different shifting power when composed with positive or negative phrases, which can explain why the polarity-based shifting model achieves improvement over the basic shifting model."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-224",
"text": "Modeling syntax and semantics We have seen above that modeling syntax and semantics through state-of-the-art neural networks helps improve the fitting performance."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-225",
"text": "Below, we take a closer look at the fitting errors made at different depths of the sentiment treebank."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-226",
"text": "The depth here is defined as the longest distance between the root of a negator-phrase pair w n , w and its descendant leaves."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-227",
"text": "Negators appearing at deeper levels of the tree tend to have more complicated syntax and semantics."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-228",
"text": "In Figure 5 , the x-axis corresponds to different depths and the y-axis is the mean absolute error (MAE)."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-229",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-230",
"text": "**DISCRIMINATING NEGATORS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-231",
"text": "The figure shows that both RNTN and PSTN perform much better at all depths than model 6 in Table 1 ."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-232",
"text": "When the depths are within 4, the RNTN performs very well and the (human annotated) prior sentiment of arguments used in PSTN does not bring additional improvement over RNTN."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-233",
"text": "PSTN outperforms RNTN at greater depths, where the syntax and semantics are more complicated and harder to model."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-234",
"text": "The errors made by model 6 are bumpy, as the model considers no semantics and hence its errors do not depend on the depths."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-235",
"text": "On the other hand, the errors of RNTN and PSTN monotonically increase with depths, indicating the increase in the task difficulty."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-236",
"text": "----------------------------------"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-237",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-238",
"text": "Negation plays a fundamental role in modifying sentiment."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-239",
"text": "In the process of semantic composition, the impact of negators is complicated by the syntax and semantics of the text spans they modify."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-240",
"text": "This paper provides a comprehensive and quantitative study of the behavior of negators through a unified view of fitting human annotation."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-241",
"text": "We first measure the modeling capabilities of two influential heuristics on a sentiment treebank and find that they capture some effect of negation; however, extending these non-lexicalized models to be dependent on the negators improves the performance statistically significantly."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-242",
"text": "The detailed analysis reveals the differences in the behavior among negators, and we argue that they should always be modeled separately."
},
{
"sent_id": "7adc4bb66b9173ccee2adc4b64c945-C001-243",
"text": "We further make the models dependent on the text being modified by negators through the adaptation of a state-of-the-art recursive neural network to incorporate the syntax and semantics of the arguments; we find that this further reduces fitting errors."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7adc4bb66b9173ccee2adc4b64c945-C001-24"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-62"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-65"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-123"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-181"
]
],
"cite_sentences": [
"7adc4bb66b9173ccee2adc4b64c945-C001-24",
"7adc4bb66b9173ccee2adc4b64c945-C001-62",
"7adc4bb66b9173ccee2adc4b64c945-C001-65",
"7adc4bb66b9173ccee2adc4b64c945-C001-123",
"7adc4bb66b9173ccee2adc4b64c945-C001-181"
]
},
"@USE@": {
"gold_contexts": [
[
"7adc4bb66b9173ccee2adc4b64c945-C001-37"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-68"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-117"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-151"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-154"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-161"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-176"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-181",
"7adc4bb66b9173ccee2adc4b64c945-C001-182"
],
[
"7adc4bb66b9173ccee2adc4b64c945-C001-191"
]
],
"cite_sentences": [
"7adc4bb66b9173ccee2adc4b64c945-C001-37",
"7adc4bb66b9173ccee2adc4b64c945-C001-68",
"7adc4bb66b9173ccee2adc4b64c945-C001-117",
"7adc4bb66b9173ccee2adc4b64c945-C001-151",
"7adc4bb66b9173ccee2adc4b64c945-C001-154",
"7adc4bb66b9173ccee2adc4b64c945-C001-161",
"7adc4bb66b9173ccee2adc4b64c945-C001-176",
"7adc4bb66b9173ccee2adc4b64c945-C001-181",
"7adc4bb66b9173ccee2adc4b64c945-C001-191"
]
}
}
},
"ABC_d76946b009d67613326a8e7650ad36_8": {
"x": [
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-2",
"text": "Capturing interactions among multiple predicate-argument structures (PASs) is a crucial issue in the task of analyzing PAS in Japanese."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-3",
"text": "In this paper, we propose new Japanese PAS analysis models that integrate the label prediction information of arguments in multiple PASs by extending the input and last layers of a standard deep bidirectional recurrent neural network (bi-RNN) model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-4",
"text": "In these models, using the mechanisms of pooling and attention, we aim to directly capture the potential interactions among multiple PASs, without being disturbed by the word order and distance."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-5",
"text": "Our experiments show that the proposed models improve the prediction accuracy specifically for cases where the predicate and argument are in an indirect dependency relation and achieve a new state of the art in the overall F 1 on a standard benchmark corpus."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-8",
"text": "A predicate-argument structure (PAS) is a structure that represents the relationships between a predicate and its arguments."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-9",
"text": "Identifying PASs in Japanese text is a long-standing challenge chiefly due to the abundance of omitted (elliptical) arguments."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-10",
"text": "In the example in Figure 1 , the dative relation between answer and reporters is not explicitly indicated by the syntactic structure of the sentence."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-11",
"text": "We regard such arguments as elliptical and call those argument slots Zero cases."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-12",
"text": "25% of the obligatory arguments in Japanese newspaper articles are reported to be elliptical."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-13",
"text": "1 The accuracy of identifying the fillers of such Zero cases remains only around 50% in terms of F 1 even if the task is restricted to the identification of intra-sentential predicate-argument relations (Matsubayashi and Inui, 2017) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-14",
"text": "One promising approach for addressing this problem is to model argument sharing across multiple predicates (Iida et al., 2015; Ouchi et al., 2015; Ouchi et al., 2017) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-15",
"text": "In Figure 1 , for example, one can find very limited syntactic clues for predicting the long-distance dative relation between answer and reporters."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-16",
"text": "However, the relation must be easy to identify for human readers who know that the person who asks a question is likely to be answered; namely, the nominative argument of ask is likely to be shared with the dative argument of answer."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-17",
"text": "Capturing such inter-predicative dependencies has, therefore, been considered crucial for Japanese PAS analysis."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-18",
"text": "With this goal in mind, Iida et al. (2015) constructed a subject-shared predicate network with an accurate recognizer of subject-sharing relations and deterministically propagated the predicted subjects to the other predicates in the graph."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-19",
"text": "However, this method is applied only to subject sharing, so it cannot take into account the relationships among multiple argument labels."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-20",
"text": "More recently, as an end-to-end model considering multi-predicate dependencies, Ouchi et al. (2017) used Grid RNN to incorporate intermediate representations of the prediction for one predicate generated by an RNN layer into the inputs of the RNN layer for another predicate."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-21",
"text": "However, in this model, since the information of multiple predicates also propagates through the RNNs, the integration of the prediction information is influenced by word order and distance, which are not necessarily important for syntactic and semantic relations."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-22",
"text": "Consequently, there might be information loss caused by the surface distances of words, as previous work had pointed out for RNN language models (Linzen et al., 2016) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-23",
"text": "In this study, we propose new Japanese PAS analysis models that integrate the prediction information of arguments in multiple predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-24",
"text": "We extend a standard end-to-end style deep bi-RNN model (Figure 2a) and introduce components that consider the multiple predicate interactions into both the input and last layers (Figures 2b and 3) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-25",
"text": "In contrast to Grid RNN, our extended models stack the extra layers using pooling and attention mechanisms on top of a deep bi-RNN so that they can directly associate the label prediction information for a target (predicate, word) pair with the predictions for words strongly related to the target pair."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-26",
"text": "Through experiments, we show that the proposed models improve argument prediction accuracy, especially for the Zero cases, and achieve a new state-of-the-art performance in the overall F 1 on a standard benchmark corpus."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-28",
"text": "**TASK**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-29",
"text": "In this paper, we employ a task definition based on the NAIST Text Corpus (NTC) (Iida et al., 2017) , a commonly used benchmark corpus annotated with nominative (NOM), accusative (ACC), and dative (DAT) arguments for predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-30",
"text": "Given a tokenized sentence w = w 1 , ..., w n and its predicate positions p = p 1 , ..., p q , our task is to identify at most one head of the filler tokens for each argument slot of each predicate."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-31",
"text": "In this study, we follow the setting of Iida et al. (2015) , Ouchi et al. (2017) , and Matsubayashi and Inui (2017) , and focus only on analyzing arguments in a target sentence."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-32",
"text": "In addition, we exclude argument instances that are in the same bunsetsu, a base phrase unit in Japanese, as the target predicate, following Ouchi et al. (2017) , which we will compare with the results in experiments."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-33",
"text": "The semantic labels used in NTC may seem to be rather syntactic as they are named nominative, accusative, etc."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-34",
"text": "However, this annotation task markedly differs from shallow syntactic parsing and is, in fact, more like a semantic role labeling (SRL) task including implicit argument prediction."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-35",
"text": "First, the semantic labels in NTC generalize case alteration caused by voice alteration and thus represent semantic roles analogous to ARG0, ARG1, etc."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-36",
"text": "in the PropBank-style annotation (Palmer et al., 2005) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-37",
"text": "Second, in the corpus, when an argument is omitted (i.e., zero-anaphora), the antecedent is identified with an appropriate semantic role, which is a prominent problem in Japanese semantic analysis and is the primary target of this study."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-39",
"text": "**BASE MODEL**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-40",
"text": "Our proposed models extend end-to-end style SRL systems using deep bi-RNN (Zhou and Xu, 2015; He et al., 2017; Ouchi et al., 2017) to combine mechanisms that consider multiple predicate interactions."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-41",
"text": "Figure 2a shows the network of our base model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-42",
"text": "Formally, given a word sequence w = w 1 , ..., w n and a target predicate position p i in p, the model outputs a label probability for each word position:"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-43",
"text": "Here, c i,t \u2208 {NOM, ACC, DAT, NONE} represents the argument label of the word w t for the target predicate w p i ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-44",
"text": "The input layer creates a vector h 0 i,t \u2208 R dw+1 for each pair of a predicate w p i and a word w t by concatenating a word embedding e(w t ) \u2208 R dw and a binary value representing the target predicate position in a method similar to that of He et al. (2017) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-45",
"text": "The obtained vectors are then input into the deep bi-RNN, where the directions of the layers alternate (Zhou and Xu, 2015) :"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-46",
"text": "Here, h k i,t \u2208 R dr is the output of the k-th RNN layer for a pair (w p i , w t ), and r k is a function representing the k-th RNN layer."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-47",
"text": "We employ gated recurrent units (GRUs) (Cho et al., 2014) for the RNNs."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-48",
"text": "In addition, we use the residual connections (He et al., 2016) following Ouchi et al. (2017) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-49",
"text": "Then, a fourdimensional vector representing a probability p(c i,t |i, p, w) is obtained by applying a softmax layer to each output of the last RNN layer h K i,t ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-50",
"text": "For each argument label c of each predicate, we eventually select a word with the maximum probability that exceeds an output threshold \u03b8 c ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-52",
"text": "**PROPOSED MODELS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-53",
"text": "Our base model independently predicts the arguments of each predicate."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-54",
"text": "In order to capture dependencies between the arguments of multiple predicates, we apply two extensions to our base model: a multi-predicate input layer and three variants of interaction layers on top of the deep bi-RNNs."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-55",
"text": "Figures 2b and 3 show the network structures of the extended models."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-56",
"text": "In contrast to the Grid RNN model of Ouchi et al. (2017) , where the information of multiple predicates propagates through the RNNs, our interaction layers use pooling and attention mechanisms to directly associate the label prediction information for a target (predicate, word) pair with that for words strongly related to the target pair, without being disturbed by word order and distance."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-58",
"text": "**INTERACTION LAYERS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-59",
"text": "Pooling (POOL) Argument sharing across multiple predicates can be captured with both syntactic and semantic clues."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-60",
"text": "At the syntactic level, we want to capture tendencies that, for example, the subject of the predicate of a matrix clause is likely to fill argument slots of other predicates in the same sentence."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-61",
"text": "At the semantic level, we want to model semantic dependencies between neighboring events such as the person who asks a question is likely to be answered, as in Figure 1 ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-62",
"text": "Our proposal is to capture both types of clues by incorporating a max pooling layer on top of the base model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-63",
"text": "Specifically, as illustrated in Figure 3a , for each word w t , we integrate the intermediate representation of label prediction for each predicate h K i,t by applying max pooling to the vectors that represent pairs of prediction information for two predicates h K i,t and h K j,t (including the case i = j):"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-64",
"text": "In this equation, maxpool j (f i,j,t ) is an operation to extract the maximum value of each dimension in {f i,1,t , ..., f i,q,t }."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-65",
"text": "The newly obtained vector h i,t for w p i and w t is input into the softmax layer in the same manner as in the base model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-67",
"text": "**ATTENTION-THEN-POOLING (ATT-POOL)**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-68",
"text": "Besides the argument sharing across multiple predicates, we would also like to capture dependencies between different arguments of a single predicate (and potentially, arguments of multiple predicates)."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-69",
"text": "For example, syntactically, two distinct argument slots of a single predicate are unlikely to share the same filler."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-70",
"text": "Semantically, the subject of a predicate take is likely to be a person when its object is a bread, but is likely to be a company if the object is a new employee."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-71",
"text": "To capture such dependencies, we integrate the intermediate label prediction h K j,t of w t for an arbitrary predicate w p j (including the case i = j) into the prediction of w t for a target predicate w p i ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-72",
"text": "In the integration, we aim to weigh the prediction information for (w p j , w t ) based on its relatedness to the target pair (w p i , w t ) using the attention mechanism (Bahdanau et al., 2015) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-73",
"text": "As in Figure 3b , we calculate a weight a i,j,t (t ) \u2208 R for each of h K j,1 , ..., h K j,n on the basis of the prediction h K i,t for the target pair and we obtain a weighted sum of h K j,t as a summary of the argument information of w p j , which is expected to be useful for the label prediction of (w p i , w t ):"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-74",
"text": "The obtained h i,j,t are concatenated with the prediction for the target pair h K i,t and linearly transformed with the ReLU activation."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-75",
"text": "Max pooling is then applied to these vectors to combine the predictions for multiple predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-76",
"text": "Pooling-then-Self-Attention (POOL-SELFATT) The ATT-POOL model involves a high computational cost because it must compute nq 2 different attentions regarding the number of words n and the number of predicates q in a sentence."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-77",
"text": "Therefore, as illustrated in Figure 3c , in this model, we first apply the max pooling that we applied in the POOL model to reduce the sequences for which attentions must be computed by integrating the label predictions of w t for all the other predicates in advance."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-78",
"text": "Then, we combine the information in the obtained sequence m i,1 , ..., m i,n in a similar manner as in the ATT-POOL model using the attention mechanism, but this time, with self-attention, that is, computing the weights of the elements in the sequence based on the relatedness to the element inside the sequence."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-79",
"text": "Consequently, the number of attentions that must be computed is reduced to nq."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-80",
"text": "Self-Attention (SELFATT) To conduct ablation tests to assess the impact of the proposed extensions, we also implemented a model only with self-attention."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-81",
"text": "This model explicitly considers the relationships between arguments of a single predicate, but not arguments across multiple predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-83",
"text": "**MULTI-PREDICATE INPUT LAYER (MP)**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-84",
"text": "In addition, we add a simple but effective extension to the input layer."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-85",
"text": "As He et al. (2016) reported, the information of the target predicate w p i propagates to the intermediate prediction h K i,t of the candidate argument w t through the deep bi-RNN by just adding a binary value representing the predicate position."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-86",
"text": "Inspired by this finding, as shown in Figure 2b , in the input layer, we add another binary value that represents all the predicate positions to h 0 i,t , aiming to propagate multiple predicate information."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-88",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-89",
"text": "We evaluated the impacts of our extensions and compared their performances to those of previous studies."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-90",
"text": "Our main hypothesis is that the pooling and attention mechanisms are both useful for capturing different types of argument interactions as we explained in Section 4 and do work complementarily of each other to improve the prediction accuracy, especially for arguments in a long-distance dependency."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-112",
"text": "**RESULTS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-114",
"text": "**IMPACT OF EXTENSIONS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-115",
"text": "The first two sets of rows in Table 1 compare the impact of each component of our extension."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-116",
"text": "The effects of incorporating the interaction layer can be seen in the comparisons of the BASE model with the SELFATT, POOL, ATT-POOL, and POOL-SELFATT models."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-117",
"text": "Among the four proposed extensions, POOL-SELFATT, an integration of POOL and SELFATT, achieved the best performance (83.76 in F 1 ), gaining 0.37 points in overall F 1 from BASE."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-118",
"text": "Also, the significance tests in Table 2 show that the POOL and SELFATT models significantly outperform the BASE model, and the POOL-SELFATT model makes a further significant gain from the POOL and SELFATT models."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-119",
"text": "This indicates that POOL and SELFATT work complementarily with each other, and combining them makes a further improvement from each individual extension."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-120",
"text": "Recall that SELFATT is designed to capture long-distance dependencies over a single predicate-argument structure, whereas POOL is expected to capture argument sharing across multiple predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-121",
"text": "These results provide empirical support to the hypotheses behind our design of the interaction layer."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-122",
"text": "The MP model, where the input layer is extended to represent the positions of all the predicates in a sentence, significantly outperforms the BASE model by 0.28 points in overall F 1 ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-123",
"text": "This result suggests the importance of position information regarding the neighboring predicates in identifying the arguments of a given predicate."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-124",
"text": "Furthermore, the MP-POOL-SELFATT model, which is a combination of MP and POOL-SELFATT, resulted in a further 0.27-point improvement and consequently achieved the best overall Table 3 : F 1 scores of each argument label on the NTC 1.5 test set."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-125",
"text": "improves F 1 from BASE by 0.9-1.4 points consistently across all the distance categories other than Dep."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-127",
"text": "**COMPARISON TO RELATED WORK**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-128",
"text": "The third set of rows in Table 1 shows the reported performance of related studies."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-129",
"text": "Grid RNN of Ouchi et al. (2017) is a state-of-the-art end-to-end model, designed to capture interactions among multiple predicate-argument relations."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-130",
"text": "A comparison between their model and the proposed models was somewhat tricky because our replication of Grid RNN did not reproduce the reported performance on the same dataset (see the row of GRID in Table 1 )."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-131",
"text": "Unlike the results reported in Ouchi et al. (2017) , the GRID model in our experiment did not clearly outperform the model without the grid architecture, i.e., the Base model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-132",
"text": "We first suspected that this might have resulted from the difference in dimensionality d r of RNN hidden states: d r = 32 in Ouchi et al. (2017) , whereas d r = 256 in our experiments."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-133",
"text": "Specifically, we speculated that the base model with a low dimensionality left a larger margin for improvement and incorporating the Grid architecture derived positive effects."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-134",
"text": "We thus trained our GRID model with Ouchi et al. (2017) 's settings (d r = 32 and K = 8) and the best performing hyperparameters; however, we were not able to reproduce the reported gain from Grid RNN (see the row of \"GRID (d r = 32, K = 8)\" in Table 1 )."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-135",
"text": "2 This might be an indication of the difficulty in capturing multi-predicate interactions by threading deep bi-RNNs with RNNs, as we discussed in Section 1."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-136",
"text": "Another previous state-of-the-art model was proposed by Matsubayashi and Inui (2017) (M&I 2017) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-137",
"text": "This model extends a feedforward NN with dependency path embeddings and other new features to capture long-distance dependencies in a single PAS."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-138",
"text": "The row \"M&I 2017\" in Table 1 shows the reported performance of their model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-139",
"text": "3 The performance of M&I 2017 is comparable with the performance of our SELFATT model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-140",
"text": "This result provides another piece of empirical evidence that the self-attention mechanism has a comparably positive effect in incorporating dependency path information for capturing long-distance dependencies in a single PAS."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-141",
"text": "Overall, the proposed methods of using the pooling and attention mechanisms for capturing interactions across predicates and arguments gained considerable improvement and achieved state-of-the-art accuracy, significantly outperforming the previous state-of-the-art models."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-142",
"text": "The last set of rows in Table 1 shows the results of the ensemble models."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-143",
"text": "A model that predicts arguments with the average score of the 10 MP-POOL-ATT models further improves the overall F 1 by 1.4 points from that of a single model, achieving state-of-the-art accuracy for NTC 1.5."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-144",
"text": "Table 3 shows the F 1 score for each case label."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-145",
"text": "In a comparison of the single models, although our MP-POOL-ATT model slightly degrades the scores of NOM and ACC on the Dep cases compared to the state-of-the-art model (M&I 2017) , it greatly improves the scores for DAT and the Zero cases."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-146",
"text": "Regarding the ensemble models, MP-POOL-ATT improves the scores for all cases."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-147",
"text": "Iida et al. (2015) and Iida et al. (2016) report Japanese subject anaphora resolution systems, designed to predict only Zero NOM arguments."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-148",
"text": "It is not straightforward to directly compare their results with ours due to the differences in the experimental settings."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-149",
"text": "However, our best performing model outperforms the Figure 4: Examples of prediction errors."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-150",
"text": "In Example (1), only SELFATT failed to predict the answer."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-151",
"text": "In Example (2), only MP-POOL-SELFATT correctly predicted the answer."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-152",
"text": "In Examples (3) and (4), none of the systems predict the answers correctly."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-153",
"text": "model of Ouchi et al. (2015) , which is then reported to outperform both Iida et al. (2015) and Iida et al. (2016) in their experimental settings."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-155",
"text": "**DETAILED ANALYSIS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-156",
"text": "To analyze the behavior of our proposed models in detail, we show some prediction examples of the SELFATT, MP-SELFATT, and MP-POOL-ATT models in the development set with the weights in the attention layers in Figures 4-7 ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-157",
"text": "In Figure 4 , Examples (1) and (2) are the instances for which only SELFATT failed to predict the answer and for which only MP-POOL-SELFATT correctly predicted the answer, respectively."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-158",
"text": "For these examples, the weights in the attention layers behave similarly."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-159",
"text": "Figure 5 shows the weights for Example (1)."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-160",
"text": "In this sentence, the correct NOM of stretch, Tanigawa, is also NOM of respond, which is relatively easy to predict."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-161",
"text": "SELFATT, which is designed to capture dependencies over a single predicate-argument structure, failed to predict NOM of stretch most likely because the answer Tanigawa is distant from the target predicate with its limited syntactic clues."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-162",
"text": "Conversely, MP-POOL-SELFATT and MP-SELFATT successfully predicted the answer by taking the answer token Tanigawa into account when computing the score of the counter candidate thinking."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-163",
"text": "MP-SELFATT, the model that incorporates the other predicate positions into SELFATT, significantly increases the weight for the answer token."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-164",
"text": "MP-POOL-SELFATT, which explicitly integrates the predictions for the other predicates, further increases the weight for the answer token."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-165",
"text": "This example demonstrates that the proposed extensions successfully predict a correct argument by considering the relation to the argument in another predicate where the syntactic relation between the predicate and argument is much clearer and thus the argument relation is relatively easy to predict."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-166",
"text": "Due to space limitations, we cannot show the weights for Example (2), but the same also holds for that example."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-167",
"text": "MP-POOL-SELFATT focuses on professors, which is the \"easy-to-predict\" NOM argument of dive, when the model computes the scores of this token for take and, consequently, support."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-168",
"text": "SELFATT and MP-SELFATT assign smaller weights to that token for take and even smaller weights for support, which is far from the answer token."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-169",
"text": "Examples (3) and (4) are the instances where all the three models failed to predict the answers."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-170",
"text": "Figure 6 illustrates the attention weights in MP-POOL-SELFATT for Example (3)."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-171",
"text": "To solve this example, the model is expected to understand that NOM of accept should be the same as the persons who received the order from the ministries."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-172",
"text": "However, MP-POOL-SELFATT could not acquire this kind of dialog-level knowledge and pays little attention to the correct argument staff when the model computes the score of the wrong answer ministries for NOM of accept."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-173",
"text": "In Example (4), NOM of the nominal predicate demonstration can be a clue for predicting NOM of perform."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-174",
"text": "However, the models currently do not predict the arguments of nominal predicates and therefore cannot capture the relationships between these two sufficiently ( Figure 7 )."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-175",
"text": "This example suggests one of our future directions: the joint prediction of verbal and nominal predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-176",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-177",
"text": "**RELATED WORK**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-178",
"text": "End-to-End Models in SRL End-to-end approaches to SRL have been widely explored recently, and many state-of-the-art results have been achieved (Zhou and Xu, 2015; He et al., 2017; Marcheggiani and Titov, 2017; Tan et al., 2018) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-179",
"text": "Following these advanced models, we adopted a stacked bi-RNN as our base model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-180",
"text": "Methods for Dealing with Long-Distance Dependencies in End-to-End Models In SRL studies, Marcheggiani and Titov (2017) proposed a variant of deep bi-RNN models that connects the intermediate representations of the predictions for the words in syntactic dependency relations on top of the deep RNN."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-181",
"text": "Very recently, aiming to directly connect the related words, Tan et al. (2018) stacked self-attention layers, each of which followed a feedforward layer, in a manner similar to the method of Vaswani et al. (2017) , which was originally applied to an encoder-decoder model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-182",
"text": "Self-attention has been successfully applied to several NLP tasks, including textual entailment, sentiment analysis, summarization, machine translation, and language understanding (Paulus et al., 2017; Shen et al., 2018; Lin et al., 2017; Vaswani et al., 2017) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-183",
"text": "Techniques using pooling have been applied to merge intermediate expressions in predictions in the tasks where related tokens are often at long distance such as coreference resolution and machine reading (Clark and Manning, 2016; Kobayashi et al., 2016) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-184",
"text": "One major contribution of this study is its novel idea of using these techniques for capturing long-distance dependencies for modeling interactions among multiple predicate-argument relations."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-185",
"text": "Approaches to Capturing Multi-Predicate Interactions For Japanese, Ouchi et al. (2015) jointly identified arguments of multiple predicates by modeling argument interactions with a bipartite graph."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-186",
"text": "Iida et al. (2015) constructed a subject-shared predicate network and deterministically propagated the predicted subjects to other predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-187",
"text": "Shibata et al. (2016) adapted a NN framework to Ouchi et al. (2015) 's model using a feedforward network."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-188",
"text": "For an end-to-end neural model, Ouchi et al. (2017) used a Grid RNN to capture multiple predicate interactions."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-189",
"text": "Through experiments, we demonstrated that our proposed models outperformed these models in terms of the overall F 1 on a standard benchmark corpus."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-190",
"text": "4 To the best of our knowledge, there are few previous studies related to SRL considering multiple predicate interactions for languages other than Japanese."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-191",
"text": "Yang and Zong (2014) performed a discriminative reranking in the role classification of shared arguments."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-192",
"text": "Lei et al. (2015) proposed an SRL model based on the dimensionality reduction on a tensor representation to capture meaningful interactions between the argument, predicate, corresponding features, and role label."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-193",
"text": "It is not straightforward to compare these methods with our models; however, it is an intriguing future issue to consider how well the techniques devised for Japanese PAS analysis work for other languages."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-194",
"text": "Other Approaches to Argument Omission In order to perform robust prediction for arguments with fewer syntactic clues, several previous studies have explored various types of selectional preference scores that consider the semantic relations between a predicate and its arguments (Iida et al., 2007; Imamura et al., 2009; Sasano and Kurohashi, 2011; Shibata et al., 2016) ."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-195",
"text": "This direction of research is orthogonal to our approach, suggesting that the models could be further improved by being combined with these extra features."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-196",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-197",
"text": "**CONCLUSION**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-198",
"text": "In this study, we have proposed new Japanese PAS analysis models that integrate prediction information of arguments in multiple predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-199",
"text": "We extended the end-to-end style model using a deep bi-RNN and introduced the components that consider the multiple predicate interactions into the input and last layers."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-200",
"text": "As a result, we achieved a new state-of-the-art accuracy on the standard benchmark data."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-201",
"text": "Our detailed analysis showed that the proposed models successfully predict the correct arguments by using the information of the \"easy-to-predict\" arguments in other predicates."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-202",
"text": "In addition, the error analysis suggests that jointly predicting the arguments of verbal and nominal predicates may further improve the performance."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-203",
"text": "An intriguing issue we plan to address next is how to extend the proposed interaction layer to cross-sentential interactions of PASs."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-92",
"text": "**SETTINGS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-94",
"text": "**DATASET AND IMPLEMENTATION DETAILS**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-95",
"text": "The experiments were performed on NTC 1.5."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-96",
"text": "We divided the corpus into the commonly used divisions of training, development, and test sets (Taira et al., 2008) , each of which includes 24,283, 4,833, and 9,284 sentences, respectively."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-97",
"text": "NTC represents each argument of a predicate by indicating a coreference cluster in a text."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-98",
"text": "For each given predicate-argument slot, we count a system's output as correct if the output token is included in the coreference cluster corresponding to the slot fillers."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-99",
"text": "The evaluation is performed on the basis of the precision, recall, and F 1 score."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-100",
"text": "The hyperparameters were selected to obtain a maximum F 1 on the development set."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-101",
"text": "The details of the hyperparameter selection and preprocessing are described in the supplemental material."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-102",
"text": "In the following experiments, we train each model 10 times with the same training data and hyperparameters and then show the average scores."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-104",
"text": "**GRID RNN BASELINE (GRID)**"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-105",
"text": "In order to strictly compare the impact of our extensions to the method used for integrating multiple pieces of predicate information in the state-of-the-art end-to-end model, in addition to our base model, we replicated the method of Ouchi et al. (2017) by modifying Equation (1) of our base model as follows:"
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-106",
"text": "if 1 \u2264 i \u2264 q; otherwise, h^k_{i,t} = 0."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-107",
"text": "The performance of this replicated model may not be strictly the same as that reported in Ouchi et al. (2017) due to discrepancies in the embeddings of inputs, hyperparameters (a training batch size, a hidden unit size, etc.), and training strategy (an optimizing algorithm, a regularization method, an early stopping method, etc.)."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-108",
"text": "The predicate positions p = p_1, ..., p_q are arranged in ascending order."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-109",
"text": "Table 2 : p-values in a one-sided permutation test using 10 overall F1 scores for each model."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-110",
"text": "The bold values indicate that the average F1 score of model A outperforms that of model B at the 5% significance level."
},
{
"sent_id": "d76946b009d67613326a8e7650ad36-C001-111",
"text": "----------------------------------"
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d76946b009d67613326a8e7650ad36-C001-14"
],
[
"d76946b009d67613326a8e7650ad36-C001-20"
],
[
"d76946b009d67613326a8e7650ad36-C001-129"
],
[
"d76946b009d67613326a8e7650ad36-C001-188"
]
],
"cite_sentences": [
"d76946b009d67613326a8e7650ad36-C001-14",
"d76946b009d67613326a8e7650ad36-C001-20",
"d76946b009d67613326a8e7650ad36-C001-129",
"d76946b009d67613326a8e7650ad36-C001-188"
]
},
"@MOT@": {
"gold_contexts": [
[
"d76946b009d67613326a8e7650ad36-C001-14",
"d76946b009d67613326a8e7650ad36-C001-16",
"d76946b009d67613326a8e7650ad36-C001-9"
],
[
"d76946b009d67613326a8e7650ad36-C001-20",
"d76946b009d67613326a8e7650ad36-C001-21"
]
],
"cite_sentences": [
"d76946b009d67613326a8e7650ad36-C001-14",
"d76946b009d67613326a8e7650ad36-C001-20"
]
},
"@USE@": {
"gold_contexts": [
[
"d76946b009d67613326a8e7650ad36-C001-31"
],
[
"d76946b009d67613326a8e7650ad36-C001-32"
],
[
"d76946b009d67613326a8e7650ad36-C001-40"
],
[
"d76946b009d67613326a8e7650ad36-C001-48"
],
[
"d76946b009d67613326a8e7650ad36-C001-105"
],
[
"d76946b009d67613326a8e7650ad36-C001-134"
]
],
"cite_sentences": [
"d76946b009d67613326a8e7650ad36-C001-31",
"d76946b009d67613326a8e7650ad36-C001-32",
"d76946b009d67613326a8e7650ad36-C001-40",
"d76946b009d67613326a8e7650ad36-C001-48",
"d76946b009d67613326a8e7650ad36-C001-105",
"d76946b009d67613326a8e7650ad36-C001-134"
]
},
"@EXT@": {
"gold_contexts": [
[
"d76946b009d67613326a8e7650ad36-C001-40"
]
],
"cite_sentences": [
"d76946b009d67613326a8e7650ad36-C001-40"
]
},
"@DIF@": {
"gold_contexts": [
[
"d76946b009d67613326a8e7650ad36-C001-56"
],
[
"d76946b009d67613326a8e7650ad36-C001-107"
],
[
"d76946b009d67613326a8e7650ad36-C001-131"
],
[
"d76946b009d67613326a8e7650ad36-C001-132"
],
[
"d76946b009d67613326a8e7650ad36-C001-134"
],
[
"d76946b009d67613326a8e7650ad36-C001-188",
"d76946b009d67613326a8e7650ad36-C001-189"
]
],
"cite_sentences": [
"d76946b009d67613326a8e7650ad36-C001-56",
"d76946b009d67613326a8e7650ad36-C001-107",
"d76946b009d67613326a8e7650ad36-C001-131",
"d76946b009d67613326a8e7650ad36-C001-132",
"d76946b009d67613326a8e7650ad36-C001-134",
"d76946b009d67613326a8e7650ad36-C001-188"
]
}
}
},
"ABC_2cedb1a0f0c0fbb9bd95d5b54e4967_8": {
"x": [
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-130",
"text": "This type of error also happens when a person uses an expression but with another meaning behind it."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-2",
"text": "This paper analyzes challenges in cloze-style reading comprehension on multiparty dialogue and suggests two new tasks for more comprehensive predictions of personal entities in daily conversations."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-3",
"text": "We first demonstrate that there are substantial limitations to the evaluation methods of previous work, namely that randomized assignment of samples to training and test data substantially decreases the complexity of cloze-style reading comprehension."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-131",
"text": "The model cannot find the right position of the corresponding entities."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-132",
"text": "Utterances Reasoning and Summary Utterances reasoning and summary is another frequent error type."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-133",
"text": "(Table 5 : Percentage of error types occurring in each task.) This type of error occurs when the model needs to predict an entity based on several actions in the query which correspond to several continuous utterances in the dialogue."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-134",
"text": "We noticed that the model also needs to infer the result or effect of continuous or noncontinuous actions happening in several utterances to predict correct entities."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-4",
"text": "According to our analysis, replacing the random data split with a chronological data split reduces test accuracy on the previous single-variable passage completion task from 72% to 34%, which leaves much more room for improvement."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-5",
"text": "Our proposed tasks extend the previous single-variable passage completion task by replacing more character mentions with variables."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-6",
"text": "Several deep learning models are developed to validate these three tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-7",
"text": "A thorough error analysis is provided to understand the challenges and guide the future direction of this research."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-10",
"text": "Reading comprehension is an intriguing task that assesses a machine's ability in understanding evidence contexts through question answering."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-11",
"text": "Most previous work in reading comprehension has focused on either formal documents Rajpurkar et al. [2016] or children's stories Richardson, Burges, and Renshaw [2013] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-12",
"text": "Only a few approaches have attempted comprehension on multiparty dialogue Ma, Jurczyk, and Choi [2018] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-13",
"text": "However, with the explosive expansion of social media, data on dialogue has become dominant on the web."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-14",
"text": "Inspired by various options of analytic models and the potential of the dialogue processing market, we extend the corpus presented by Ma, Jurczyk, and Choi [2018] for comprehensive predictions of personal entities in multiparty dialogue and develop deep learning models to make robust inference on their contexts."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-15",
"text": "Passage completion on multiparty dialogue is one of the reading comprehension tasks that requires a model to match the conversational dialogues with the formal (passage) writings."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-16",
"text": "By building a robust model for this task, people can tell the status of their favorite characters by checking synopsis to see if their favorite characters appear in specific episodes without watching the entire series, thus allowing them an efficient way to decide whether to watch a particular episode."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-17",
"text": "Involving matching contexts between colloquial (dialog) and formal (passage) writings makes this task extremely challenging."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-18",
"text": "Distinguished from the previous work that only focused on a single variable per passage Ma, Jurczyk, and Choi [2018] , we propose two new passage completion tasks on multiparty dialogue which increase the task complexity by replacing more character mentions with variables with a better motivated data split."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-19",
"text": "The details of these tasks are in Section ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-20",
"text": "Several deep neural network models are developed to validate the three reading comprehension tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-21",
"text": "Based on the experimental results, we aim to identify main challenges in these tasks and suggest future research directions."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-22",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-23",
"text": "**RELATED WORK**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-156",
"text": "the guy 's hammered ! @ent02 I 'm sorry @ent01 , as long as he 's here and he 's conscious we 're still shooting . -( he walks away and @ent01 does @ent04 's fist thing ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-24",
"text": "In the CNN/Daily Mail dataset introduced by Hermann et al. [2015] , the task is to infer the missing entity (answer a) of a question q from all the possible entities that appear in a passage p, where the passage is a news article, the question is a cloze-style query in which one of the entities is replaced by a placeholder, and the answer is this questioned entity."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-25",
"text": "Many other models have been proposed to tackle this dataset, which are either RNN and attention based Chen, Bolton, and Manning [2016] or CNN and RNN based Trischler et al. [2016] or gated attention based Dhingra et al. [2017] or attention over attention based Cui et al. [2017] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-26",
"text": "Finally, the pretrained bi-directional transformer encoder (BERT) introduced by Devlin et al. [2018] shows that pretraining language representations can bring better performance in several downstream tasks, including machine reading comprehension."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-27",
"text": "More datasets are available for another type of a reading comprehension task, that is multiple choice question answering, such as MCTest Richardson, Burges, and Renshaw [2013] , TriviaQA Joshi et al. [2017] , RACE Lai et al. [2017] and Dream Sun et al. [2019] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-28",
"text": "Unlike the above tasks where documents and queries are written in a similar writing style, the multiparty dialogue reading comprehension task introduced by Ma, Jurczyk, and Choi [2018] has a very different writing style between dialogues and queries."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-29",
"text": "However, their randomized assignment of samples to training and test data substantially decreases the complexity of cloze-style reading comprehension. (arXiv:1911.00773v1 [cs.CL] 2 Nov 2019)"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-30",
"text": "In this paper, we address this issue by introducing a chronological data split and new variants of the cloze-style reading comprehension task to challenge even higher task complexity."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-32",
"text": "**CORPUS**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-33",
"text": "Our corpus comes from the transcripts of the TV show Friends with ten seasons collected by the Character Mining project."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-34",
"text": "Each season contains about 24 episodes, each episode is split into about 13 scenes, and each scene comprises a sequence of about 21 utterances."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-35",
"text": "This dataset contains several layers of annotation."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-36",
"text": "The first two seasons of the show were annotated for an entity linking task by Chen and Choi [2016] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-37",
"text": "Plot summaries of all episodes for the first eight seasons were collected by Jurczyk and Choi [2017] to evaluate a document retrieval task."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-38",
"text": "The rest of the plot summaries were collected by Ma, Jurczyk, and Choi [2018] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-39",
"text": "Table 1 shows the statistical data of the corpus from Ma, Jurczyk, and Choi [2018] ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-40",
"text": "Based on the above corpus we created a new data split different from Ma, Jurczyk, and Choi [2018] 's data split."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-41",
"text": "In the previous work of Ma, Jurczyk, and Choi [2018] , a random data split was used in which 1,187 of 1,349 queries in the development set and 1,207 of 1,353 queries in the test set are generated from the same plot summaries as some queries in the training set, with only different character entities masked, which means the model can see the right answer in the training set."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-42",
"text": "To fix this issue, we created a new data split, the detail of which is in Section ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-43",
"text": "Tasks Table 2a and Table 2b show an example of a dialogue and its plots."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-44",
"text": "We propose three tasks: one is from Ma, Jurczyk, and Choi [2018] , and the other two are new tasks designed by us."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-46",
"text": "**SINGLE VARIABLE TASK(SV)**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-47",
"text": "The single variable task from Ma, Jurczyk, and Choi [2018] consists of a dialogue passage p, a query q taken from the plot summary of the dialogue passage, and an answer a. In this task, a query q replaces only one character entity with an unknown variable x, and the machine is asked to infer the replaced character entity (answer a) from all the possible entities that appear in the dialogue passage p (corpus: https://github.com/emorynlp/character-mining). This task is evaluated by computing the accuracy of predictions (see Section )."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-48",
"text": "Table 2c shows an example of this task."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-50",
"text": "**MULTIPLE VARIABLE TASK ON THE SAME ENTITY(MVS)**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-51",
"text": "Similar to the first task, the multiple variable task on the same entity also consists of a dialogue passage p, a query q taken from the plot summary of the dialogue passage, and an answer a. The difference is that the query q replaces all mentions of the same character entity with the same unknown variable x, and the machine is asked to infer this character entity (answer a) from all the possible entities that appear in the dialogue passage p. The difficulty of this task is that the machine has to predict a character entity who does not appear in the query."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-52",
"text": "The evaluation of this task is the same as the first task since the model only predicts one entity (see Section )."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-53",
"text": "Table 2d shows an example of this task."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-55",
"text": "**TWO VARIABLES TASK(TV)**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-56",
"text": "To make this task more intriguing, we propose two new tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-57",
"text": "This task consists of a dialogue passage p, a query q taken from the plot summary of the dialogue passage, and an answer pair a1 and a2."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-58",
"text": "Here the query replaces two character entities with two different unknown variables x1 and x2, and the machine is asked to infer these two missing character entities (the answer pair a1 and a2) from all the possible entities that appear in the dialogue passage p. The two replaced character entities can be the same or different, and to avoid bias between the two unknown variables, different orderings of the variables are considered as different queries."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-59",
"text": "Since several plot summaries have only one character entity in them, the model needs to be designed to predict one or two entities, and is evaluated by the F1 score (Section )."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-61",
"text": "**APPROACHES**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-62",
"text": "We selected several published approaches for dialogue reading comprehension and validated them on our new data split and three tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-63",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-64",
"text": "**BILSTM**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-65",
"text": "The BiLSTM model is selected as one of the baseline methods in this project."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-66",
"text": "We deal with utterances and query separately."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-67",
"text": "First, we generate the utterance and query embeddings."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-68",
"text": "For the utterances, we combine them into a document-level matrix, input them into a Bidirectional LSTM, and get the output h d ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-69",
"text": "For the query, we directly input it into the BiLSTM and get the output h q ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-70",
"text": "Finally, we concatenate the outputs h d and h q and pass them into a softmax classification layer."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-71",
"text": "By using Bidirectional LSTM, we can extract the sequence feature of utterances and queries."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-72",
"text": "Figure 1 shows the architecture for this model."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-74",
"text": "**CNN+BILSTM**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-75",
"text": "Based on Ma, Jurczyk, and Choi [2018] , we first use a CNN to extract the gram-level features of utterances. (Plot from Table 2b: @ent04 asks @ent00 how someone could get a hold of @ent00 's credit card number and @ent00 is surprised at how much was spent .)"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-76",
"text": "(c) Queries generated from the passages in (b) for single variable task ID Queries 1.a x spent $ 69.95 on a Wonder Mop 2.a x asks @ent00 how someone could get a hold of @ent00 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-77",
"text": "2.b @ent04 asks x how someone could get a hold of @ent00 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-78",
"text": "2.c @ent04 asks @ent00 how someone could get a hold of x 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-79",
"text": "2.d @ent04 asks @ent00 how someone could get a hold of @ent00 's credit card number and x is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-80",
"text": "(d) Queries generated from the passages in (b) for multiple variables task on the same entity ID Queries 1.a x spent $ 69.95 on a Wonder Mop 2.a x asks @ent00 how someone could get a hold of @ent00 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-81",
"text": "2.b @ent04 asks x how someone could get a hold of x 's credit card number and x is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-82",
"text": "(e) Queries generated from the passages in (b) for two variables task ID Queries 1.a x1 spent $ 69.95 on a Wonder Mop 1.b x2 spent $ 69.95 on a Wonder Mop 2.a x1 asks x2 how someone could get a hold of @ent00 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-83",
"text": "2.b x2 asks x1 how someone could get a hold of @ent00 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-84",
"text": "2.c x1 asks @ent00 how someone could get a hold of x2 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-85",
"text": "2.d x2 asks @ent00 how someone could get a hold of x1 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-86",
"text": "2.e @ent04 asks x1 how someone could get a hold of x2 's credit card number and @ent00 is surprised at how much was spent ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-87",
"text": "... ... Table 2 : A dialogue and its plots from the corpus and generated queries for each task."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-88",
"text": "x, x1 and x2 denote unknown variables."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-89",
"text": "We then use LSTM to capture the sequence features of both the utterances and the query."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-90",
"text": "A CNN is applied to each utterance to extract more token-level n-gram features; we then take max pooling, combine the results into a document-level matrix, input it into the Bidirectional LSTM, and get the output h d ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-91",
"text": "For the query, we directly input it into the Bidirectional LSTM and get the output h q ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-92",
"text": "Finally, we concatenate the outputs h d and h q and pass them into a softmax classification layer."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-93",
"text": "By using a CNN and a Bidirectional LSTM, we can theoretically capture both n-gram features and sequence features."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-95",
"text": "**CNN+BILSTM+UA+DA**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-96",
"text": "This method was the state-of-the-art method on Ma, Jurczyk, and Choi [2018] 's data split, and it is also selected as one of our experimental methods."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-97",
"text": "Similar to the previous model, they still use a CNN to extract token-level n-gram features of utterances and an LSTM to capture sequence features; however, they add utterance-level attention and document-level attention to extract more features related to the similarity between the query and the utterances."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-98",
"text": "The utterance-level attention computes the similarity between each utterance and the query."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-99",
"text": "The document-level attention computes the similarity between the query and the whole document matrix."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-100",
"text": "Their outputs are concatenated with the original Bidirectional LSTM output and passed into a softmax classification layer."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-101",
"text": "By adding the attention layers, we can theoretically capture similarity features between the utterances and the query."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-102",
"text": "Figure 2 shows the architecture for this model."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-104",
"text": "**EXPERIMENTS DATA SPLIT OVERVIEW**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-105",
"text": "Our data split has three parts: for each season, we use episodes 1-18 as training data, episodes 19-21 as development data, and the remaining episodes as test data."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-106",
"text": "This data split is chosen to mimic the training data which a model would use in an application setting for reading comprehension, where historic data is used for training while recent data is the subject for prediction."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-108",
"text": "**EVALUATION METRICS**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-109",
"text": "We use different evaluation metrics for different tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-110",
"text": "For the SV task and the MVS task, we use simple accuracy, since the model only predicts one variable: Accuracy = C r / C t ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-111",
"text": "where C r is the number of correct predictions and C t is the total number of predictions."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-112",
"text": "For the TV task, we use the F1 score, since the model predicts one or two variables: F1 = 2 * P * R / (P + R), where the precision P = C r / C a and the recall R = C r / C g ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-113",
"text": "where C r is the number of correct predictions, C a is the number of actual predictions, and C g is the number of gold answers."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-115",
"text": "**RESULTS AND ERROR ANALYSIS**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-116",
"text": "Results Table 4 shows the results of our experiment."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-117",
"text": "BiLSTM is good at capturing the sequence information of sentences; however, since it only learns answer distributions from the sequence information, it cannot capture the relation between the query and the utterances."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118",
"text": "Adding a CNN can achieve even lower accuracy because passing sequences to the CNN only keeps important information after the pooling operation, whereas for dialogue data the replaced entity most of the time needs to be decided from context. The methods of Ma, Jurczyk, and Choi [2018] are not helpful for these tasks on our data split because dialogues contain so many informal expressions and the size of the corpus is small."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-119",
"text": "With limited data size, the attention mechanism does not perform well to match different expressions with the same meaning."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-120",
"text": "The BERT model only uses the attention mechanism to model language features; however, the pre-trained model comes from formal text (Wikipedia, etc.), not from a dialogue corpus, so it also performs poorly on our data split and tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-121",
"text": "A counter-intuitive result is that the MVS task appears to be much easier than the SV task."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-122",
"text": "The main reason is that the model can easily learn that an entity appearing in the query will not appear in the answer, which eliminates many answer choices."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-123",
"text": "Another counter-intuitive result is that BiLSTM without any additional components is the best model in two of the three tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-124",
"text": "The main reason is that the attention mechanism does not work well for these tasks on this dataset."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-125",
"text": "Error analysis For further research, we extracted 100 samples from the test set of each task, using its best-performing model, to analyze what kinds of errors occur."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-126",
"text": "We found that the three main types of errors for these tasks are Hidden Meaning, Utterances Reasoning & Summary, and Coreference Resolution."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-127",
"text": "Table 5 shows the error type distribution of each task."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-128",
"text": "Table 6 shows examples of the three main error types."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-129",
"text": "Hidden Meaning Hidden meaning is one of the most frequent error types and it happens when a query uses formal expressions to represent the same meaning as an utterance with informal expressions."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-135",
"text": "Coreference Resolution This type of error occurs when there are coreferences in the utterances and query and it is confusing to determine what is the correct entity behind the pronoun."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-136",
"text": "This type of error occurs when there are numerous entities in the dialogue and the model cannot differentiate them because it does not sufficiently recognize coreferences."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-137",
"text": "Object Linking This type of error refers to a situation when a query needs to know which entity has some kind of relationship with some kind of object (for example, another entity, a substance, etc.)."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-138",
"text": "It can be especially confusing when the object and entity are not in the same utterance."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-139",
"text": "Annotation Issues This type of error refers to some issues when doing annotation, such as mislabeling of answers, unanswerable or irrelevant queries, misspellings in the utterance, mislabeling the entity in the dialogues, etc."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-140",
"text": "This type of error is trivial and hard to solve but still reflects the real situation in human communication."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-141",
"text": "Handle Single Variable This type of error occurs when solving the TV task since the TV task can either have a single variable or two variables."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-142",
"text": "The model cannot tell the differences between the single variable and two variables, and most of the time this occurs when the correct answer is only one variable, the model predicts two variables."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-143",
"text": "Miscellaneous This type of error represents error samples in which we cannot find apparent reason for the wrong prediction."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-144",
"text": "We need to research more to find them."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-146",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-147",
"text": "In this paper, we addressed the issue with the evaluation method of previous work by introducing a new data split and introduced new variants of the cloze-style reading comprehension task to increase task complexity."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-148",
"text": "In addition, we ran several neural network models to validate our new data split and new tasks."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-149",
"text": "By introducing evaluation methods that better motivate the dialogue reading comprehension task, we discovered many challenges that the model cannot easily tackle in the real situation."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-150",
"text": "Finally, we propose what we will do in the next step to address some of these challenges."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-151",
"text": "In order to understand the hidden meaning in the dialogues, we plan to create dialogue specific meaning representation to enable the model to learn the hidden meaning (a) Error example for hidden meaning: \"you two together\" = \"have a relationship\" Answer Query @ent02 @ent03 was with x picking out the ring."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-152",
"text": "Speaker Utterance ... ... @ent03 what ? @ent04 @ent02 's gonna ask @ent05 to marry him ! @ent03 oh I know , I helped pick out the ring . -( @ent02 laughs , turns , and sees that @ent04 and @ent00 aren't happy . ) ... ..."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-153",
"text": "(c) Error example for coreference resolution: resolution of \"he\""
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-154",
"text": "Answer Query @ent05 @ent01 tries to leave the set , but @ent02 tells him that so long as x is conscious and present on set , they will continue to shoot the film ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-155",
"text": "Speaker Utterance ... ... @ent01 @ent03 ! you got ta let me go ."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-157",
"text": "he then enters @ent05 's dressing room , to find @ent05 cutting his steak with his sword . ) @ent05 you would n't happen to have a very big fork ? ... ... of what the entity is trying to express."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-158",
"text": "However, this kind of approach is also limited by its labeled data size."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-159",
"text": "We will try to find unsupervised ways to enable the model to learn the semantic information itself."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-160",
"text": "Additionally, we will intrigue more approaches to enable model's understanding of hidden meaning in the dialogue."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-161",
"text": "For example, we will design more appropriate unsupervised pre-training tasks targeting to enable the model to learn the corresponding hidden meaning of expressions which have hidden meaning."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-162",
"text": "Furthermore, to overcome reasoning and summary challenges, we plan to incorporate knowledge base reasoning by either constructing a knowledge graph or using more dialogue data to design pre-training tasks to enable the model's ability to do inference from dialgoues."
},
{
"sent_id": "2cedb1a0f0c0fbb9bd95d5b54e4967-C001-163",
"text": "Another approach we plan to try is to use a neural network model such as graphic convolution network Defferrard, Bresson, and Van-dergheynst [2016] that can represent a knowledge graph structure based on dialogue dataset."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-12"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-28"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-37",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-38"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-47"
]
],
"cite_sentences": [
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-12",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-28",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-38",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-47"
]
},
"@MOT@": {
"gold_contexts": [
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-12"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-41"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118"
]
],
"cite_sentences": [
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-12",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-41",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118"
]
},
"@USE@": {
"gold_contexts": [
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-14"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-39",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-40"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-44"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-75"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-96"
],
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-116",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-117",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118"
]
],
"cite_sentences": [
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-14",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-39",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-40",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-44",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-75",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-96",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118"
]
},
"@EXT@": {
"gold_contexts": [
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-18"
]
],
"cite_sentences": [
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-18"
]
},
"@DIF@": {
"gold_contexts": [
[
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-116",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-117",
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118"
]
],
"cite_sentences": [
"2cedb1a0f0c0fbb9bd95d5b54e4967-C001-118"
]
}
}
},
"ABC_d7dba136667d6058bf46d6ede3f2ef_8": {
"x": [
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-81",
"text": "**EASY-TO-HARD DECODING**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-129",
"text": "Our Model."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-201",
"text": "Implementation details."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-2",
"text": "Existing entity alignment methods mainly vary on the choices of encoding the knowledge graph, but they typically use the same decoding method, which independently chooses the local optimal match for each source entity."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-3",
"text": "This decoding method may not only cause the \"many-to-one\" problem but also neglect the coordinated nature of this task, that is, each alignment decision may highly correlate to the other decisions."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-4",
"text": "In this paper, we introduce two coordinated reasoning methods, i.e., the Easy-to-Hard decoding strategy and joint entity alignment algorithm."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-5",
"text": "Specifically, the Easy-to-Hard strategy first retrieves the model-confident alignments from the predicted results and then incorporates them as additional knowledge to resolve the remaining modeluncertain alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-6",
"text": "To achieve this, we further propose an enhanced alignment model that is built on the current state-of-the-art baseline."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-7",
"text": "In addition, to address the many-to-one problem, we propose to jointly predict entity alignments so that the one-to-one constraint can be naturally incorporated into the alignment prediction."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-8",
"text": "Experimental results show that our model achieves the state-of-the-art performance and our reasoning methods can also significantly improve existing baselines."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-11",
"text": "Knowledge graphs (KGs), such as Freebase (Bollacker et al., 2008) and DBpedia (Auer et al., 2007) , represent worldlevel factoid information of entities and their relations in a graph-based format."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-12",
"text": "They have been successfully used in many natural language processing applications, such as question answering (Berant et al., 2013; Bao et al., 2014; Yih et al., 2015; Xu et al., 2016; Das et al., 2017) and relation extraction (Mintz et al., 2009; Hoffmann et al., 2011; Min et al., 2013; Zeng et al., 2015) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-13",
"text": "To date, there have been many KGs in different languages, with each being created in one language (Franco-Salvador, Rosso, and Montesy G\u00f3mez, 2016) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-14",
"text": "They share lots of the same facts, and each also provides rich additional information that the others do not cover."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-15",
"text": "Thus, it is very beneficial to establish the crosslingual alignments between KGs, so that the combined KG can provide richer knowledge for downstream tasks."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-16",
"text": "There-Copyright c 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org)."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-17",
"text": "All rights reserved."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-18",
"text": "fore, the cross-lingual KG alignment task, which automatically matches entities between multilingual KGs, is proposed to address this problem."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-19",
"text": "Most recently, several approaches based on cross-lingual entity embeddings (Hao et al., 2016; Chen et al., 2017; Sun, Hu, and Li, 2017) or graph neural networks Xu et al., 2019; Wu et al., 2019) have been proposed for this task."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-20",
"text": "In particular, Xu et al. (2019) introduces the topic entity graph to capture the local context information of an entity within the KG, and further tackles this task as a graph matching problem by proposing a graph matching network."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-21",
"text": "This work significantly advanced the state-of-theart accuracies across several datasets."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-22",
"text": "Despite the excitingly progressive results that have been shown, all previous works fail to consider the coordinated nature of this task, that is, each alignment decision may highly correlate to the other decisions."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-23",
"text": "For example, all existing models independently align each source entity, which may result in the many-to-one mapping, i.e., more than one source entities are aligned to the same target entity."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-24",
"text": "In particular, we analyze the results of Xu et al. (2019) and find that nearly 8% of the alignments are many-to-one mappings."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-25",
"text": "One intuitive solution is to align these entities in a greedy fashion, that is, assign one alignment at each time with a constraint that all alignments are one-to-one mappings."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-26",
"text": "However, this may introduce the error propagation, since each decision error may propagate to the future decisions."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-27",
"text": "On the other hand, given the fact that the KGs are large, it is also impractical to jointly assign all alignments, due to the massive search space."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-28",
"text": "We analyze the results of existing alignment baselines and find the second type of errors are caused by the existence of adversarial entities that have similar surface strings and KG neighbors with the ground truth."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-29",
"text": "It is challenging for exist-ing approaches to disambiguate these entities since previous methods mainly rely on the embeddings that are derived by encoding the surface strings and KG neighbors."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-30",
"text": "Figure 1 gives such an example, where it is ambivalent for a model to align \u4e54\u6cbb\u00b7\u5e03\u4ec0 (George Bush) to \"George W. Bush\" or \"George H. W. Bush\", because both candidates have similar surface strings and share several common neighbors (such as \"Republic Party\" and \"U.S. president\")."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-31",
"text": "In this paper, we propose to alleviate these two types of errors using two coordinated reasoning methods, i.e., the Easy-to-Hard strategy and joint entity alignment algorithm."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-32",
"text": "Specifically, the Easy-to-Hard strategy leverages an iterative approach, where the most model-confident (easy) alignments predicted in the previous iteration are provided as additional inputs to the current iteration for resolving the remaining model-uncertain (hard) alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-33",
"text": "This idea is motivated by our observation that the model-confident alignments are mostly correct, and thus they can provide reliable clues for other decisions with less model confidence."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-34",
"text": "To address the many-to-one problem, we propose a joint entity alignment algorithm that finds the global optimal entity alignments that satisfy the one-to-one constraint."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-35",
"text": "This problem is essentially a fundamental combinatorial optimization problem whose exact solution can be found by the Hungarian algorithm (Kuhn, 1955) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-36",
"text": "However, since this algorithm takes a high time complexity of O(N 4 ) for KGs of N nodes, it is impractical to apply this algorithm in our framework directly."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-37",
"text": "To address this, we propose a simple yet effective solution that breaks down the whole search space into small isolated pieces, so that each piece could be efficiently solved with the Hungarian algorithm."
},
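The partitioning idea can be illustrated with a minimal Python sketch; this is not the authors' implementation, and `joint_align`, `best_assignment`, and the `top_k` candidate-pruning rule are illustrative names and assumptions. Each source entity is linked to its `top_k` candidate targets, the resulting bipartite graph is split into connected components, and each small component is solved exactly (a brute-force search stands in for the Hungarian algorithm inside each piece).

```python
import itertools
from collections import defaultdict

def best_assignment(scores, sources, targets):
    """Brute-force the max-score one-to-one assignment inside one
    small component (a stand-in for the Hungarian algorithm)."""
    best, best_map = float("-inf"), {}
    for perm in itertools.permutations(targets, len(sources)):
        total = sum(scores[s][t] for s, t in zip(sources, perm))
        if total > best:
            best, best_map = total, dict(zip(sources, perm))
    return best_map

def joint_align(scores, top_k=2):
    """Partition the global assignment problem: link each source entity
    to its top_k candidate targets, split the resulting bipartite graph
    into connected components, and solve each component independently."""
    parent = {}  # union-find over ("s", i) source and ("t", j) target nodes

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    n_src = len(scores)
    for i in range(n_src):
        top = sorted(range(len(scores[i])), key=lambda j: -scores[i][j])[:top_k]
        for j in top:
            union(("s", i), ("t", j))

    # Group sources and targets by their component root.
    comps = defaultdict(lambda: ([], []))
    for i in range(n_src):
        comps[find(("s", i))][0].append(i)
    for j in range(len(scores[0])):
        root = find(("t", j))
        if root in comps:
            comps[root][1].append(j)

    alignment = {}
    for sources, targets in comps.values():
        alignment.update(best_assignment(scores, sources, targets))
    return alignment
```

On a score matrix where independent argmax decoding maps two source entities to the same target, the joint decoding instead returns a one-to-one assignment with a higher total score.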
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-38",
"text": "Experiments on the benchmark datasets show that our proposed coordinated reasoning methods can not only improve the current state-of-the-art performance but also significantly boost the performance of previous approaches."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-39",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-40",
"text": "**RELATED WORK**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-41",
"text": "Our work is mainly related to two lines of research: network embedding and knowledge graph alignment."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-43",
"text": "**GRAPH CONVOLUTIONAL NETWORKS**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-44",
"text": "Recently, there has been an increasing interest in extending neural networks to deal with graphs."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-45",
"text": "Defferrard, Bresson, and Vandergheynst (2016) proposed a spectral graph theoretical formulation of CNNs on graphs and a convolutional network extending the conventional CNNs to non-Euclidean space."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-46",
"text": "Kipf and Welling (2017) further extended this idea and proposed graph convolutional neural networks (GCNs) to integrate the connectivity patterns and feature attributes of graph-structured data, and achieved decent results in semi-supervised classification."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-47",
"text": "Thereafter, a series of improvements and extensions were proposed based on GCN."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-48",
"text": "GAT (Veli\u010dkovi\u0107 et al., 2017) employs the attention mechanism to GCNs, in which each node gets an importance score based on its neighborhood, thus providing more expressive representations for nodes."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-49",
"text": "Furthermore, the R-GCNs (Schlichtkrull et al., 2018 ) are proposed to model relational data and have been successfully exploited in link prediction and entity classification."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-50",
"text": "Inspired by the capability of GCNs on learning node representations, we employ the GCN to build our entity alignment framework."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-52",
"text": "**ENTITY ALIGNMENT**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-53",
"text": "Earliest approaches of entity alignment usually require expensive expert efforts to design model features (Mahdisoltani, Biega, and Suchanek, 2013) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-54",
"text": "Recently, embedding based methods have been proposed to address this issue."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-55",
"text": "MTransE (Chen et al., 2017) employs TransE (Bordes et al., 2013) to embed entities and relations of each knowledge graph in a separate space, and then provides five different variants of transformation functions to project the embedded vectors from one subspace to another."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-56",
"text": "The candidate set of one entity's correspondence in the other knowledge graph can be obtained by ranking the distance between them in the transformed space."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-57",
"text": "ITransE (Zhu et al., 2017) utilizes TransE to learn one common low-dimensional subspace for all knowledge graphs, with the constraint that the observed anchor seeds from different knowledge graphs share the same vector representation in the subspace."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-58",
"text": "AlignE (Sun, Hu, and Li, 2017) also adopts TransE to learn network embeddings, and applies parameter swapping to encode network into a unified space."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-59",
"text": "NTAM (Li et al., 2018 ) utilizes a probabilistic model for the alignment task."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-60",
"text": "Instead of using TransE to derive entity embeddings from the knowledge graph, various GCN based methods Ye et al., 2019; Wu et al., 2019) that use the conventional GCN to encode the entities and relations have been proposed to perform the alignment."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-61",
"text": "Different with those methods that still follow previous works that rely on learned entity embeddings to rank alignments, Xu et al. (2019) views this task as a graph matching problem and further proposes a graph matching neural network that additionally considers the matching information of an entity's neighborhood to perform the prediction."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-62",
"text": "Despite these approaches achieve progressive results, all current works focus on encoding the entities and relations, while neglecting the fact that the decoding strategy may have a considerable impact over the final performance."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-63",
"text": "In this paper, we explore the coordinated nature of this task and propose two types of reasoning methods to improve the performance of these baselines."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-65",
"text": "**PROBLEM FORMULATION**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-66",
"text": "Formally, a KG is represented as G = (E, R, T ), where E, R, T are the sets of entities, relations, and triples, respectively."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-67",
"text": "Let G 1 = (E 1 , R 1 , T 1 ) and G 2 = (E 2 , R 2 , T 2 ) be two heterogeneous KGs to be aligned."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-68",
"text": "That is, an entity in G 1 (source entity) may have its counterpart in G 2 (target entity) in a different language or different surface names."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-69",
"text": "As a starting point, we can collect a small number of equivalent entity pairs between G 1 and G 2 as the alignment seeds."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-70",
"text": "We define the entity alignment task as automatically finding more equivalent entities using the alignment seeds as training data."
},
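As a minimal, illustrative encoding of this formulation (the entity and relation names below are placeholders, not drawn from any benchmark dataset):

```python
from typing import NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

class KG(NamedTuple):
    """A knowledge graph G = (E, R, T)."""
    E: Set[str]     # entities
    R: Set[str]     # relations
    T: Set[Triple]  # triples

# Two toy KGs to be aligned.
g1 = KG(E={"e1", "e2"}, R={"r1"}, T={("e1", "r1", "e2")})
g2 = KG(E={"f1", "f2"}, R={"s1"}, T={("f1", "s1", "f2")})

# A small set of known equivalent entity pairs serves as alignment seeds;
# the task is to find the remaining equivalences using these as training data.
seeds = {("e2", "f2")}
```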
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-72",
"text": "**COORDINATED REASONING**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-73",
"text": "All existing works follow the conventional framework that first encodes the context information of the source entity within the KG into a distributional representation and then ranks the candidate target entities according to the representation similarities."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-74",
"text": "These works may differ in the choice of the encoder, such as TransE or GCN, but all of them utilize the same decoding method, which simply picks the local optimal candidate for each source entity without considering the global alignment coherence."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-75",
"text": "For example, more than one source entities may be aligned to the same target entity, causing the many-to-one problem."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-76",
"text": "This simple decoding strategy also neglects the coordinated nature of this task, that is, previously predicted alignments are also helpful to future predictions."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-77",
"text": "Motivated by these observations, we propose two types of coordinated reasoning methods."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-78",
"text": "First, to address the manyto-one problem, we jointly predict alignments by explicitly incorporating the one-to-one constraint into the decoding."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-79",
"text": "Second, we propose a new Easy-to-Hard decoding strategy that first resolves the most model-confident alignments and then uses them as additional evidence to better handle the model-uncertain alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-82",
"text": "All existing models independently predict alignments for source entities while neglecting the fact that the decoding strategy may have a significant impact over the performance."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-83",
"text": "Figure 1 illustrates such an example where the goal is to align \u4e54\u6cbb\u00b7\u5e03\u4ec0 (George Bush) from the Chinese KG into the English KG."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-84",
"text": "Given its two candidates, i.e., George W. Bush and George H. W. Bush, it is challenging for previous methods to find the correct alignment (George W. Bush) since these candidates have almost the same neighbors, except that George W. Bush graduated from Harvard University while George H. W. Bush not."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-85",
"text": "On the other hand, we can see that the Chinese KG includes a fact, <\u4e54\u6cbb\u00b7\u5e03\u4ec0 graduated from \u54c8\u4f5b\u5927\u5b66 (Harvard University)>, which is strong evidence for aligning \u4e54\u6cbb\u00b7\u5e03\u4ec0 to George W. Bush."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-86",
"text": "Intuitively, if a model could first align \u54c8\u4f5b\u5927\u5b66 to the Harvard University and introduce this as additional knowledge, it could be more easy for the model to find the correct alignment for \u4e54\u6cbb\u00b7\u5e03\u4ec0."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-87",
"text": "Compared to the alignment for \u4e54\u6cbb\u00b7\u5e03 \u4ec0, which is Hard to resolve, the alignment for \u54c8\u4f5b\u5927\u5b66 is relatively Easier."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-88",
"text": "Inspired by the above observation, in this paper, we propose a new decoding method, namely Easy-to-Hard strategy, which first attempts to resolve \"easy\" alignments in the test set and then incorporates them as additional knowledge into the model to better tackle the remaining \"hard\" alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-89",
"text": "There are two main challenges here."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-90",
"text": "First of all, it is difficult to determine whether an alignment is easy or hard to resolve."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-91",
"text": "Second, existing dominant models are mainly built on the graph neural networks, and it is unclear how to integrate such additional knowledge into their models."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-92",
"text": "We analyze the alignment results of three baseline methods, i.e., Wang et al. (2018) , Xu et al. (2019) and Wu et al. (2019) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-93",
"text": "Interestingly, we find that all these baselines could achieve at least 99.5% accuracy for those alignments with normalized probabilities over 0.9."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-94",
"text": "This result is coherent with our expectation since a higher probability typically suggests that the model is more confident about the prediction and also indicates that this alignment is easier for the model to resolve."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-95",
"text": "Therefore, we apply the following steps to decode the test set iteratively."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-96",
"text": "Step Description 1 Employ an alignment model to predict alignments for all source entities in the test set."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-97",
"text": "2"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-98",
"text": "Use a predefined probability threshold \u03b1 to refine those alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-99",
"text": "In particular, assignments with probabilities higher than \u03b1 are regarded as easy alignments while the others are viewed as hard alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-100",
"text": "3"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-101",
"text": "If more than K easy alignments are found in"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-102",
"text": "Step 2, take these easy alignments as additional knowledge and incorporate them into the alignment model to establish alignments for the remaining entities (go to Step 1); otherwise, return all alignments."
},
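The iterative procedure above can be sketched as a short loop. This is an illustrative sketch rather than the authors' code: `model.predict` is a hypothetical interface, and `alpha` and `min_new` stand in for the probability threshold α and the count K.

```python
def easy_to_hard_decode(model, sources, alpha=0.9, min_new=10):
    """Illustrative Easy-to-Hard decoding loop.

    `model.predict(entity, knowledge)` is a hypothetical interface that
    returns a (target, probability) pair, conditioning on the easy
    alignments already resolved in earlier rounds.
    """
    resolved = {}            # accepted (easy) alignments so far
    pending = list(sources)  # source entities still to be aligned
    while pending:
        # Step 1: predict alignments for all unresolved source entities.
        preds = {e: model.predict(e, knowledge=resolved) for e in pending}
        # Step 2: alignments with probability >= alpha are "easy".
        easy = {e: t for e, (t, p) in preds.items() if p >= alpha}
        # Step 3: if too few new easy alignments were found, accept the
        # current best guesses and stop; otherwise feed the easy
        # alignments back in and re-decode the remaining entities.
        if len(easy) < min_new:
            resolved.update({e: t for e, (t, _) in preds.items()})
            break
        resolved.update(easy)
        pending = [e for e in pending if e not in easy]
    return resolved
```

In the running example of Figure 2, the first round would confidently resolve the university entities, and the second round, conditioned on those, would disambiguate the harder George Bush alignment.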
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-103",
"text": "After establishing easy assignments in each decoding step, we need to incorporate them as additional knowledge into the alignment model for the next round decoding."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-104",
"text": "This design heavily depends on alignment model architecture."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-105",
"text": "In this paper, we use the state-of-the-art alignment model (Xu et al., 2019) as our baseline method and propose two ways to enhance this model by incorporating easy assignment information."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-106",
"text": "Alignment Model Baseline."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-107",
"text": "Xu et al. (2019) utilized a graph (namely topic graph) to capture the context information of an entity (namely topic entity) within the KG."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-108",
"text": "For instance, Figure 2 gives the topic graphs of George Bush in both the Chinese and English KG."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-109",
"text": "The entity alignment task is then viewed as a graph matching problem, whose goal is to calculate the similarity of these two topic graphs, say G 1 and G 2 ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-110",
"text": "To achieve this, they further propose a neural graph matching model that includes the following four layers:"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-111",
"text": "\u2022 Input Representation Layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-112",
"text": "The goal of this layer is to learn embeddings for entities that occurred in topic entity graphs by using a graph convolution neural network (GCN) (Kipf and Welling, 2017) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-113",
"text": "\u2022 Node-Level Matching Layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-114",
"text": "This layer is designed to capture local matching information by comparing each entity embedding of one topic entity graph against all entity embeddings of the other graph in both ways (from G 1 to G 2 and G 2 to G 1 )."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-115",
"text": "\u2022 Graph-Level Matching Layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-116",
"text": "In this layer, the model applies another GCN to propagate the local matching information throughout the graph."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-117",
"text": "The motivation behind it is that this GCN layer can encode the global matching state between the pairs of whole graphs."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-118",
"text": "The model then feeds these matching representations to a fully-connected neural network and applies the element-wise max and mean pooling method to generate a fixed-length graph matching representation."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-121",
"text": "**ALIGNMENT RESULTS**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-123",
"text": "**INFERENCE**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-124",
"text": "George W. Bush Figure 2 : A running example of our Easy-to-Hard decoding strategy for aligning George Bush in the English and Chinese knowledge graph."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-125",
"text": "After the first round decoding, the baseline model aligns \u54c8\u4f5b\u5927\u5b66 to Harvard and \u8036\u9c81\u5927\u5b66 to Yale, because their probabilities predicted by M 0 is higher than \u03b1."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-126",
"text": "After introducing these information, our enhanced model M 1 increased the probability of aligning \u4e54\u6cbb\u00b7\u5e03\u4ec0 to George W. Bush while decreasing the probability of its alignment to George H. W. Bush."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-127",
"text": "\u2022 Prediction Layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-128",
"text": "The model finally uses a two-layer feed-forward neural network to consume the fixed-length graph matching representation and applies the softmax function in the output layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-130",
"text": "In contrast to Xu et al. (2019) that only takes two topic graphs as input, we can utilize additional information such as easy assignments found in previous decoding steps to resolve hard assignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-131",
"text": "In particular, we introduce two ways to enhance this baseline model by explicitly integrating the easy assignment information into two layers of Xu et al. (2019) :"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-132",
"text": "\u2022 Enhanced Input Representation Layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-133",
"text": "In this layer, Xu et al. (2019) utilizes a GCN to learn entity embeddings from the topic graph, where the entity surface form has been proved to be a key feature in deriving their embeddings."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-134",
"text": "Therefore, we require that the aligned entities found in the easy alignments should have the same surface forms so that they could share the common embeddings."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-135",
"text": "For example, in Figure 2 , after the first round of decoding, \u54c8\u4f5b\u5927\u5b66 (Harvard University) is aligned to Harvard, we then change the surface form of \"\u54c8\u4f5b\u5927\u5b66\" to \"Harvard\" in the second decoding step."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-136",
"text": "\u2022 Enhanced Node-Level Matching Layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-137",
"text": "As concluded in Xu et al. (2019) , the node-level matching layer has a significant impact on the matching performance, since it captures the local entity matching information."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-138",
"text": "In the baseline model, the entity similarities are calculated based on the entity embeddings derived from the first GCN layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-139",
"text": "Although in the enhanced input representation layer the aligned entities have the same surface forms, it can still not guarantee that their embeddings are close."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-140",
"text": "It is because the first GCN layer is supposed to encode not only the surface form but also the structural information into their representations."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-141",
"text": "Therefore, we explicitly incorporate the easy alignment information into this layer by enforcing that the normalized similarities between the aligned entities to be 1.0."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-142",
"text": "Then, we feed the revised entity similarities to the graph-level matching layer and the final prediction layer."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-143",
"text": "Notice that, in practice, there are two possible options to build the enhanced alignment model in our framework."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-144",
"text": "First, we can directly use the pre-trained baseline but replace its first two layers with our proposed enhanced layers."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-145",
"text": "Because we do not modify the model architecture, no more parameters are needed to be learned."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-146",
"text": "The second way is to train a new enhanced alignment model with randomly sampled alignments as simulated easy alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-147",
"text": "The motivation behind is that given more easy alignments, the model could more focus on learning to disambiguate hard alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-148",
"text": "Experimental results show that the latter achieves much better performance."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-149",
"text": "We will discuss these two options in the experiment section."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-151",
"text": "**JOINT ENTITY ALIGNMENT**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-152",
"text": "As shown in Figure 3 (a), our model typically outputs a 2dimensional matrix of probabilities after decoding, where each cell item (such as p(e t |e s )) represents the likelihood of aligning source entity e s to target entity e t ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-153",
"text": "The goal of the entity alignment task is then equal to find the best solution (a set of one-to-one alignments) with the highest probability:"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-154",
"text": "where A represents one solution."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-155",
"text": "Since knowledge graphs are usually huge, this problem cannot be solved by naive enumeration, which takes O(N !) time for KGs with N entities."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-156",
"text": "Existing works choose the optimal local match for each source entity while neglecting the one-to-one nature, and as a result, multiple source entities may be mapped to one target entity."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-157",
"text": "Here, for the first time, we propose to explicitly incorporate this one-to-one constraint into the alignment prediction."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-158",
"text": "To achieve this, we first reformat the goal from maximizing the product of probabilities (Equation 1) to minimizing the sum of negative log-likelihoods."
},
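Written out, this reformulation is just the monotonicity of the logarithm (notation ours, matching the probability matrix described above):

```latex
A^{*} = \arg\max_{A} \prod_{(e_s, e_t) \in A} p(e_t \mid e_s)
      = \arg\min_{A} \sum_{(e_s, e_t) \in A} \bigl(-\log p(e_t \mid e_s)\bigr)
```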
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-159",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-160",
"text": "**ARG MIN**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-161",
"text": "As a result, the entity alignment problem is equivalently converted to the well-studied \"task assignment\" problem 1 , where each agent/task is assigned to exactly one task/agent, and each agent-task assignment has a fixed cost that does not depend on the other assignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-162",
"text": "The Hungarian algorithm (Kuhn, 1955) has been proven to be efficient for finding the best solution for this problem."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-163",
"text": "It takes a cost matrix as input, which can be easily achieved by padding rows or columns of a constant value for the nonsquare matrix."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-164",
"text": "For a brief introduction, the algorithm takes 1 https://en.wikipedia.org/wiki/Assignment problem the following four main steps for the cost matrix with N \u00d7N elements, where the last two steps repeat until a solution is found."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-165",
"text": "2 It is guaranteed that a solution could be found within O(N 4 ) time."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-166",
"text": "Step Description 1 Find the lowest item for each row and subtract it from the others in that row."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-167",
"text": "2"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-168",
"text": "Similarly, find the lowest item for each column and subtract it from the others in that column."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-169",
"text": "3"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-170",
"text": "Cover all zeros in the resulting matrix using a minimum number of horizontal and vertical lines."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-171",
"text": "If less than N lines are required, go to"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-172",
"text": "Step 4; otherwise, a solution is found."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-173",
"text": "4"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-174",
"text": "Find the smallest item v not covered by any line in Step 3. Subtract v from all uncovered items, and add v to all items covered by two lines."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-175",
"text": "Go to Step 3."
},
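The objective these steps solve can be shown with a tiny brute-force sketch. This is illustrative only (O(N!) enumeration over permutations, exactly the naive search the paper rules out for full KGs); the Hungarian algorithm finds the same minimum in polynomial time, and all names here are ours.

```python
import itertools
import math

def best_assignment(prob):
    """Find the one-to-one assignment maximizing the product of
    probabilities, i.e. minimizing the sum of negative log-likelihoods,
    by brute force over all permutations of the targets."""
    n = len(prob)
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        cost = sum(-math.log(prob[i][perm[i]]) for i in range(n))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best

# Both source entities prefer target 0; joint decoding enforces a
# one-to-one mapping instead of mapping both to target 0.
prob = [[0.9, 0.1],
        [0.8, 0.2]]
print(best_assignment(prob))  # -> [0, 1]
```

Greedy per-source decoding would map both sources to target 0 here, which is precisely the many-to-one problem the joint formulation avoids.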
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-176",
"text": "One can see that naively applying Hungarian is impractical, as it still takes O(N 4 ) computation time for matching two KGs of N nodes."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-177",
"text": "To further decrease the time consumption, we break the whole search space into many isolated sub-spaces, where each sub-space contains only a subset of source and target entities for making alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-178",
"text": "Specifically, we discard the candidate alignments with a probability lower than a predefined threshold \u03c4 from the original search space."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-179",
"text": "Based on this, we define two source entities being connected only if they share common candidates in the target."
},
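The dropping-and-grouping procedure just described can be sketched as a connected-components computation over source entities. This is our own sketch under the stated definition (sources are connected when they share a surviving target candidate); the data layout and function names are assumptions, not the authors' code.

```python
def split_subspaces(candidates, tau):
    """Split the alignment search space into independent sub-spaces.

    candidates: {source: {target: prob}}. Alignments with prob < tau
    are discarded; sources are then grouped into the same sub-space
    whenever they share a surviving target candidate (transitively).
    """
    kept = {s: {t for t, p in ts.items() if p >= tau}
            for s, ts in candidates.items()}
    # Union-find over source entities, with path halving.
    parent = {s: s for s in kept}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    sources = list(kept)
    for i, s1 in enumerate(sources):
        for s2 in sources[i + 1:]:
            if kept[s1] & kept[s2]:
                parent[find(s1)] = find(s2)
    groups = {}
    for s in sources:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())

cands = {"A": {"1": 0.7, "2": 0.05},
         "B": {"1": 0.3, "2": 0.6},
         "C": {"3": 0.9}}
print(split_subspaces(cands, 0.10))  # -> [['A', 'B'], ['C']]
```

Each resulting group can then be handed to the Hungarian algorithm independently, which is what makes the threshold so effective at cutting the overall running time.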
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-180",
"text": "Doing in this way fits the intuition where a large KG usually contains many domains, such as politics, sports and science, and only the entities within each domain have densely interacted."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-181",
"text": "Our experiments show that \u03c4 has little effect on performance, while it dramatically reduces the search time."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-182",
"text": "Figure 3 illustrates the search space separation process, where thin and dotted lines correspond to low-confident alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-183",
"text": "After dropping out these alignments with low model scores, the whole search space is split into two independent sub-spaces, as shown in Figure 3(b) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-184",
"text": "Here A and B are in the same sub-space, as they share the same target can- Removed connections (such as A to 2 ) are considered as infinite cost."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-185",
"text": "As the next step, each sub-space is solved with the Hungarian algorithm, before their results are combined to form our final outputs."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-186",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-187",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-188",
"text": "Datasets."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-189",
"text": "We evaluate our approach on three large-scale cross-lingual datasets from DBP15K (Sun, Hu, and Li, 2017) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-190",
"text": "These datasets are built upon Chinese, English, Japanese and French versions of DBpedia (Auer et al., 2007) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-191",
"text": "Each dataset contains 15,000 inter-language links connecting equivalent entities in two KGs of different languages."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-192",
"text": "We use the same training/testing split as previous works, 30% for training, and 70% for testing."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-193",
"text": "Table 1 lists their statistical summaries."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-194",
"text": "Evaluation Metrics."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-195",
"text": "Like previous works, we use Hits@1 to evaluate our model, where a Hits@1 score (higher is better) is computed by measuring the proportion of correctly aligned entities ranked in the top one."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-196",
"text": "Comparison Models."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-197",
"text": "We compare our approach against existing alignment methods: JE (Hao et al., 2016) , MTransE (Chen et al., 2017) , JAPE (Sun, Hu, and Li, 2017) , IPTransE (Zhu et al., 2017) , BootEA (Sun, Hu, and Li, 2017) , GCN , GM (Xu et al., 2019) and RDGCN (Wu et al., 2019) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-198",
"text": "Model Variants."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-199",
"text": "To evaluate different reasoning methods, we provide three implementation variants of our model for ablation studies, including (1) X-EHD: the baseline model X that only uses our proposed Easy-to-Hard Decoding strategy;"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-200",
"text": "(2) X-JEA: the baseline model X that only uses our proposed Joint Entity Alignment method; (3) X-EHD-JEA: the baseline model X that uses both of these two reasoning methods."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-202",
"text": "For the configurations of the alignment model, we use the same settings as Xu et al. (2019) ."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-203",
"text": "Specifically, we use the Adam optimizer (Kingma and Ba, 2014) to update parameters with mini-batch size 32."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-204",
"text": "The learning rate is set to 0.001."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-205",
"text": "The hop size of two GCN layers is set to 2 and 3, respectively."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-206",
"text": "Following Wu et al. (2019) , we use Google Translate to translate Chinese, Japanese, and French entity names into English, and then use Glove embeddings (Pennington, Socher, and Manning, 2014) to construct the initial entity representations in the model."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-207",
"text": "For all datasets, we first use the baseline model to retrieve the top 10 alignments, normalize their scores as probabilities and then perform the proposed coordinated reasoning methods over them."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-208",
"text": "For the Easy-to-Hard decoding method, \u03b1 is set to 0.75, and K is set to 20."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-209",
"text": "For the joint entity alignment, \u03c4 is set to 0.10."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-210",
"text": "For training the enhanced alignment model, for each topic graph pair, we randomly choose at most two gold alignments from the ground truth as the simulated easy alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-211",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-212",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-213",
"text": "Main Results Table 2 shows the performance of all compared approaches on the evaluation datasets."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-214",
"text": "We can see that both of the Easyto-Hard decoding strategy (referred as EHD in Table 2 ) and the joint entity alignment method (referred as JEA in Table 2) could significantly improve the performance of GM."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-215",
"text": "When these two methods are combined, the overall performance is further improved, outperforming previous works."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-216",
"text": "We also investigate whether our proposed reasoning methods could also boost existing baselines."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-217",
"text": "From Table 2 , we can see also that the joint entity alignment method could also improve the performance of GCN, BootEA and RDGCN, indicating that our method is able to avoid the many-to-one problem effectively."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-218",
"text": "Recall that, the Easy-to-Hard decoding method requires an enhanced alignment model that could integrate the easy alignment information."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-219",
"text": "Since designing enhanced versions for these baselines is beyond our goal, here we only enforce that that the aligned entities found in the easy alignments have the same surface form."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-220",
"text": "We find that this simplified strategy could still improve these baselines, which also suggests that our proposed decoding strategy is generally helpful to the alignment models."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-222",
"text": "**DISCUSSION**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-223",
"text": "Let us first look at the impacts of alignment-dropping threshold \u03c4 to both the performance and running time for our joint entity alignment algorithm."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-224",
"text": "From decreasing \u03c4 can slightly improve the performance but with a huge cost of computation time."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-225",
"text": "For example, when \u03c4 is changed from 0.15 to 0.10, the accuracy could increase by 0.12% but the computation time dramatically increases from 39s to almost 25 minutes."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-226",
"text": "Moreover, if \u03c4 is set to 0.05, we cannot even get the results."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-227",
"text": "As shown in Table 3 , in order to better understand why the running time changes, we additionally analyze the size of the largest sub-space."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-228",
"text": "We find that the size of the maximal sub-space under \u03c4 = 0.05 is 3 times more than the size under \u03c4 = 0.10, thus the running time under \u03c4 = 0.05 is expected to be roughly 32 hours, which is 81 (3 4 ) times than the time under \u03c4 = 0.10."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-229",
"text": "The running time does not significantly change when increasing \u03c4 from 0.15 to 0.20, because the Hungarian algorithm does not take much time for this situation, and the most time consumption is data processing."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-230",
"text": "We also investigated the impact of the probability threshold \u03b1 on the performance for our Easy-to-Hard decoding method."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-231",
"text": "We experimented with different \u03b1 values and evaluated our model on the development set of the DBP15K."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-232",
"text": "Table 4 reports hit@1 accuracies on these datasets."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-233",
"text": "We can see that our model could benefit from decreasing \u03b1 until it reaches 0.75."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-234",
"text": "It is expected to find that lower \u03b1 may hurt the performance since it incorporates some incorrect predictions as easy (gold) alignments into the model."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-235",
"text": "Recall that in our decoding algorithm, we continuously perform the inference until less than K new easy alignments are found in the previous round."
},
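The decoding loop just described can be sketched as follows. Everything here is an illustrative assumption: `model(pair, easy)` stands in for one forward pass of the (enhanced) alignment model given the easy alignments fixed so far, and is not an interface from the paper.

```python
def easy_to_hard_decode(model, pairs, alpha, K, max_rounds=10):
    """Iterative Easy-to-Hard decoding sketch (hypothetical interface).

    model(pair, easy) -> (alignment, prob) for one entity pair, given
    previously fixed easy alignments. Alignments whose probability
    exceeds alpha are frozen as 'easy' and fed back into the next
    round; decoding stops when a round yields fewer than K new easy
    alignments (or after max_rounds)."""
    easy = {}
    for _ in range(max_rounds):
        new = {}
        for pair in pairs:
            if pair in easy:
                continue  # already resolved in an earlier round
            alignment, prob = model(pair, easy)
            if prob > alpha:
                new[pair] = alignment
        easy.update(new)
        if len(new) < K:
            break
    return easy
```

A toy model whose confidence grows as more easy alignments become available shows the intended behavior: confident pairs are fixed first, and their evidence lifts the previously uncertain pairs above the threshold in later rounds.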
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-236",
"text": "As shown in Table 4 , we observed that decreasing \u03b1 not only achieves worse performance but also requires more converge rounds."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-237",
"text": "To better understand why more converge rounds are required, we analyzed the intermediate established alignments during the inference."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-238",
"text": "We find this is due to those incorrect alignments introduced by reducing \u03b1 produce a chain reaction, which offers the model more confidence about some uncertain but incorrect alignments, resulting in more decoding rounds."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-239",
"text": "Recall that there are two options to build the enhanced alignment model, where the first one directly replaces two layers of a pre-trained GM model with our proposed enhanced layers while keeping the parameters the same; the second one trains a new GM model with simulated easy alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-240",
"text": "We evaluate these two options on several datasets and observe that both of these two ways could improve the performance but the model could gain more performance improvement from the second way."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-241",
"text": "We further manually analyze the predicted alignments of these two options and find that the new trained GM model could resolve more ambiguous (hard) alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-242",
"text": "We think this is due to that introducing the simulated easy alignments into the training phase could allow the model to learn how to properly utilize these additional evidence to disambiguate the hard alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-243",
"text": "Here one natural question is how many simulated easy alignments are needed for training the new GM model."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-244",
"text": "In experiments, we find that using two simulated easy alignments to train the model could achieve the best performance; introducing more easy alignments to train the model could not further improve the results."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-245",
"text": "However, this observation is in conflict with our intuition, that is, more easy alignment information could better help the model to disambiguate those uncertain predictions."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-246",
"text": "By analyzing the entities in the test set, we find this is due to that among these entities, at most three entities co-occur in the same topic graphs, and consequently, during the decoding, the model could only introduce at most two easy alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-247",
"text": "Motivated by this observation, we conducted an additional experiment that predicts alignments for all entities in the KGs except the training seeds."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-248",
"text": "We find that our reasoning methods could achieve more performance improvement, and considering more than two easy alignments into the training also further improves the overall performance as we expected."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-249",
"text": "Note that, although this experiment may consume almost 5 times more than the original decoding time, we believe that some optimization could be adopted to reduce the time complexity, which we leave for the future work."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-250",
"text": "----------------------------------"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-251",
"text": "**CONCLUSION**"
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-252",
"text": "Previous entity alignment methods mainly use the same decoding strategy that independently chooses the optimal local match for each source entity without considering the global alignment coherence, thereby may cause the manyto-one problem."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-253",
"text": "To address this, we propose two reasoning method, including a new Easy-to-Hard decoding strategy and joint entity alignment method."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-254",
"text": "Specifically, the Easyto-Hard decoding method iteratively decodes the test set by taking the most model-confident alignments predicted in the previous iteration as additional inputs to the current iteration for resolving the model-uncertain alignments."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-255",
"text": "The joint entity alignment method views the entity alignment as the task assignment problem and employs the Hungarian algorithm to guarantee the predicted alignments are one-to-one mappings."
},
{
"sent_id": "d7dba136667d6058bf46d6ede3f2ef-C001-256",
"text": "Experimental results on the DBP15K dataset show that our reasoning methods are general to these baselines and can significantly improve their performance."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d7dba136667d6058bf46d6ede3f2ef-C001-19"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-20"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-61"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-137"
]
],
"cite_sentences": [
"d7dba136667d6058bf46d6ede3f2ef-C001-19",
"d7dba136667d6058bf46d6ede3f2ef-C001-20",
"d7dba136667d6058bf46d6ede3f2ef-C001-61",
"d7dba136667d6058bf46d6ede3f2ef-C001-137"
]
},
"@USE@": {
"gold_contexts": [
[
"d7dba136667d6058bf46d6ede3f2ef-C001-24"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-92"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-105"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-197"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-202"
]
],
"cite_sentences": [
"d7dba136667d6058bf46d6ede3f2ef-C001-24",
"d7dba136667d6058bf46d6ede3f2ef-C001-92",
"d7dba136667d6058bf46d6ede3f2ef-C001-105",
"d7dba136667d6058bf46d6ede3f2ef-C001-197",
"d7dba136667d6058bf46d6ede3f2ef-C001-202"
]
},
"@EXT@": {
"gold_contexts": [
[
"d7dba136667d6058bf46d6ede3f2ef-C001-105"
],
[
"d7dba136667d6058bf46d6ede3f2ef-C001-131"
]
],
"cite_sentences": [
"d7dba136667d6058bf46d6ede3f2ef-C001-105",
"d7dba136667d6058bf46d6ede3f2ef-C001-131"
]
},
"@DIF@": {
"gold_contexts": [
[
"d7dba136667d6058bf46d6ede3f2ef-C001-130"
]
],
"cite_sentences": [
"d7dba136667d6058bf46d6ede3f2ef-C001-130"
]
}
}
},
"ABC_00a2e4d0cacfb1fb7098bd324d960a_9": {
"x": [
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-18",
"text": "p(w 1:T ) = p(w 1 )"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-19",
"text": "T \u22121 t=1 p(w t+1 |w 1:t )."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-16",
"text": "Let w 1:T be a word sequence with length T : w 1 , ..., w T ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-2",
"text": "This paper proposes a state-of-the-art recurrent neural network (RNN) language model that combines probability distributions computed not only from a final RNN layer but also from middle layers."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-3",
"text": "Our proposed method raises the expressive power of a language model based on the matrix factorization interpretation of language modeling introduced by Yang et al. (2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-4",
"text": "The proposed method improves the current state-of-the-art language model and achieves the best score on the Penn Treebank and WikiText-2, which are the standard benchmark datasets."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-5",
"text": "Moreover, we indicate our proposed method contributes to two application tasks: machine translation and headline generation."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-6",
"text": "Our code is publicly available at: https://github.com/nttcslabnlp/doc lm."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-9",
"text": "Neural network language models have played a central role in recent natural language processing (NLP) advances."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-10",
"text": "For example, neural encoderdecoder models, which were successfully applied to various natural language generation tasks including machine translation , summarization (Rush et al., 2015) , and dialogue (Wen et al., 2015) , can be interpreted as conditional neural language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-11",
"text": "Neural language models also positively influence syntactic parsing (Dyer et al., 2016; Choe and Charniak, 2016) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-12",
"text": "Moreover, such word embedding methods as Skipgram (Mikolov et al., 2013) and vLBL (Mnih and Kavukcuoglu, 2013) originated from neural language models designed to handle much larger vocabulary and data sizes."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-13",
"text": "Neural language models can also be used as contextualized word representations (Peters et al., 2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-14",
"text": "Thus, language modeling is a good benchmark task for investigating the general frameworks of neural methods in NLP field."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-17",
"text": "We obtain the joint probability of word sequence w 1:T as follows:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-15",
"text": "In language modeling, we compute joint probability using the product of conditional probabilities."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-20",
"text": "( 1) p(w 1 ) is generally assumed to be 1 in this literature, that is, p(w 1 ) = 1, and thus we can ignore its calculation."
},
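The chain-rule factorization above can be illustrated with a toy example. The bigram table below is a hypothetical stand-in for illustration only; an RNN language model would compute each conditional probability from its hidden state rather than from a lookup table.

```python
# Toy conditional probabilities p(w_{t+1} | w_{1:t}), simplified to
# bigrams for illustration. An RNN would produce these distributions
# from its encoded fixed-length vector instead.
cond = {("<s>", "the"): 0.5,
        ("the", "cat"): 0.2,
        ("cat", "sat"): 0.4}

def sequence_prob(words):
    """Joint probability of a word sequence via the chain rule,
    with p(w_1) assumed to be 1 as in the text."""
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= cond[(prev, nxt)]
    return p

print(sequence_prob(["<s>", "the", "cat", "sat"]))  # -> 0.04 (up to float rounding)
```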
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-21",
"text": "See the implementation of Zaremba et al. (2014) 1 , for an example."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-22",
"text": "RNN language models obtain conditional probability p(w t+1 |w 1:t ) from the probability distribution of each word."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-23",
"text": "To compute the probability distribution, RNN language models encode sequence w 1:t into a fixed-length vector and apply a transformation matrix and the softmax function."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-24",
"text": "Previous researches demonstrated that RNN language models achieve high performance by using several regularizations and selecting appropriate hyperparameters (Melis et al., 2018; Merity et al., 2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-25",
"text": "However, Yang et al. (2018) proved that existing RNN language models have low expressive power due to the Softmax bottleneck, which means the output matrix of RNN language models is low rank when we interpret the training of RNN language models as a matrix factorization problem."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-26",
"text": "To solve the Softmax bottleneck, Yang et al. (2018) proposed Mixture of Softmaxes (MoS), which increases the rank of the matrix by combining multiple probability distributions computed from the encoded fixed-length vector."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-27",
"text": "In this study, we propose Direct Output Connection (DOC) as a generalization of MoS. For stacked RNNs, DOC computes the probability distributions from the middle layers including input embeddings."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-28",
"text": "In addition to raising the rank, the proposed method helps weaken the vanishing gradient problem in backpropagation because DOC provides a shortcut connection to the output."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-29",
"text": "We conduct experiments on standard benchmark datasets for language modeling: the Penn Treebank and WikiText-2."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-30",
"text": "Our experiments demonstrate that DOC outperforms MoS and achieves state-of-theart perplexities on each dataset."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-31",
"text": "Moreover, we investigate the effect of DOC on two applications: machine translation and headline generation."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-32",
"text": "We indicate that DOC can improve the performance of an encoder-decoder with an attention mechanism, which is a strong baseline for such applications."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-33",
"text": "In addition, we conduct an experiment on the Penn Treebank constituency parsing task to investigate the effectiveness of DOC."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-35",
"text": "**RNN LANGUAGE MODEL**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-36",
"text": "In this section, we briefly overview RNN language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-37",
"text": "Let V be the vocabulary size and let P t \u2208 R V be the probability distribution of the vocabulary at timestep t. Moreover, let D h n be the dimension of the hidden state of the n-th RNN, and let D e be the dimensions of the embedding vectors."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-38",
"text": "Then the RNN language models predict probability distribution P t+1 by the following equation:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-39",
"text": "where W \u2208 R V \u00d7D h N is a weight matrix 2 , E \u2208 R De\u00d7V is a word embedding matrix, x t \u2208 {0, 1} V is a one-hot vector of input word w t at timestep t, and h n t \u2208 R D h n is the hidden state of the n-th RNN at timestep t. We define h n t at timestep t = 0 as a zero vector: h n 0 = 0."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-40",
"text": "Let f (\u00b7) represent an abstract function of an RNN, which might be the Elman network (Elman, 1990) , the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) , the Recurrent Highway Network (RHN) (Zilly et al., 2017) , or any other RNN variant."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-41",
"text": "In this research, we stack three LSTM layers based on Merity et al. (2018) because they achieved high performance."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-42",
"text": "3 Language Modeling as Matrix Factorization Yang et al. (2018) indicated that the training of language models can be interpreted as a matrix 2 Actually, we apply a bias term in addition to the weight matrix but we omit it to simplify the following discussion."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-43",
"text": "factorization problem."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-44",
"text": "In this section, we briefly introduce their description."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-45",
"text": "Let word sequence w 1:t be context c t ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-46",
"text": "Then we can regard a natural language as a finite set of the pairs of a context and its conditional probability distribution: L = {(c 1 , P * (X|c 1 )), ..., (c U , P * (X|c U ))}, where U is the number of possible contexts and X \u2208 {0, 1} V is a variable representing a onehot vector of a word."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-47",
"text": "Here, we consider matrix A \u2208 R U \u00d7V that represents the true log probability distributions and matrix H \u2208 R U \u00d7D h N that contains the hidden states of the final RNN layer for each context c t :"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-48",
"text": "Then we obtain set of matrices F (A) = {A + \u039bS}, where S \u2208 R U \u00d7V is an all-ones matrix, and \u039b \u2208 R U \u00d7U is a diagonal matrix."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-49",
"text": "F (A) contains matrices that shifted each row of A by an arbitrary real number."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-50",
"text": "In other words, if we take a matrix from F (A) and apply the softmax function to each of its rows, we obtain a matrix that consists of true probability distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-51",
"text": "Therefore, for some A \u2208 F (A), training RNN language models is to find the parameters satisfying the following equation:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-52",
"text": "Equation 6 indicates that training RNN language models can also be interpreted as a matrix factorization problem."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-53",
"text": "In most cases, the rank of matrix HW is D h N because D h N is smaller than V and U in common RNN language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-54",
"text": "Thus, an RNN language model cannot express true distributions if D h N is much smaller than rank(A )."
},
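The rank cap described above is easy to verify numerically. A small NumPy sketch with toy sizes (all names and dimensions are illustrative assumptions): the logit matrix HW cannot exceed rank D_{h^N} no matter how large U and V are.

```python
import numpy as np

rng = np.random.default_rng(0)
U, V, D = 50, 30, 5            # toy numbers of contexts, words, hidden units (D << V)
H = rng.normal(size=(U, D))    # hidden states of the final layer, one row per context
W = rng.normal(size=(V, D))    # output weight matrix

logits = H @ W.T               # U x V matrix of (unnormalized) log probabilities
# The rank of HW is capped by the hidden dimension D, not by U or V.
assert np.linalg.matrix_rank(logits) == D
```

With random Gaussian factors the product attains the cap D almost surely, which is still far below the rank V that the argument above attributes to the true distribution matrix.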
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-55",
"text": "Yang et al. (2018) also argued that rank(A ) is as high as vocabulary size V based on the following two assumptions:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-56",
"text": "1. Natural language is highly context-dependent."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-57",
"text": "In addition, since we can imagine many kinds of contexts, it is difficult to assume a basis that represents a conditional probability distribution for any contexts."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-58",
"text": "In other words, compressing U is difficult."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-59",
"text": "2. Since we also have many kinds of semantic meanings, it is difficult to assume basic meanings that can create all other semantic meanings by such simple operations as addition and subtraction; compressing V is difficult."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-60",
"text": "In summary, Yang et al. (2018) indicated that D h N is much smaller than rank(A) because its scale is usually 10 2 and vocabulary size V is at least 10 4 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-61",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-62",
"text": "**PROPOSED METHOD: DIRECT OUTPUT CONNECTION**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-63",
"text": "To construct a high-rank matrix, Yang et al. (2018) proposed Mixture of Softmaxes (MoS)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-64",
"text": "MoS computes multiple probability distributions from the hidden state of final RNN layer h N and regards the weighted average of the probability distributions as the final distribution."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-65",
"text": "In this study, we propose Direct Output Connection (DOC), which is a generalization method of MoS. DOC computes probability distributions from the middle layers in addition to the final layer."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-66",
"text": "In other words, DOC directly connects the middle layers to the output."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-67",
"text": "Figure 1 shows an overview of DOC, that uses the middle layers (including word embeddings) to compute the probability distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-68",
"text": "Figure 1 computes three probability distributions from all the layers, but we can vary the number of probability distributions for each layer and select some layers to avoid."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-69",
"text": "In our experiments, we search for the appropriate number of probability distributions for each layer."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-70",
"text": "Formally, instead of Equation 2, DOC computes the output probability distribution at timestep t + 1 by the following equation:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-71",
"text": "s.t."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-72",
"text": "where \u03c0 j,ct is a weight for each probability distribution, k j,ct \u2208 R d is a vector computed from each hidden state h n , andW \u2208 R V \u00d7d is a weight matrix."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-73",
"text": "Thus, P t+1 is the weighted average of J probability distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-74",
"text": "We define the U \u00d7 U diagonal matrix whose elements are weight \u03c0 j,c for each context c as \u03a6. Then we obtain matrix\u00c3 \u2208 R U \u00d7V :"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-75",
"text": "where K j \u2208 R U \u00d7d is a matrix whose rows are vector k j,c .\u00c3 can be an arbitrary high rank because the righthand side of Equation 9 computes not only the matrix multiplication but also a nonlinear function."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-76",
"text": "Therefore, an RNN language model with DOC can output a distribution matrix whose rank is identical to one of the true distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-77",
"text": "In other words,\u00c3 is a better approximation of A than the output of a standard RNN language model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-78",
"text": "Next we describe how to acquire weight \u03c0 j,ct and vector k j,ct ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-79",
"text": "Let \u03c0 ct \u2208 R J be a vector whose elements are weight \u03c0 j,ct ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-80",
"text": "Then we compute \u03c0 ct from the hidden state of the final RNN layer:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-81",
"text": "where W \u03c0 \u2208 R J\u00d7D h N is a weight matrix."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-82",
"text": "We next compute k j,ct from the hidden state of the n-th RNN layer:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-83",
"text": "where W j \u2208 R d\u00d7D h n is a weight matrix."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-84",
"text": "In addition, let i n be the number of k j,ct from h n t ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-85",
"text": "Then we define the sum of i n for all n as J; that is, N n=0 i n = J. In short, DOC computes J probability distributions from all the layers, including the input embedding (h 0 )."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-86",
"text": "For i N = J, DOC becomes identical to MoS. In addition to increasing the rank, we expect that DOC weakens the vanishing gradient problem during backpropagation because a middle layer is directly connected to the output, such as with the auxiliary classifiers described in Szegedy et al. (2015) ."
},
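Putting Equations 7 through 11 together, the DOC output step can be sketched schematically in NumPy. All dimensions, names, and the tanh nonlinearity are illustrative assumptions for exposition (the text only states that a nonlinear function is applied); this is not the released implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 7, 4                      # toy vocabulary size and dimension of the k vectors
D_h = {0: 4, 1: 5, 2: 5, 3: 6}   # layer widths; h^0 is the input embedding
i_n = {0: 0, 1: 0, 2: 1, 3: 2}   # distributions drawn from each layer; their sum is J
J = sum(i_n.values())

h = {n: rng.normal(size=D_h[n]) for n in D_h}  # hidden states at one timestep
W_out = rng.normal(size=(V, d))                # shared output matrix (Eq. 8)
W_pi = rng.normal(size=(J, D_h[3]))            # weight-computation matrix (Eq. 10)

pi = softmax(W_pi @ h[3])        # mixture weights from the final layer (Eq. 10)
ks = [np.tanh(rng.normal(size=(d, D_h[n])) @ h[n])   # k_j from layer n (Eq. 11)
      for n, cnt in i_n.items() for _ in range(cnt)]

# Eq. 7: the output is the weighted average of J softmax distributions
P = sum(w * softmax(W_out @ k) for w, k in zip(pi, ks))
assert np.isclose(P.sum(), 1.0)
```

Setting i_n to zero for every layer except the final one (i_N = J) recovers MoS, matching the statement above.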
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-87",
"text": "For a network that computes the weights for several vectors, such as Equation 10, Shazeer et al. (2017) indicated that it often converges to a state where it always produces large weights for few vectors."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-88",
"text": "In fact, we observed that DOC tends to assign large weights to shallow layers."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-115",
"text": "Table 2 summarizes the hyperparameters of our experiments."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-89",
"text": "To prevent this phenomenon, we compute the coefficient of variation of Equation 10 in each mini-batch as a regularization term following Shazeer et al. (2017) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-90",
"text": "In other words, we try to adjust the sum of the weights for each probability distribution with identical values in each mini-batch."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-91",
"text": "Formally, we compute the following equation for a mini-batch consisting of w b , w b+1 , ..., wb:"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-92",
"text": "where functions std(\u00b7) and avg(\u00b7) are functions that respectively return an input's standard deviation and its average."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-93",
"text": "In the training step, we add \u03bb \u03b2 multiplied by weight coefficient \u03b2 to the loss function."
},
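This load-balancing term, following Shazeer et al. (2017), penalizes the squared coefficient of variation of the per-distribution weight sums over a mini-batch. A minimal NumPy sketch (the function name and the batch layout are assumptions for illustration):

```python
import numpy as np

def cv_squared_penalty(pi_batch):
    """pi_batch: (batch, J) mixture weights (Eq. 10) for each token in the mini-batch.
    Returns beta, the squared coefficient of variation of the column sums."""
    sums = pi_batch.sum(axis=0)   # total weight given to each of the J distributions
    cv = sums.std() / sums.mean() # coefficient of variation
    return cv ** 2

# A batch that always favors distribution 0 is penalized ...
skewed = np.array([[0.9, 0.05, 0.05]] * 8)
# ... while perfectly balanced usage gives zero penalty.
balanced = np.full((8, 3), 1.0 / 3.0)
assert cv_squared_penalty(skewed) > cv_squared_penalty(balanced) == 0.0
```

During training this value, scaled by \u03bb_\u03b2, would be added to the cross-entropy loss.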
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-95",
"text": "**EXPERIMENTS ON LANGUAGE MODELING**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-96",
"text": "We investigate the effect of DOC on the language modeling task."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-97",
"text": "In detail, we conduct word-level prediction experiments and show that DOC improves the performance of MoS, which only uses the final layer to compute the probability distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-98",
"text": "Moreover, we evaluate various combinations of layers to explore which combination achieves the best score."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-100",
"text": "**DATASETS**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-101",
"text": "We used the Penn Treebank (PTB) (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) datasets, which are the standard benchmark datasets for the word-level language modeling task."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-102",
"text": "Mikolov et al. (2010) and Merity et al. (2017) respectively published preprocessed PTB 3 and WikiText-2 4 datasets."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-103",
"text": "Table 1 describes their statistics."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-104",
"text": "We used these preprocessed datasets for fair comparisons with previous studies."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-106",
"text": "**HYPERPARAMETERS**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-107",
"text": "Our implementation is based on the averaged stochastic gradient descent Weight-Dropped LSTM (AWD-LSTM) 5 proposed by Merity et al. (2018 Table 3 : Perplexities of AWD-LSTM with DOC on the PTB dataset."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-108",
"text": "We varied the number of probability distributions from each layer in situation J = 20 except for the top row."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-109",
"text": "The top row ( \u2020) represents MoS scores reported in Yang et al. (2018) as a baseline."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-110",
"text": "\u2021 represents the perplexity obtained by the implementation of Yang et al. (2018) 6 with identical hyperparameters except for i 3 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-111",
"text": "dropout rate for vector k j,ct and the non-monotone interval."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-112",
"text": "Since we found that the dropout rate for vector k j,ct greatly influences \u03b2 in Equation 13, we varied it from 0.3 to 0.6 with 0.1 intervals."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-113",
"text": "We selected 0.6 because this value achieved the best score on the PTB validation dataset."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-114",
"text": "For the nonmonotone interval, we adopted the same value as Zolna et al. (2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-116",
"text": "represents the number of probability distributions from hidden state h n t ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-117",
"text": "To find the best combination, we varied the number of probability distributions from each layer by fixing their total to 20: J = 20."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-118",
"text": "Moreover, the top row of Table 3 shows the perplexity of AWD-LSTM with MoS reported in Yang et al. (2018) for comparison."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-119",
"text": "Table 3 indicates that language models using middle layers outperformed one using only the final layer."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-120",
"text": "In addition, Table 3 shows that increasing the distributions from the final layer (i 3 = 20) degraded the score from the language model with i 3 = 15 (the top row of Table 3)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-121",
"text": "Thus, to obtain a superior language model, we should not increase the number of distributions from the final layer; we should instead use the middle layers, as with our proposed DOC."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-122",
"text": "Table 3 shows that the i 3 = 15, i 2 = 5 setting achieved the best performance and the other settings with shallow layers have a little effect."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-123",
"text": "This result implies that we need some layers to output accurate distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-124",
"text": "In fact, most previous studies adopted two LSTM layers for language modeling."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-125",
"text": "This suggests that we need at least two layers to obtain high-quality distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-127",
"text": "**RESULTS**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-128",
"text": "For the i 3 = 15, i 2 = 5 setting, we explored Table 6 : Perplexities of our implementations and reruns on the PTB dataset."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-129",
"text": "We set the non-monotone interval to 60."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-130",
"text": "\u2020 represents results obtained by original implementations with identical hyperparameters except for non-monotone interval."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-131",
"text": "\u2021 indicates the result obtained by our AWD-LSTM-MoS implementation with identical dropout rates as AWD-LSTM-DOC."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-132",
"text": "For (fin), we repeated fine-tuning until convergence."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-133",
"text": "the effect of \u03bb \u03b2 in {0, 0.01, 0.001, 0.0001}. Although Table 3 shows that \u03bb \u03b2 = 0.001 achieved the best perplexity, the effect is not consistent."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-134",
"text": "Table 4 shows the coefficient of variation of Equation 10, i.e., \u221a \u03b2 in the PTB dataset."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-135",
"text": "This table demonstrates that the coefficient of variation decreases with growth in \u03bb \u03b2 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-136",
"text": "In other words, the model trained with a large \u03bb \u03b2 assigns balanced weights to each probability distribution."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-137",
"text": "These results indicate that it is not always necessary to equally use each probability distribution, but we can acquire a better model in some \u03bb \u03b2 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-138",
"text": "Hereafter, we refer to the setting that achieved the best score (i 3 = 15, i 2 = 5, \u03bb \u03b2 = 0.001) as AWD-LSTM-DOC."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-139",
"text": "Table 5 shows the ranks of matrices containing log probability distributions from each method."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-140",
"text": "In other words, Table 5 describes\u00c3 in Equation 9 for each method."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-141",
"text": "As shown by this table, the output of AWD-LSTM is restricted to D 3 7 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-142",
"text": "In contrast, AWD-LSTM-MoS (Yang et al., 2018) and AWD-LSTM-DOC outputted matrices whose ranks equal the vocabulary size."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-143",
"text": "This fact indicates that DOC (including MoS) can output the same matrix as the true distributions in view of a rank."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-144",
"text": "Figure 2 illustrates the learning curves of each method on PTB."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-145",
"text": "This figure contains the validation scores of AWD-LSTM, AWD-LSTM-MoS, and AWD-LSTM-DOC at each training epoch."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-146",
"text": "We trained AWD-LSTM and AWD-LSTM-MoS by setting the non-monotone interval to 60, as with AWD-LSTM-DOC."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-147",
"text": "In other words, we used hyperparameters identical to the original ones to train AWD-LSTM and AWD-LSTM-MoS, except for the non-monotone interval."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-148",
"text": "We note that the optimization method converts the ordinary stochastic gradient descent (SGD) into the averaged SGD at the point where convergence almost occurs."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-149",
"text": "In Figure 2 , the turning point is the epoch when each method drastically decreases the perplexity."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-150",
"text": "Figure 2 shows that each method similarly reduces the perplexity at the beginning."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-151",
"text": "AWD-LSTM and AWD-LSTM-MoS were slow to decrease the perplexity from 50 epochs."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-152",
"text": "In contrast, AWD-LSTM-DOC constantly decreased the perplexity and achieved a lower value than the other methods with ordinary SGD."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-153",
"text": "Therefore, we conclude that DOC positively affects the training of language modeling."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-154",
"text": "Table 6 shows the AWD-LSTM, AWD-LSTMMoS, and AWD-LSTM-DOC results in our configurations."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-155",
"text": "For AWD-LSTM-MoS, we trained our implementation with the same dropout rates as AWD-LSTM-DOC for a fair comparison."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-156",
"text": "AWD-LSTM-DOC outperformed both the original AWD-LSTM-MoS and our implementation."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-157",
"text": "In other words, DOC outperformed MoS."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-158",
"text": "Since the averaged SGD uses the averaged parameters from each update step, the parameters of the early steps are harmful to the final parameters."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-159",
"text": "Therefore, when the model converges, recent studies and ours eliminate the history of and then retrains the model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-160",
"text": "Merity et al. (2018) referred to this retraining process as fine-tuning."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-161",
"text": "Although most previous studies only conducted fine-tuning once, Zolna et al. (2018) argued that two finetunings provided additional improvement."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-162",
"text": "Thus, we repeated fine-tuning until we achieved no more improvements in the validation data."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-163",
"text": "We refer to the model as AWD-LSTM-DOC (fin) in Table 6 , which shows that repeated fine-tunings improved the perplexity by about 0.5."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-164",
"text": "Tables 7 and 8 respectively show the perplexities of AWD-LSTM-DOC and previous studies on PTB and WikiText-2 8 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-165",
"text": "These tables show that AWD-LSTM-DOC achieved the best perplexity."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-166",
"text": "AWD-LSTM-DOC improved the perplexity by almost 2.0 on PTB and 3.5 on WikiText-2 from the state-of-the-art scores."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-167",
"text": "The ensemble technique provided further improvement, as described in previous studies (Zaremba et al., 2014; , and improved the perplexity by at least 4 points on both datasets."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-168",
"text": "Finally, the ensemble of the repeated finetuning models achieved 47.17 on the PTB test and 53.09 on the WikiText-2 test."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-170",
"text": "**EXPERIMENTS ON APPLICATION TASKS**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-171",
"text": "As described in Section 1, a neural encoder-decoder model can be interpreted as a conditional language model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-172",
"text": "To investigate the effect of DOC on an encoder-decoder model, we incorporate DOC into the decoder and examine its performance."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-173",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-174",
"text": "**DATASET**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-175",
"text": "We conducted experiments on machine translation and headline generation tasks."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-176",
"text": "For machine translation, we used two kinds of sentence pairs (EnglishGerman and English-French) in the IWSLT 2016 dataset 9 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-177",
"text": "The training set respectively contains about 189K and 208K sentence pairs of EnglishGerman and English-French."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-178",
"text": "We experimented in four settings: from English to German (En-De), its reverse (De-En), from English to French (En-Fr), and its reverse (Fr-En)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-179",
"text": "Headline generation is a task that creates a short summarization of an input sentence (Rush et al., 2015) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-180",
"text": "Rush et al. (2015) constructed a headline generation dataset by extracting pairs of first sentences of news articles and their headlines from the annotated English Gigaword corpus (Napoles et al., 2012) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-181",
"text": "They also divided the extracted sentenceheadline pairs into three parts: training, validation, and test sets."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-182",
"text": "The training set contains about 3.8M sentence-headline pairs."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-183",
"text": "For our evaluation, we used the test set constructed by Zhou et al. (2017) because the one constructed by Rush et al. (2015) contains some invalid instances, as reported in Zhou et al. (2017) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-185",
"text": "**ENCODER-DECODER MODEL**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-186",
"text": "For the base model, we adopted an encoder-decoder with an attention mechanism described in Kiyono et al. (2017) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-187",
"text": "The encoder consists of a 2-layer bidirectional LSTM, and the decoder consists of a 2-layer LSTM with attention proposed by Luong et al. (2015) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-188",
"text": "We interpreted the layer after computing the attention as the 3rd layer of the decoder."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-189",
"text": "We refer to this encoder-decoder as EncDec."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-190",
"text": "For the hyperparameters, we followed the setting of Kiyono et al. (2017) except for the sizes of hidden states and embeddings."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-245",
"text": "8 Related Work Bengio et al. (2003) are pioneers of neural language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-191",
"text": "We used 500 for machine 9 https://wit3.fbk.eu/ (Rush et al., 2015) 37.41 15.87 34.70 SEASS (Zhou et al., 2017) 46.86 24.58 43.53 Kiyono et al. (2017) 46.34 24.85 43.49 Table 10 : ROUGE F1 scores in headline generation test data provided by Zhou et al. (2017) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-192",
"text": "RG in table denotes ROUGE."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-193",
"text": "For our implementations (the upper part), we report averages of three runs."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-194",
"text": "translation and 400 for headline generation."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-195",
"text": "We constructed a vocabulary set by using Byte-PairEncoding 10 (BPE) (Sennrich et al., 2016) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-196",
"text": "We set the number of BPE merge operations at 16K for the machine translation and 5K for the headline generation."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-197",
"text": "In this experiment, we compare DOC to the base EncDec."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-198",
"text": "We prepared two DOC settings: using only the final layer, that is, a setting that is identical to MoS, and using both the final and middle layers."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-199",
"text": "We used the 2nd and 3rd layers in the latter setting because this case achieved the best performance on the language modeling task in Section 5.3."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-200",
"text": "We set i 3 = 2 and i 2 = 2, i 3 = 2."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-201",
"text": "For this experiment, we modified a publicly available encode-decoder implementation 11 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-202",
"text": "Table 9 shows the BLEU scores of each method."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-203",
"text": "Since an initial value often drastically varies the result of a neural encoder-decoder, we reported the average of three models trained from different initial values and random seeds."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-204",
"text": "Table 9 indicates that EncDec+DOC outperformed EncDec."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-205",
"text": "Table 10 shows the ROUGE F1 scores of each method."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-206",
"text": "In addition to the results of our implementations (the upper part), the lower part represents the published scores reported in previous studies."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-207",
"text": "For the upper part, we reported the average of three models (as in Table 9 )."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-208",
"text": "EncDec+DOC outperformed EncDec on all scores."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-209",
"text": "Moreover, EncDec outperformed the state-of-the-art method (Zhou et al., 2017) on the ROUGE-2 and ROUGE-L F1 scores."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-210",
"text": "In other words, our baseline is already very strong."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-211",
"text": "We believe that this is because we adopted a larger embedding size than Zhou et al. (2017) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-212",
"text": "It is noteworthy that DOC improved the performance of EncDec even though EncDec is very strong."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-213",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-214",
"text": "**RESULTS**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-215",
"text": "These results indicate that DOC positively influences a neural encoder-decoder model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-216",
"text": "Using the middle layer also yields further improvement because EncDec+DOC (i 3 = i 2 = 2) outperformed EncDec+DOC (i 3 = 2)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-217",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-219",
"text": "Choe and Charniak (2016) achieved high F1 scores on the Penn Treebank constituency parsing task by transforming candidate trees into a symbol sequence (S-expression) and reranking them based on the perplexity obtained by a neural language model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-220",
"text": "To investigate the effectiveness of DOC, we evaluate our language models following their configurations."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-221",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-222",
"text": "**DATASET**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-223",
"text": "We used the Wall Street Journal of the Penn Treebank dataset."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-224",
"text": "We used the section 2-21 for training, 22 for validation, and 23 for testing."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-225",
"text": "We applied the preprocessing codes of Choe and Charniak (2016) 12 to the dataset and converted a token that appears fewer than ten times in the training dataset into a special token unk."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-226",
"text": "For reranking, we prepared 500 candidates obtained by the Charniak parser (Charniak, 2000) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-227",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-228",
"text": "**MODELS**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-229",
"text": "We compare AWD-LSTM-DOC with AWD-LSTM (Merity et al., 2018) and AWD-LSTMMoS (Yang et al., 2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-230",
"text": "We trained each model with the same hyperparameters from our language modeling experiments (Section 5)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-231",
"text": "We selected the model that achieved the best perplexity on the validation set during the training."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-232",
"text": "State-of-the-art results Dyer et al. (2016) 91.7 93.3 Fried et al. (2017) (ensemble) 92.72 94.25 Suzuki et al. (2018) (ensemble) 92.74 94.32 Kitaev and Klein (2018) 95.13 - Moreover, AWD-LSTM-DOC outperformed AWD-LSTM and AWD-LSTM-MoS. These results correspond to the performance on the language modeling task (Section 5.3)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-233",
"text": "The middle part shows that AWD-LSTM-DOC also outperformed AWD-LSTM and AWD-LSTMMoS in the ensemble setting."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-234",
"text": "In addition, we can improve the performance by exchanging the base parser with a stronger one."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-235",
"text": "In fact, we achieved 94.29 F1 score by reranking the candidates from retrained Recurrent Neural Network Grammars (RNNG) (Dyer et al., 2016) 13 , that achieved 91.2 F1 score in our configuration."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-236",
"text": "Moreover, the lowest row of the middle part indicates the result by reranking the candidates from the retrained neural encoder-decoder based parser (Suzuki et al., 2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-237",
"text": "Our base parser has two different parts from Suzuki et al. (2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-238",
"text": "First, we used the sum of the hidden states of the forward and backward RNNs as the hidden layer for each RNN 14 ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-239",
"text": "Second, we tied the embedding matrix to the weight matrix to compute the probability distributions in the decoder."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-240",
"text": "The retrained parser achieved 93.12 F1 score."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-241",
"text": "Finally, we achieved 94.47 F1 score by reranking its candidates with AWD-LSTM-DOC."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-242",
"text": "We expect that we can achieve even better score by replacing the base parser with the current state-of-the-art one (Kitaev and Klein, 2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-243",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-246",
"text": "To address the curse of dimensionality in language modeling, they proposed a method using word embeddings and a feed-forward neural network (FFNN)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-247",
"text": "They demonstrated that their approach outperformed n-gram language models, but FFNN can only handle fixed-length contexts."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-248",
"text": "Instead of FFNN, Mikolov et al. (2010) applied RNN (Elman, 1990) to language modeling to address the entire given sequence as a context."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-249",
"text": "Their method outperformed the Kneser-Ney smoothed 5-gram language model (Kneser and Ney, 1995; Chen and Goodman, 1996) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-250",
"text": "Researchers continue to try to improve the performance of RNN language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-251",
"text": "Zaremba et al. (2014) used LSTM (Hochreiter and Schmidhuber, 1997) instead of a simple RNN for language modeling and significantly improved an RNN language model by applying dropout (Srivastava et al., 2014) to all the connections except for the recurrent connections."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-252",
"text": "To regularize the recurrent connections, Gal and Ghahramani (2016) proposed variational inference-based dropout."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-253",
"text": "Their method uses the same dropout mask at each timestep."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-254",
"text": "Zolna et al. (2018) proposed fraternal dropout, which minimizes the differences between outputs from different dropout masks to be invariant to the dropout mask."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-255",
"text": "Melis et al. (2018) used black-box optimization to find appropriate hyperparameters for RNN language models and demonstrated that the standard LSTM with proper regularizations can outperform other architectures."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-256",
"text": "Apart from dropout techniques, Inan et al. (2017) and Press and Wolf (2017) proposed the word tying method (WT), which unifies word embeddings (E in Equation 4) with the weight matrix to compute probability distributions (W in Equation 2)."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-257",
"text": "In addition to quantitative evaluation, Inan et al. (2017) provided a theoretical justification for WT and proposed the augmented loss technique (AL), which computes an objective probability based on word embeddings."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-258",
"text": "In addition to these regularization techniques, Merity et al. (2018) used DropConnect (Wan et al., 2013) and averaged SGD (Polyak and Juditsky, 1992) for an LSTM language model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-259",
"text": "Their AWD-LSTM achieved lower perplexity than Melis et al. (2018) on PTB and WikiText-2."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-260",
"text": "Previous studies also explored superior architecture for language modeling."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-261",
"text": "Zilly et al. (2017) proposed recurrent highway networks that use highway layers (Srivastava et al., 2015) to deepen recurrent connections."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-262",
"text": "Zoph and Le (2017) adopted reinforcement learning to construct the best RNN structure."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-263",
"text": "However, as mentioned, Melis et al. (2018) established that the standard LSTM is superior to these architectures."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-264",
"text": "Apart from RNN architecture, proposed the input-tooutput gate (IOG), which boosts the performance of trained language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-265",
"text": "As described in Section 3, Yang et al. (2018) interpreted training language modeling as matrix factorization and improved performance by computing multiple probability distributions."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-266",
"text": "In this study, we generalized their approach to use the middle layers of RNNs."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-267",
"text": "Finally, our proposed method, DOC, achieved the state-of-the-art score on the standard benchmark datasets."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-268",
"text": "Some studies provided methods that boost performance by using statistics obtained from test data."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-269",
"text": "Grave et al. (2017) extended a cache model (Kuhn and De Mori, 1990) for RNN language models."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-270",
"text": "Krause et al. (2017) proposed dynamic evaluation that updates parameters based on a recent sequence during testing."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-271",
"text": "Although these methods might also improve the performance of DOC, we omitted such investigation to focus on comparisons among methods trained only on the training set."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-272",
"text": "----------------------------------"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-273",
"text": "**CONCLUSION**"
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-274",
"text": "We proposed Direct Output Connection (DOC), a generalization method of MoS introduced by Yang et al. (2018) ."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-275",
"text": "DOC raises the expressive power of RNN language models and improves quality of the model."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-276",
"text": "DOC outperformed MoS and achieved the best perplexities on the standard benchmark datasets of language modeling: PTB and WikiText-2."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-277",
"text": "Moreover, we investigated its effectiveness on machine translation and headline generation."
},
{
"sent_id": "00a2e4d0cacfb1fb7098bd324d960a-C001-278",
"text": "Our results show that DOC also improved the performance of EncDec and using a middle layer positively affected such application tasks."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-3"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-109"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-118"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-229"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-274"
]
],
"cite_sentences": [
"00a2e4d0cacfb1fb7098bd324d960a-C001-3",
"00a2e4d0cacfb1fb7098bd324d960a-C001-109",
"00a2e4d0cacfb1fb7098bd324d960a-C001-118",
"00a2e4d0cacfb1fb7098bd324d960a-C001-229",
"00a2e4d0cacfb1fb7098bd324d960a-C001-274"
]
},
"@MOT@": {
"gold_contexts": [
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-25"
]
],
"cite_sentences": [
"00a2e4d0cacfb1fb7098bd324d960a-C001-25"
]
},
"@BACK@": {
"gold_contexts": [
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-26"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-55",
"00a2e4d0cacfb1fb7098bd324d960a-C001-56",
"00a2e4d0cacfb1fb7098bd324d960a-C001-57",
"00a2e4d0cacfb1fb7098bd324d960a-C001-59",
"00a2e4d0cacfb1fb7098bd324d960a-C001-60"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-63"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-265"
]
],
"cite_sentences": [
"00a2e4d0cacfb1fb7098bd324d960a-C001-26",
"00a2e4d0cacfb1fb7098bd324d960a-C001-55",
"00a2e4d0cacfb1fb7098bd324d960a-C001-60",
"00a2e4d0cacfb1fb7098bd324d960a-C001-63",
"00a2e4d0cacfb1fb7098bd324d960a-C001-265"
]
},
"@EXT@": {
"gold_contexts": [
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-26",
"00a2e4d0cacfb1fb7098bd324d960a-C001-27"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-42"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-63",
"00a2e4d0cacfb1fb7098bd324d960a-C001-64",
"00a2e4d0cacfb1fb7098bd324d960a-C001-65"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-265",
"00a2e4d0cacfb1fb7098bd324d960a-C001-266"
],
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-274",
"00a2e4d0cacfb1fb7098bd324d960a-C001-275"
]
],
"cite_sentences": [
"00a2e4d0cacfb1fb7098bd324d960a-C001-26",
"00a2e4d0cacfb1fb7098bd324d960a-C001-42",
"00a2e4d0cacfb1fb7098bd324d960a-C001-63",
"00a2e4d0cacfb1fb7098bd324d960a-C001-265",
"00a2e4d0cacfb1fb7098bd324d960a-C001-274"
]
},
"@SIM@": {
"gold_contexts": [
[
"00a2e4d0cacfb1fb7098bd324d960a-C001-141",
"00a2e4d0cacfb1fb7098bd324d960a-C001-142",
"00a2e4d0cacfb1fb7098bd324d960a-C001-143"
]
],
"cite_sentences": [
"00a2e4d0cacfb1fb7098bd324d960a-C001-142"
]
}
}
},
"ABC_6a054953660e465151e4d8a2223a76_9": {
"x": [
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-20",
"text": "However, these models take more time to instantiate in comparison to weighting of a co-occurrence matrix, bring more parameters to explore and produce vector spaces with uninterpretable dimensions (vector space dimension interpretation is used by some lexical mod-els, for example, McGregor et al. (2015) , and the passage from formal semantics to tensor models relies on it (Coecke et al., 2010) )."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-21",
"text": "In this work we focus on vector spaces that directly weight a co-occurrence matrix and report results for SVD, GloVe and SGNS from the study of Levy et al. (2015) for comparison."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-22",
"text": "The mismatch of recent experiments with nondense models in vector dimensionality between lexical and compositional tasks gives rise to a number of questions:"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-23",
"text": "\u2022 To what extent does model performance depend on vector dimensionality?"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-24",
"text": "\u2022 Do parameters influence 200K and 1K dimensional models similarly?"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-25",
"text": "Can the findings of Levy et al. (2015) be directly applied to models with a few thousand dimensions?"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-26",
"text": "\u2022 If not, can we derive suitable parameter selection heuristics which take account of dimensionality?"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-27",
"text": "To answer these questions, we perform a systematic study of distributional models with a rich set of parameters on SimLex-999 (Hill et al., 2014) , a lexical similairty dataset, and test selected models on MEN (Bruni et al., 2014) , a lexical relatedness dataset."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-28",
"text": "These datasets are currently widely used and surpass datasets stemming from information retrieval, WordSim-353 (Finkelstein et al., 2002) , and computational linguistics, RG65 (Rubenstein and Goodenough, 1965) , in quantity by having more entries and in quality by attention to evaluated relations (Milajevs and Griffiths, 2016) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-30",
"text": "**PARAMETERS**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-32",
"text": "**PMI VARIANTS (DISCR)**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-33",
"text": "Most co-occurrence weighting schemes in distributional semantics are based on point-wise mutual information (PMI, see e.g. Church and Hanks (1990) , Turney and Pantel (2010) , Levy and Goldberg (2014) ):"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-34",
"text": "As commonly done, we replace the infinite PMI values, 1 which arise when P (x, y) = 0, with zeroes and use PMI hereafter to refer to a weighting with this fix."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-35",
"text": "An alternative solution is to increment the probability ratio by 1; we refer to this as compressed PMI (CPMI, see e.g. McGregor et al. (2015) ):"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-36",
"text": "By incrementing the probability ratio by one, the PMI values from the segment of (\u2212\u221e; 0], when the joint probability P (x, y) is less than the chance P (x)P (y), are compressed into the segment of (0; 1]."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-37",
"text": "As the result, the space does not contain negative values, but has the same sparsity as the space with PMI values."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-39",
"text": "**SHIFTED PMI (NEG)**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-40",
"text": "Many approaches use only positive PMI values, as negative PMI values may not positively contribute to model performance and sparser matrices are more computationally tractable (Turney and Pantel, 2010) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-41",
"text": "This can be generalised to an additional cutoff parameter k (neg) following Levy et al. (2015) , giving our third PMI variant (abbreviated as SPMI): 2"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-42",
"text": "When k = 1 SPMI is equivalent to positive PMI."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-43",
"text": "k > 1 increases the underlying matrix sparsity by keeping only highly associated co-occurrence pairs."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-44",
"text": "k < 1 decreases the underlying matrix sparsity by including some unassociated cooccurrence pairs, which are usually excluded due to unreliability of probability estimates (Dagan et al., 1993) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-45",
"text": "We can apply the same idea to CPMI:"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-46",
"text": "2 SPMI is different from CPMI because log"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-48",
"text": "Figure 1: Effect of PMI variant (discr), smoothing (cds) and frequency weighting (freq) on SimLex-999."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-49",
"text": "Error bars correspond to a 95% confidence interval as the value is estimated by averaging over all the values of the omitted parameters: neg and similarity."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-51",
"text": "**FREQUENCY WEIGHTING (FREQ)**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-52",
"text": "Another issue with PMI is its bias towards rare events (Levy et al., 2015) ; one way of solving this issue is to weight the value by the co-occurrence frequency (Evert, 2005) :"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-53",
"text": "where n(x, y) is the number of times x was seen together with y. For clarity, we refer to n-weighted PMIs as nPMI, nSPMI, etc."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-54",
"text": "When this weighting component is set to 1, it has no effect; we can explicitly label it as 1PMI, 1SPMI, etc."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-55",
"text": "In addition to the extreme 1 and n weightings, we also experiment with a log n weighting."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-56",
"text": "Levy et al. (2015) show that performance is affected by smoothing the context distribution P (x):"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-58",
"text": "**CONTEXT DISTRIBUTION SMOOTHING (CDS)**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-59",
"text": "We experiment with \u03b1 = 1 (no smoothing) and \u03b1 = 0.75."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-60",
"text": "We call this estimation method local context probability; we can also estimate a global context probability based on the size of the corpus C:"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-61",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-62",
"text": "**VECTOR DIMENSIONALITY (D)**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-63",
"text": "As context words we select the 1K, 2K, 3K, 5K, 10K, 20K, 30K, 40K and 50K most frequent lemmatised nouns, verbs, adjectives and adverbs."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-64",
"text": "All context words are part of speech tagged, but we do not distinguish between refined word types (e.g. intransitive vs. transitive versions of verbs) and do not perform stop word filtering."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-65",
"text": "(Baroni et al., 2009 ) with a symmetric 5-word window ; our evaluation metric is the correlation with human judgements as is standard with SimLex (Hill et al., 2014) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-66",
"text": "We derive our parameter selection heuristics by greedily selecting parameters (cds, neg) that lead to the highest average performance for each combination of frequency weighting, PMI variant and dimensionality D. Figures 1 and 2 show the interaction of cds and neg with other parameters."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-67",
"text": "We also vary the similarity measure (cosine and correlation (Kiela and Clark, 2014) ), but do not report results here due to space limits."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-68",
"text": "Figure 2: The behaviour of shifted PMI (SPMI) on SimLex-999."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-69",
"text": "discr=spmi, freq=1 and neg=1 corresponds to positive PMI."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-70",
"text": "Error bars correspond to a 95% confidence interval as the value is estimated by averaging over all the values of the omitted parameters: cds and similarity."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-72",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-74",
"text": "**HEURISTICS**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-75",
"text": "PMI and CPMI PMI should be used with global context probabilities."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-76",
"text": "CPMI generally outperforms PMI, with less sensitivity to parameters; nCPMI and lognCPMI should be used with local context probabilities and 1CPMI should apply context distribution smoothing with \u03b1 = 0.75."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-77",
"text": "SPMI 10K dimensional 1SPMI is the least sensitive to parameter selection."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-78",
"text": "For models with D > 20K, context distribution smoothing should be used with \u03b1 = 0.75; for D < 20K, it is beneficial to use global context probabilities."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-79",
"text": "Shifting also depends on the dimensionality: models with D < 20K should set k = 0.7, but higherdimensional models should set k = 5."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-80",
"text": "There might be a finer-grained k selection criteria; however, we do not report this to avoid overfitting."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-81",
"text": "lognSPMI should be used with global context probabilities for models with D < 20K."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-82",
"text": "For higher-dimensional spaces, smoothing should be applied with \u03b1 = 0.75, as with 1SPMI."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-83",
"text": "Shifting should be applied with k = 0.5 for models with D < 20K, and k = 1.4 for D > 20K."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-84",
"text": "In contrast to 1SPMI, which might require change of k as the dimensionality increases, k = 1.4 is a much more robust choice for lognSPMI."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-85",
"text": "nSPMI gives good results with local context probabilities (\u03b1 = 1)."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-86",
"text": "Models with D < 20K should use k = 1.4, otherwise k = 5 is preferred."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-87",
"text": "SCPMI With 1SCPMI and D < 20K, global context probability should be used, with shifting set to k = 0.7."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-88",
"text": "Otherwise, local context probability should be used with \u03b1 = 0.75 and k = 2."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-89",
"text": "With nSCPMI and D < 20K, global context probability should be used with k = 1.4."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-90",
"text": "Otherwise, local context probability without smoothing and k = 5 is suggested."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-91",
"text": "For lognSCPMI, models with D < 20K should use global context probabilities and k = 0.7; otherwise, local context probabilities without smoothing should be preferred with k = 1.4."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-93",
"text": "**EVALUATION OF HEURISTICS**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-94",
"text": "We evaluate these heuristics by comparing the performance they give on SimLex-999 against that obtained using the best possible parameter selections (determined via an exhaustive search at each dimensionality setting)."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-95",
"text": "We also compare them to the best scores reported by Levy et al. (2015) for their model (PMI and SVD), word2vec-SGNS (Mikolov et al., 2013) and GloVe (Pennington et al., 2014 )-see Figure 3a , where only the betterperforming SPMI and SCPMI are shown."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-96",
"text": "For lognPMI and lognCPMI, our heuristics pick the best possible models."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-97",
"text": "For lognSPMI, where performance variance is low, the heuristics do well, giving a performance of no more than 0.01 points below the best configuration."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-98",
"text": "For 1SPMI and nSPMI the difference is higher."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-99",
"text": "With lognSCPMI and 1SCPMI, the heuristics follow Levy et al. (2015) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-100",
"text": "We also give our best score, SVD, SGNS and GloVe numbers from that study for comparison."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-101",
"text": "On the right, our heuristic in comparison to the best and average results together with the models selected using the recommendations presented in Levy et al. (2015) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-103",
"text": "In general n-weighted models do not perform as well as others."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-104",
"text": "Overall, log n weighting should be used with PMI, CPMI and SCPMI."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-105",
"text": "High-dimensional SPMI models show the same behaviour, but if D < 10K, no weighting should be applied."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-106",
"text": "SPMI and SCPMI should be preferred over CPMI and PMI."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-107",
"text": "As Figure 3b shows, our heuristics give performance close to the optimum for any dimensionality, with a large improvement over both an average parameter setting and the parameters suggested by Levy et al. (2015) in a high-dimensional setting."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-108",
"text": "4 Finally, to see whether the heuristics transfer robustly, we repeat this comparison on the MEN dataset (see Figures 3c, 3d) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-109",
"text": "Again, PMI and CPMI follow the best possible setup, with SPMI and SCPMI showing only a slight drop below ideal performance; and again, the heuristic settings give performance close to the optimum, and significantly higher than average or standard parameters."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-111",
"text": "**CONCLUSION**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-112",
"text": "This paper presents a systematic study of cooccurrence quantification focusing on the selection of parameters presented in Levy et al. (2015) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-113",
"text": "We replicate their recommendation for high-dimensional vector spaces, and show that with appropriate parameter selection it is possible to achieve comparable performance with spaces of dimensionality of 1K to 50K, and propose a set of model selection heuristics that maximizes performance."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-114",
"text": "We foresee the results of the paper are generalisable to other experiments, since model selection was performed on a similarity dataset, and was additionally tested on a relatedness dataset."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-115",
"text": "In general, model performance depends on vector dimensionality (the best setup with 50K dimensions is better than the best setup with 1K dimensions by 0.03 on SimLex-999)."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-116",
"text": "Spaces with a few thousand dimensions benefit from being dense and unsmoothed (k < 1, global context probability); while high-dimensional spaces are better sparse and smooth (k > 1, \u03b1 = 0.75)."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-117",
"text": "However, for unweighted and n-weighted models, these heuristics do not guarantee the best possible result because Table 2 : Our model in comparison to the previous work."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-118",
"text": "On the similarity dataset our model is 0.008 points behind a PPMI model, however on the relatedness dataset 0.020 points above."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-119",
"text": "Note the difference in dimensionality, source corpora and window size."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-120",
"text": "SVD, SGNS and GloVe numbers are given for comparison."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-121",
"text": "* Results reported by Levy et al. (2015) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-122",
"text": "of the high variance of the corresponding scores."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-123",
"text": "Based on this we suggest to use lognSPMI or lognSCPMI with dimensionality of at least 20K to ensure good performance on lexical tasks."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-124",
"text": "There are several directions for the future work."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-125",
"text": "Our experiments show that models with a few thousand dimensions are competitive with more dimensional models, see Figure 3 ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-126",
"text": "Moreover, for these models, unsmoothed probabilities give the best result."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-127",
"text": "It might be the case that due to the large size of the corpus used, the probability estimates for the most frequent words are reliable without smoothing."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-128",
"text": "More experiments need to be done to see whether this holds for smaller corpora."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-129",
"text": "The similarity datasets are transferred to other languages (Leviant and Reichart, 2015) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-130",
"text": "The future work might investigate whether our results hold for languages other than English."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-131",
"text": "The qualitative influence of the parameters should be studied in depth with extensive error analysis on how parameter selection changes similarity judgements."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-2",
"text": "Previous optimisations of parameters affecting the word-context association measure used in distributional vector space models have focused either on highdimensional vectors with hundreds of thousands of dimensions, or dense vectors with dimensionality of a few hundreds; but dimensionality of a few thousands is often applied in compositional tasks as it is still computationally feasible and does not require the dimensionality reduction step."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-3",
"text": "We present a systematic study of the interaction of the parameters of the association measure and vector dimensionality, and derive parameter selection heuristics that achieve performance across word similarity and relevance datasets competitive with the results previously reported in the literature achieved by highly dimensional or dense models."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-4",
"text": "----------------------------------"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-5",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-6",
"text": "Words that occur in similar context have similar meaning (Harris, 1954) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-7",
"text": "Thus the meaning of a word can be modeled by counting its cooccurrence with neighboring words in a corpus."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-8",
"text": "Distributional models of meaning represent cooccurrence information in a vector space, where the dimensions are the neighboring words and the values are co-occurrence counts."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-9",
"text": "Successful models need to be able to discriminate co-occurrence information, as not all co-occurrence counts are equally useful, for instance, the co-occurrence with the article the is less informative than with the noun existence."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-10",
"text": "The discrimination is usually achieved by weighting of co-occurrence counts."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-11",
"text": "Another fundamental question in vector space design is the vector space dimensionality and what neighbor words should correspond to them."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-12",
"text": "Levy et al. (2015) propose optimisations for co-occurrence-based distributional models, using parameters adopted from predictive models (Mikolov et al., 2013) : shifting and context distribution smoothing."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-13",
"text": "Their experiments and thus their parameter recommendations use highdimensional vector spaces with word vector dimensionality of almost 200K, and many recent state-of-the-art results in lexical distributional semantics have been obtained using vectors with similarly high dimensionality Kiela and Clark, 2014; Lapesa and Evert, 2014) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-14",
"text": "In contrast, much work on compositional distributional semantics employs vectors with much fewer dimensions: e.g. 2K (Grefenstette and Sadrzadeh, 2011; , 3K (Dinu and Lapata, 2010; or 10K (Polajnar and Clark, 2014; Baroni and Zamparelli, 2010) ."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-15",
"text": "The most common reason thereof is that these models assign tensors to functional words."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-16",
"text": "For a vector space V with k dimensions, a tensor V \u2297V \u00b7 \u00b7 \u00b7\u2297V of rank n has k n dimensions."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-17",
"text": "Adjectives and intransitive verbs have tensors of rank 2, transitive verbs are of rank 3; for coordinators, the rank can go up to 7."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-18",
"text": "Taking k = 200K already results in a highly intractable tensor of 8 \u00d7 10 15 dimensions for a transitive verb."
},
{
"sent_id": "6a054953660e465151e4d8a2223a76-C001-19",
"text": "An alternative way of obtaining a vector space with few dimensions, usually with just 100-500, is the use of SVD as a part of Latent Semantic Analysis (Dumais, 2004) or another models such as SGNS (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) ."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"6a054953660e465151e4d8a2223a76-C001-21"
],
[
"6a054953660e465151e4d8a2223a76-C001-41"
],
[
"6a054953660e465151e4d8a2223a76-C001-95"
],
[
"6a054953660e465151e4d8a2223a76-C001-99"
],
[
"6a054953660e465151e4d8a2223a76-C001-101"
],
[
"6a054953660e465151e4d8a2223a76-C001-112"
],
[
"6a054953660e465151e4d8a2223a76-C001-121"
]
],
"cite_sentences": [
"6a054953660e465151e4d8a2223a76-C001-21",
"6a054953660e465151e4d8a2223a76-C001-41",
"6a054953660e465151e4d8a2223a76-C001-95",
"6a054953660e465151e4d8a2223a76-C001-99",
"6a054953660e465151e4d8a2223a76-C001-101",
"6a054953660e465151e4d8a2223a76-C001-112",
"6a054953660e465151e4d8a2223a76-C001-121"
]
},
"@MOT@": {
"gold_contexts": [
[
"6a054953660e465151e4d8a2223a76-C001-22",
"6a054953660e465151e4d8a2223a76-C001-25"
],
[
"6a054953660e465151e4d8a2223a76-C001-52"
]
],
"cite_sentences": [
"6a054953660e465151e4d8a2223a76-C001-25",
"6a054953660e465151e4d8a2223a76-C001-52"
]
},
"@DIF@": {
"gold_contexts": [
[
"6a054953660e465151e4d8a2223a76-C001-107"
]
],
"cite_sentences": [
"6a054953660e465151e4d8a2223a76-C001-107"
]
},
"@EXT@": {
"gold_contexts": [
[
"6a054953660e465151e4d8a2223a76-C001-112",
"6a054953660e465151e4d8a2223a76-C001-113"
]
],
"cite_sentences": [
"6a054953660e465151e4d8a2223a76-C001-112"
]
}
}
},
"ABC_9885b924f6b0806844d4e70d857a35_9": {
"x": [
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-40",
"text": "The datasets are pre-partitioned into training, development and test sets, and rebuilt from the original version to include mention information."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-41",
"text": "The first two datasets were constructed to contain mostly English messages."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-42",
"text": "GEOTEXT consists of tweets from 9.5K users: 1895 users are held out for each of development and test data."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-43",
"text": "The primary location of each user is set to the coordinates of their first tweet."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-44",
"text": "TWITTER-US consists of 449K users, of which 10K users are held out for each of development and test data."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-45",
"text": "The primary location of each user is, once again, set to the coordinates of their first tweet."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-46",
"text": "TWITTER-WORLD consists of 1.3M users, of which 10000 each are held out for development and test."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-47",
"text": "Unlike the other two datasets, the primary location of users is mapped to the geographic centre of the city where the majority of their tweets were posted."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-49",
"text": "**METHODS**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-50",
"text": "We use label propagation over an @-mention graph in our models."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-51",
"text": "We use k-d tree descretised adaptive grids as class labels for users and learn a label distribution for each user by label propagation over the @-mention network using labelled nodes as seeds."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-52",
"text": "For k-d tree discretisation, we set the number of users in each region to 50, 2400, 2400 for GEOTEXT, TWITTER-US and TWITTER-WORLD respectively, based on tuning over the development data."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-53",
"text": "Social Network: We used the @-mention information to build an undirected graph between users."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-2",
"text": "We propose a label propagation approach to geolocation prediction based on Modified Adsorption, with two enhancements:"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-3",
"text": "(1) the removal of \"celebrity\" nodes to increase location homophily and boost tractability; and (2) the incorporation of text-based geolocation priors for test users."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-4",
"text": "Experiments over three Twitter benchmark datasets achieve state-of-theart results, and demonstrate the effectiveness of the enhancements."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-7",
"text": "Geolocation of social media users is essential in applications ranging from rapid disaster response (Earle et al., 2010; Ashktorab et al., 2014; Morstatter et al., 2013a) and opinion analysis (Mostafa, 2013; Kirilenko and Stepchenkova, 2014) , to recommender systems (Noulas et al., 2012; Schedl and Schnitzer, 2014) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-8",
"text": "Social media platforms like Twitter provide support for users to declare their location manually in their text profile or automatically with GPS-based geotagging."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-9",
"text": "However, the text-based profile locations are noisy and only 1-3% of tweets are geotagged (Cheng et al., 2010; Morstatter et al., 2013b) , meaning that geolocation needs to be inferred from other information sources such as the tweet text and network relationships."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-10",
"text": "User geolocation is the task of inferring the primary (or \"home\") location of a user from available sources of information, such as text posted by that individual, or network relationships with other individuals (Han et al., 2014) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-11",
"text": "Geolocation models are usually trained on the small set of users whose location is known (e.g. through GPS-based geotagging), and other users are geolocated using the resulting model."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-12",
"text": "These models broadly fall into two categories: text-based and network-based methods."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-13",
"text": "Orthogonally, the geolocation task can be viewed as a regression task over real-valued geographical coordinates, or a classification task over discretised region-based locations."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-14",
"text": "Most previous research on user geolocation has focused either on text-based classification approaches (Eisenstein et al., 2010; Wing and Baldridge, 2011; Roller et al., 2012; Han et al., 2014) or, to a lesser extent, network-based regression approaches (Jurgens, 2013; Compton et al., 2014; Rahimi et al., 2015) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-15",
"text": "Methods which combine the two, however, are rare."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-16",
"text": "In this paper, we present work on Twitter user geolocation using both text and network information."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-17",
"text": "Our contributions are as follows: (1) we propose the use of Modified Adsorption (Talukdar and Crammer, 2009) as a baseline networkbased geolocation model, and show that it outperforms previous network-based approaches (Jurgens, 2013; Rahimi et al., 2015) ; (2) we demonstrate that removing \"celebrity\" nodes (nodes with high in-degrees) from the network increases geolocation accuracy and dramatically decreases network edge size; and (3) we integrate textbased geolocation priors into Modified Adsorption, and show that our unified geolocation model outperforms both text-only and network-only approaches, and achieves state-of-the-art results over three standard datasets."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-18",
"text": "----------------------------------"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-19",
"text": "**RELATED WORK**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-20",
"text": "A recent spike in interest on user geolocation over social media data has resulted in the development of a range of approaches to automatic geolocation prediction, based on information sources such as the text of messages, social networks, user profile data, and temporal data."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-21",
"text": "Text-based methods model the geographical bias of language use in social media, and use it to geolocate non-geotagged users."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-22",
"text": "Gazetted expressions (Leidner and Lieberman, 2011) and geographical names (Quercini et al., 2010) were used as feature in early work, but were shown to be sparse in coverage."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-23",
"text": "Han et al. (2014) used information-theoretic methods to automatically extract location-indicative words for location classification."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-24",
"text": "reported that discriminative approaches (based on hierarchical classification over adaptive grids), when optimised properly, are superior to explicit feature selection."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-25",
"text": "Cha et al. (2015) showed that sparse coding can be used to effectively learn a latent representation of tweet text to use in user geolocation."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-26",
"text": "Eisenstein et al. (2010) and Ahmed et al. (2013) proposed topic modelbased approaches to geolocation, based on the assumption that words are generated from hidden topics and geographical regions."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-27",
"text": "Similarly, Yuan et al. (2013) used graphical models to jointly learn spatio-temporal topics for users."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-28",
"text": "The advantage of these generative approaches is that they are able to work with the continuous geographical space directly without any pre-discretisation, but they are algorithmically complex and don't scale well to larger datasets."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-29",
"text": "Hulden et al. (2015) used kernelbased methods to smooth linguistic features over very small grid sizes to alleviate data sparseness."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-30",
"text": "Network-based geolocation models, on the other hand, utilise the fact that social media users interact more with people who live nearby."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-31",
"text": "Jurgens (2013) and Compton et al. (2014) used a Twitter reciprocal mention network, and geolocated users based on the geographical coordinates of their friends, by minimising the weighted distance of a given user to their friends."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-32",
"text": "For a reciprocal mention network to be effective, however, a huge amount of Twitter data is required."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-33",
"text": "Rahimi et al. (2015) showed that this assumption could be relaxed to use an undirected mention network for smaller datasets, and still attain state-of-theart results."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-34",
"text": "The greatest shortcoming of networkbased models is that they completely fail to geolocate users who are not connected to geolocated components of the graph."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-35",
"text": "As shown by Rahimi et al. (2015) , geolocation predictions from text can be used as a backoff for disconnected users, but there has been little work that has investigated a more integrated text-and network-based approach to user geolocation."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-37",
"text": "**DATA**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-38",
"text": "We evaluate our models over three pre-existing geotagged Twitter datasets: (1) GEOTEXT (Eisen-stein et al., 2010), (2) TWITTER-US (Roller et al., 2012) , and (3) TWITTER-WORLD (Han et al., 2012) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-39",
"text": "In each dataset, users are represented by a single meta-document, generated by concatenating their tweets."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-54",
"text": "In order to make the inference more tractable, we removed all nodes that were not a member of the training/test set, and connected all pairings of training/test users if there was any path between them (including paths through non training/test users)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-55",
"text": "We call this network a \"collapsed network\", as illustrated in Figure 1 . Note that a celebrity node with n mentions connects n(n \u2212 1) nodes in the collapsed network."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-56",
"text": "We experiment with both binary and weighted edge (based on the number of mentions connecting the given users) networks."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-57",
"text": "Baseline: Our baseline geolocation model (\"MAD-B\") is formulated as label propagation over a binary collapsed network, based on Modified Adsorption (Talukdar and Crammer, 2009) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-58",
"text": "It applies to a graph G = (V, E, W ) where V is the set of nodes with |V | = n = n l + n u (where n l nodes are labelled and n u nodes are unlabelled), E is the set of edges, and W is an edge weight matrix."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-59",
"text": "Assume C is the set of labels where |C| = m is the total number of labels."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-60",
"text": "Y is an n \u00d7 m matrix storing the training node labels, and Y is the estimated label distribution for the nodes."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-61",
"text": "The goal is to estimate\u0176 for all nodes (including training nodes) so that the following objective function is minimised:"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-62",
"text": "where \u00b5 1 and \u00b5 2 are hyperparameters; 1 L is the Laplacian of an undirected graph derived from G; and S is a diagonal binary matrix indicating if a node is labelled or not."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-63",
"text": "The first term of the equation forces the labelled nodes to keep their label (prior term), while the second term pulls a node's label toward that of its neighbours 1 In the base formulation of MAD-B, there is also a regularisation term with weight \u00b53, but in all our experiments, we found that the best results were achieved over development data with \u00b53 = 0, i.e. with no regularisation; the term is thus omitted from our description."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-64",
"text": "(smoothness term)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-65",
"text": "For the first term, the label confidence for training and test users is set to 1.0 and 0.0, respectively."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-66",
"text": "Based on the development data, we set \u00b5 1 and \u00b5 2 to 1.0 and 0.1, respectively, for all the experiments."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-67",
"text": "For TWITTER-US and TWITTER-WORLD, the inference was intractable for the default network, as it was too large."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-68",
"text": "There are two immediate issues with the baseline graph propagation method: (1) it doesn't scale to large datasets with high edge counts, related to which, it tends to be biased by highly-connected nodes; and (2) it can't predict the geolocation of test users who aren't connected to any training user (MAD-B returns Unknown, which we rewrite with the centre of the map)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-69",
"text": "We redress these two issues as follows."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-70",
"text": "Celebrity Removal To address the first issue, we target \"celebrity\" users, i.e. highly-mentioned Twitter users."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-71",
"text": "Edges involving these users often carry little or no geolocation information (e.g. the majority of people who mention Barack Obama don't live in Washington D.C.)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-72",
"text": "Additionally, these users tend to be highly connected to other users and generate a disproportionately high number of edges in the graph, leading in large part to the baseline MAD-B not scaling over large datasets such as TWITTER-US and TWITTER-WORLD."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-73",
"text": "We identify and filter out celebrity nodes simply by assuming that a celebrity is mentioned by more than T users, where T is tuned over development data."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-74",
"text": "Based on tuning over the development Table 1 : Geolocation results over the three Twitter corpora, comparing baseline Modified Adsorption (MAD-B), with Modified Adsorption with celebrity removal (MADCEL-B and MADCEL-W, over binary and weighted networks, resp.) or celebrity removal plus text priors (MADCEL-B-LR and MADCEL-W-LR, over binary and weighted networks, resp.); the table also includes state-of-the-art results for each dataset (\"-\" signifies that no results were published for the given dataset; \"???\" signifies that no results were reported for the given metric; and \"\u00d7\u00d7\u00d7\" signifies that results could not be generated, due to the intractability of the training data)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-75",
"text": "set of GEOTEXT and TWITTER-US, T was set to 5 and 15 respectively."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-76",
"text": "For TWITTER-WORLD tuning was very resource intensive so T was set to 5 based on GEOTEXT, to make the inference faster."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-77",
"text": "Celebrity removal dramatically reduced the edge count in all three datasets (from 1 \u00d7 10 9 to 5 \u00d7 10 6 for TWITTER-US and from 4 \u00d7 10 10 to 1 \u00d7 10 7 for TWITTER-WORLD), and made inference tractable for TWITTER-US and TWITTER-WORLD."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-78",
"text": "Jurgens et al. (2015) report that the time complexity of most network-based geolocation methods is O(k 2 ) for each node where k is the average number of vertex neighbours."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-79",
"text": "In the case of the collapsed network of TWITTER-WORLD, k is decreased by a factor of 4000 after setting the celebrity threshold T to 5."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-80",
"text": "We apply celebrity removal over both binary (\"MADCEL-B\") and weighted (\"MADCEL-W\") networks (using the respective T for each dataset)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-81",
"text": "The effect of celebrity removal over the development set of TWITTER-US is shown in Figure 2 where it dramatically reduces the graph edge size and simultaneously leads to an improvement in the mean error."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-82",
"text": "A Unified Geolocation Model To address the issue of disconnected test users, we incorporate text information into the model by attaching a labelled dongle node to every test node (Zhu and Ghahramani, 2002; Goldberg and Zhu, 2006) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-83",
"text": "The label for the dongle node is based on a textbased l 1 regularised logistic regression model, using the method of Rahimi et al. (2015) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-84",
"text": "The dongle nodes with their corresponding label confidences are added to the seed set, and are treated in the same way as other labelled nodes (i.e. the training nodes)."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-85",
"text": "Once again, we experiment with text-based labelled dongle nodes over both binary (\"MADCEL-B-LR\") and weighted (\"MADCEL-W-LR\") networks."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-86",
"text": "Following Cheng et al. (2010) and Eisenstein et al. (2010) , we evaluate using the mean and median error (in km) over all test users (\"Mean\" and \"Median\", resp.), and also accuracy within 161km of the actual location (\"Acc@161\")."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-87",
"text": "Note that higher numbers are better for Acc@161, but lower numbers are better for mean and median error, with a lower bound of 0 and no (theoretical) upper bound."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-88",
"text": "To generate a continuous-valued latitude/longitude coordinate for a given user from the k-d tree cell, we use the median coordinates of all training points in the predicted region."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-89",
"text": "Table 1 shows the performance of MAD-B, MADCEL-B, MADCEL-W, MADCEL-B-LR and MADCEL-W-LR over the GEOTEXT, TWITTER-US and TWITTER-WORLD datasets."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-90",
"text": "The results are also compared with prior work on network-based geolocation using label propagation (LP) (Rahimi et al., 2015) , text-based classification models (Han et al., 2012; Wing and Baldridge, 2011; Rahimi et al., 2015; Cha et al., 2015) , textbased graphical models (Ahmed et al., 2013) , and network-text hybrid models (LP-LR) (Rahimi et al., 2015) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-92",
"text": "**RESULTS**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-93",
"text": "Our baseline network-based model of MAD-B outperforms the text-based models and also previous network-based models (Jurgens, 2013; Compton et al., 2014; Rahimi et al., 2015) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-94",
"text": "The inference, however, is intractable for TWITTER-US and TWITTER-WORLD due to the size of the network."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-95",
"text": "Celebrity removal in MADCEL-B and MADCEL-W has a positive effect on geolocation accuracy, and results in a 47% reduction in Median over GEOTEXT."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-96",
"text": "It also makes graph inference over TWITTER-US and TWITTER-WORLD tractable, and results in superior Acc@161 and Median, but slightly inferior Mean, compared to the state-of-the-art results of LR, based on text-based classification (Rahimi et al., 2015) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-97",
"text": "MADCEL-W (weighted graph) outperforms MADCEL-B (binary graph) over the smaller GEOTEXT dataset where it compensates for the sparsity of network information, but doesn't improve the results for the two larger datasets where network information is denser."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-98",
"text": "Adding text to the network-based geolocation models in the form of MADCEL-B-LR (binary edges) and MADCEL-W-LR (weighted edges), we achieve state-of-the-art results over all three datasets."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-99",
"text": "The inclusion of text-based priors has the greatest impact on Mean, resulting in an additional 26% and 23% error reduction over TWITTER-US and TWITTER-WORLD, respectively."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-100",
"text": "The reason for this is that it provides a user-specific geolocation prior for (relatively) disconnected users."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-102",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-103",
"text": "We proposed a label propagation method over adaptive grids based on collapsed @-mention networks using Modified Adsorption, and successfully supplemented the baseline algorithm by: (a) removing \"celebrity\" nodes (improving the results and also making inference more tractable); and (b) incorporating text-based geolocation priors into the model."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-104",
"text": "As future work, we plan to use temporal data and also look at improving the text-based geolocation model using sparse coding (Cha et al., 2015) ."
},
{
"sent_id": "9885b924f6b0806844d4e70d857a35-C001-105",
"text": "We also plan to investigate more nuanced methods for differentiating between global and local celebrity nodes, to be able to filter out global celebrity nodes but preserve local nodes that can have high geolocation utility."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"9885b924f6b0806844d4e70d857a35-C001-14"
],
[
"9885b924f6b0806844d4e70d857a35-C001-35"
]
],
"cite_sentences": [
"9885b924f6b0806844d4e70d857a35-C001-14",
"9885b924f6b0806844d4e70d857a35-C001-35"
]
},
"@MOT@": {
"gold_contexts": [
[
"9885b924f6b0806844d4e70d857a35-C001-14",
"9885b924f6b0806844d4e70d857a35-C001-15"
]
],
"cite_sentences": [
"9885b924f6b0806844d4e70d857a35-C001-14"
]
},
"@DIF@": {
"gold_contexts": [
[
"9885b924f6b0806844d4e70d857a35-C001-17"
],
[
"9885b924f6b0806844d4e70d857a35-C001-93"
],
[
"9885b924f6b0806844d4e70d857a35-C001-96"
]
],
"cite_sentences": [
"9885b924f6b0806844d4e70d857a35-C001-17",
"9885b924f6b0806844d4e70d857a35-C001-93",
"9885b924f6b0806844d4e70d857a35-C001-96"
]
},
"@USE@": {
"gold_contexts": [
[
"9885b924f6b0806844d4e70d857a35-C001-17"
],
[
"9885b924f6b0806844d4e70d857a35-C001-38"
],
[
"9885b924f6b0806844d4e70d857a35-C001-83"
]
],
"cite_sentences": [
"9885b924f6b0806844d4e70d857a35-C001-17",
"9885b924f6b0806844d4e70d857a35-C001-38",
"9885b924f6b0806844d4e70d857a35-C001-83"
]
},
"@SIM@": {
"gold_contexts": [
[
"9885b924f6b0806844d4e70d857a35-C001-90"
]
],
"cite_sentences": [
"9885b924f6b0806844d4e70d857a35-C001-90"
]
}
}
},
"ABC_4cc18724e62db32e748838080cbfd0_9": {
"x": [
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-2",
"text": "Deep neural networks reach state-of-the-art performance for wide range of natural language processing, computer vision and speech applications."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-3",
"text": "Yet, one of the biggest challenges is running these complex networks on devices such as mobile phones or smart watches with tiny memory footprint and low computational capacity."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-4",
"text": "We propose on-device Self-Governing Neural Networks (SGNNs), which learn compact projection vectors with local sensitive hashing."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-5",
"text": "The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-6",
"text": "We conduct extensive evaluation on dialog act classification and show significant improvement over state-of-the-art results."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-7",
"text": "Our findings show that SGNNs are effective at capturing low-dimensional semantic text representations, while maintaining high accuracy."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-10",
"text": "Deep neural networks are one of the most successful machine learning methods outperforming many state-of-the-art machine learning methods in natural language processing (Sutskever et al., 2014) , speech and visual recognition tasks (Krizhevsky et al., 2012) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-11",
"text": "The availability of high performance computing has enabled research in deep learning to focus largely on the development of deeper and more complex network architectures for improved accuracy."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-54",
"text": "Instead feature-ids and projection vectors are dynamically computed via hash functions."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-12",
"text": "However, the increased complexity of the deep neural networks has become one of the biggest obstacles to deploy deep neural networks ondevice such as mobile phones, smart watches and IoT (Iandola et al., 2016) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-13",
"text": "The main challenges with developing and deploying deep neural network models on-device are (1) the tiny memory footprint, (2) inference latency and (3) significantly low computational capacity compared to high performance computing systems such as CPUs, GPUs and TPUs on the cloud."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-14",
"text": "There are multiple strategies to build lightweight text classification models for ondevice."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-15",
"text": "One can create a small dictionary of common input \u2192 category mapping on the device and use a naive look-up at inference time."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-16",
"text": "However, such an approach does not scale to complex natural language tasks involving rich vocabularies and wide language variability."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-17",
"text": "Another strategy is to employ fast sampling techniques (Ahmed et al., 2012; Ravi, 2013) or incorporate deep learning models with graph learning like (Bui et al., 2017 (Bui et al., , 2018 , which result in large models but have proven to be extremely powerful for complex language understanding tasks like response completion (Pang and Ravi, 2012) and Smart Reply (Kannan et al., 2016) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-18",
"text": "In this paper, we propose Self-Governing Neural Networks (SGNNs) inspired by projection networks (Ravi, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-19",
"text": "SGNNs are on-device deep learning models learned via embedding-free projection operations."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-20",
"text": "We employ a modified version of the locality sensitive hashing (LSH) to reduce input dimension from millions of unique words/features to a short, fixed-length sequence of bits."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-21",
"text": "This allows us to compute a projection for an incoming text very fast, on-the-fly, with a small memory footprint on the device since we do not need to store the incoming text and word embeddings."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-22",
"text": "We evaluate the performance of our SGNNs on Dialogue Act classification, because (1) it is an important step towards dialog interpretation and conversational analysis aiming to understand the intent of the speaker at every utterance of the conversation and (2) deep learning methods reached state-of-the-art (Lee and Dernoncourt, 2016; Khanpour et al., 2016; Tran et al., 2017; Ortega and Vu, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-23",
"text": "The main contributions of the paper are:"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-24",
"text": "\u2022 Novel Self-Governing Neural Networks (SGNNs) for on-device deep learning for short text classification."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-25",
"text": "\u2022 Compression technique that effectively captures low-dimensional semantic text representation and produces compact models that save on storage and computational cost."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-26",
"text": "\u2022 On the fly computation of projection vectors that eliminate the need for large pre-trained word embeddings or vocabulary pruning."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-27",
"text": "\u2022 Exhaustive experimental evaluation on dialog act datasets, outperforming state-of-theart deep CNN (Lee and Dernoncourt, 2016) and RNN variants (Khanpour et al., 2016; Ortega and Vu, 2017 )."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-29",
"text": "**SELF-GOVERNING NEURAL NETWORKS**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-30",
"text": "We model the Self-Governing network using a projection model architecture (Ravi, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-31",
"text": "The projection model is a simple network with dynamically-computed layers that encodes a set of efficient-to-compute operations which can be performed directly on device for inference."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-32",
"text": "The model defines a set of efficient \"projection\" functions P( x i ) that project each input instance x i to a different space \u2126 P and then performs learning in this space to map it to corresponding outputs y p i ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-33",
"text": "A very simple projection model comprises just few operations where the inputs x i are transformed using a series of T projection functions P 1 , ..., P T followed by a single layer of activations."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-35",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-36",
"text": "In this work, we design a Self-Governing Neural Network (SGNN) using multi-layered localitysensitive projection model."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-37",
"text": "Figure 1 shows the model architecture of the on-device SGNN network."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-38",
"text": "The self-governing property of this network stems from its ability to learn a model (e.g., a classifier) without having to initialize, load or store any feature or vocabulary weight matrices."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-39",
"text": "In this sense, our method is a truly embedding-free approach unlike majority of the widely-used stateof-the-art deep learning techniques in NLP whose performance depends on embeddings pre-trained on large corpora."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-40",
"text": "Instead, we use the projection functions to dynamically transform each input to a low-dimensional representation."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-41",
"text": "Furthermore, we stack this with additional layers and non-linear activations to achieve deep, non-linear combinations of projections that permit the network to learn complex mappings from inputs x i to outputs y i ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-42",
"text": "An SGNN network is shown below:"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-43",
"text": "where, i p refers to the output of projection operation applied to input x i , h p is applied to projection output, h t is applied at intermediate layers of the network with depth k followed by a final softmax activation layer at the top."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-44",
"text": "In a k-layer SGNN, h t , where t = p, p + 1, ..., p + k \u2212 1 refers to the k subsequent layers after the projection layer."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-45",
"text": "W p , W t , W o and b p , b t , b o represent trainable weights and biases respectively."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-46",
"text": "The projection transformations use precomputed parameterized functions, i.e., they are not trained during the learning process, and their outputs are concatenated to form the hidden units for subsequent operations."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-47",
"text": "Each input text x i is converted to an intermediate feature vector (via raw text features such as skip-grams) followed by projections."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-48",
"text": "On-the-fly Computation."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-49",
"text": "The transformation step F dynamically extracts features from the raw input."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-50",
"text": "Text features (e.g., skip-grams) are converted into feature-ids f j (via hashing) to generate a sparse feature representation x i of feature-id, weight pairs (f j , w j ) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-51",
"text": "This intermediate feature representation is passed through projection functions P to construct projection layer i p in SGNN."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-52",
"text": "For this last step, a projection vector P k is first constructed on-the-fly using a hash function with feature ids f j in x i and fixed seed as input, then dot product of the two vectors < x i , P k > is computed and transformed into binary representation P k ( x i ) using sgn(.) of the dot product."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-53",
"text": "As shown in Figure 1 , both F and P steps are computed on-the-fly, i.e., no word-embedding or vocabulary/feature matrices need to be stored and looked up during training or inference."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-55",
"text": "For intermediate feature weights w j , we use observed counts in each input text and do not use pre-computed statistics like idf."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-56",
"text": "Hence the method is embedding-free."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-57",
"text": "Model Optimization."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-58",
"text": "The SGNN network is trained from scratch on the task data using a supervised loss defined wrt ground truth\u0177 i :"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-59",
"text": "During training, the network learns to choose and apply specific projection operations P j (via activations) that are more predictive for a given task."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-60",
"text": "The choice of the type of projection matrix P as well as representation of the projected space \u2126 P has a direct effect on computation cost and model size."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-61",
"text": "We leverage an efficient randomized projection method and use a binary representation {0, 1} d for \u2126 P ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-62",
"text": "This yields a drastically lower memory footprint both in terms of number and size of parameters."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-63",
"text": "Computing Projections."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-64",
"text": "We employ an efficient randomized projection method for the projection step."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-65",
"text": "We use locality sensitive hashing (LSH) (Charikar, 2002) to model the underlying projection operations in SGNN."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-66",
"text": "LSH is typically used as a dimensionality reduction technique for clustering (Manning et al., 2008) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-67",
"text": "LSH allows us to project similar inputs x i or intermediate network layers into hidden unit vectors that are nearby in metric space."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-68",
"text": "We use repeated binary hashing for P and apply the projection vectors to transform the input x i to a binary hash representation denoted by P k ( x i ) \u2208 {0, 1}, where"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-69",
"text": "This results in a dbit vector representation, one bit corresponding to each projection row P k=1...d ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-70",
"text": "The same projection matrix P is used for training and inference."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-71",
"text": "We never need to explicitly store the random projection vector P k since we can compute them on the fly using hash functions over feature indices with a fixed row seed rather than invoking a random number generator."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-72",
"text": "This also permits us to perform projection operations that are linear in the observed feature size rather than the overall feature or vocabulary size which can be prohibitively large for high-dimensional data, thereby saving both memory and computation cost."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-73",
"text": "Thus, SGNN can efficiently model highdimensional sparse inputs and large vocabulary sizes common for text applications instead of relying on feature pruning or other pre-processing heuristics employed to restrict input sizes in standard neural networks for feasible training."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-74",
"text": "The binary representation is significant since this results in a significantly compact representation for the projection network parameters that in turn considerably reduces the model size."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-75",
"text": "SGNN Parameters."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-76",
"text": "In practice, we employ T different projection functions P j=1...T , each resulting in d-bit vector that is concatenated to form the projected vector i p in Equation 5."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-77",
"text": "T and d vary depending on the projection network parameter configuration specified for P and can be tuned to trade-off between prediction quality and model size."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-78",
"text": "Note that the choice of whether to use a single projection matrix of size T \u00b7 d or T separate matrices of d columns depends on the type of projection employed (dense or sparse)."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-79",
"text": "For the intermediate feature step F in Equation 5, we use skip-gram features (3-grams with skip-size=2) extracted from raw text."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-81",
"text": "**TRAINING AND INFERENCE**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-82",
"text": "We use the compact bit units to represent the projection in SGNN."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-83",
"text": "During training, the network learns to move the gradients for points that are nearby to each other in the projected bit space \u2126 P in the same direction."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-84",
"text": "SGNN network is trained end-to-end using backpropagation."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-85",
"text": "Training can progress efficiently with stochastic gradient descent with distributed computing on highperformance CPUs or GPUs."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-86",
"text": "Complexity."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-87",
"text": "The overall complexity for SGNN inference, governed by the projection layer, is O(n \u00b7 T \u00b7 d), where n is the observed feature size (*not* overall vocabulary size) which is linear in input size, d is the number of LSH bits specified for each projection vector P k , and T is the number of projection functions used in P. The model size (in terms of number of parameters) and memory storage required for the projection inference step is O(T \u00b7 d \u00b7 C), where C is the number of hidden units in h p in the multi-layer projection network and typically smaller than T \u00b7 d."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-89",
"text": "**DATASETS AND EXPERIMENTAL SETUP 3.1 DATA DESCRIPTION**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-90",
"text": "We conduct our experimental evaluation on two dialog act benchmark datasets."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-91",
"text": "\u2022 SWDA: Switchboard Dialog Act Corpus (Godfrey et al., 1992; Jurafsky et al., 1997) is a popular open domain dialogs corpus between two speakers with 42 dialogs acts."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-92",
"text": "\u2022 MRDA: ICSI Meeting Recorder Dialog Act Corpus (Adam et al., 2003; Shriberg et al., 2004 ) is a dialog corpus of multiparty meetings with 5 tags of dialog acts."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-93",
"text": "Table 1 summarizes dataset statistics."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-94",
"text": "We use the train, validation and test splits as defined in (Lee and Dernoncourt, 2016; Ortega and Vu, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-96",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-97",
"text": "We setup our experimental evaluation, as follows: given a classification task and a dataset, we generate an on-device model."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-98",
"text": "The size of the model can be configured (by adjusting the projection matrix P) to fit in the memory footprint of the device, i.e. a phone has more memory compared to a smart watch."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-99",
"text": "For each classification task, we report Accuracy on the test set."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-101",
"text": "**HYPERPARAMETER AND TRAINING**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-102",
"text": "For both datasets we used the following: 2-layer SGNN (P T =80,d=14 \u00d7 FullyConnected 256 \u00d7 FullyConnected 256 ), mini-batch size of 100, dropout rate of 0.25, learning rate was initialized to 0.025 with cosine annealing decay (Loshchilov and Hutter, 2016) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-103",
"text": "Unlike prior approaches (Lee and Dernoncourt, 2016; Ortega and Vu, 2017 ) that rely on pre-trained word embeddings, we learn the projection weights on the fly during training, i.e word embeddings (or vocabularies) do not need to be stored."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-104",
"text": "Instead, features are computed on the fly and are dynamically compressed via the projection matrices into projection vectors."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-105",
"text": "These values were chosen via a grid search on development sets, we do not perform any other dataset-specific tuning."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-106",
"text": "Training is performed through stochastic gradient descent over shuffled mini-batches with Nesterov momentum optimizer (Sutskever et al., 2013) , run for 1M steps."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-107",
"text": "Tables 2 and 3 show results on the SwDA and MRDA dialog act datasets."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-108",
"text": "Overall, our SGNN model consistently outperforms the baselines and prior state-of-the-art deep learning models."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-110",
"text": "**RESULTS**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-112",
"text": "**BASELINES**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-113",
"text": "We compare our model against a majority class baseline and Naive Bayes classifier (Lee and Dernoncourt, 2016) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-114",
"text": "Our model significantly outperforms both baselines by 12 to 35% absolute."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-116",
"text": "**COMPARISON AGAINST STATE-OF-ART METHODS**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-117",
"text": "We also compare our performance against prior work using HMMs (Stolcke et al., 2000) and recent deep learning methods like CNN (Lee and Dernoncourt, 2016) , RNN (Khanpour et al., 2016) and RNN with gated attention (Tran et al., 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-118",
"text": "To the best of our knowledge, (Lee and Dernoncourt, 2016; Ortega and Vu, 2017; Tran et al., 2017) are the latest approaches in dialog act classification, which also reported on the same data splits."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-119",
"text": "Therefore, we compare our research against these works."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-120",
"text": "According to (Ortega and Vu, 2017) , prior work by (Ji and Bilmes, 2006) achieved promising results on the MRDA dataset, but since the evaluation was conducted on a different data split, it is hard to compare them directly."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-121",
"text": "For both SwDA and MRDA datasets, our SGNNs obtains the best result of 83.1 and 86.7 accuracy outperforming prior state-of-the-art work."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-122",
"text": "This is very impressive given that we work with very small memory footprint and we do not rely on pre-trained word embeddings."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-123",
"text": "Our study also shows that the proposed method is very effective for such natural language tasks compared to more complex neural network architectures such as deep CNN (Lee and Dernoncourt, 2016) and RNN variants (Khanpour et al., 2016; Ortega and Vu, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-124",
"text": "We believe that the compression techniques like locality sensitive projections jointly coupled with non-linear functions are effective at capturing lowdimensional semantic text representations that are useful for text classification applications."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-126",
"text": "**DISCUSSION ON MODEL SIZE AND INFERENCE**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-127",
"text": "LSTMs have millions of parameters, while our on-device architecture has just 300K parameters (order of magnitude lower)."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-128",
"text": "Most deep learning methods also use large vocabulary size of 10K or higher."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-129",
"text": "Each word embedding is represented as 100-dimensional vector leading to a storage requirement of 10, 000 \u00d7 100 parameter weights just in the first layer of the deep network."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-130",
"text": "In contrast, SGNNs in all our experiments use a fixed 1120-dimensional vector regardless of the vocabulary or feature size, dynamic computation results"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-132",
"text": "**METHOD**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-133",
"text": "Acc."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-134",
"text": "Majority Class (baseline) (Ortega and Vu, 2017) 33.7 Naive Bayes (baseline) (Khanpour et al., 2016) 47.3 HMM (Stolcke et al., 2000) 71.0 DRLM-conditional training (Ji and Bilmes, 2006) 77.0 DRLM-joint training (Ji and Bilmes, 2006) 74.0 LSTM (Lee and Dernoncourt, 2016) 69.9 CNN (Lee and Dernoncourt, 2016) 73.1 Gated-Attention&HMM (Tran et al., 2017) 74.2 RNN+Attention (Ortega and Vu, 2017) 73.8 RNN (Khanpour et al., 2016) 80.1 SGNN: Self-Governing Neural Network (ours) 83.1 (Ortega and Vu, 2017) 59.1 Naive Bayes (baseline) (Khanpour et al., 2016) 74.6 Graphical Model (Ji and Bilmes, 2006) 81.3 CNN (Lee and Dernoncourt, 2016) 84.6 RNN+Attention (Ortega and Vu, 2017) 84.3 RNN (Khanpour et al., 2016) 86.8 SGNN: Self-Governing Neural Network (ours) 86.7 Table 3 : MRDA Dataset Results in further speed up for high-dimensional feature spaces."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-135",
"text": "This amounts to a huge savings in storage and computation cost wrt FLOPs (floating point operations per second)."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-137",
"text": "**CONCLUSION**"
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-138",
"text": "We proposed Self-Governing Neural Networks for on-device short text classification."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-139",
"text": "Experiments on multiple dialog act datasets showed that our model outperforms state-of-the-art deep leaning methods (Lee and Dernoncourt, 2016; Khanpour et al., 2016; Ortega and Vu, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-140",
"text": "We introduced a compression technique that effectively captures low-dimensional semantic representation and produces compact models that significantly save on storage and computational cost."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-141",
"text": "Our approach does not rely on pre-trained embeddings and efficiently computes the projection vectors on the fly."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-142",
"text": "In the future, we are interested in extending this approach to more natural language tasks."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-143",
"text": "For instance, we built a multilingual SGNN model for customer feedback classification (Liu et al., 2017) and obtained 73% on Japanese, close to best performing system on the challenge (Plank, 2017) ."
},
{
"sent_id": "4cc18724e62db32e748838080cbfd0-C001-144",
"text": "Unlike their method, we did not use any pre-processing, tagging, parsing, pre-trained embeddings or other resources."
}
],
"y": {
"@SIM@": {
"gold_contexts": [
[
"4cc18724e62db32e748838080cbfd0-C001-22"
]
],
"cite_sentences": [
"4cc18724e62db32e748838080cbfd0-C001-22"
]
},
"@DIF@": {
"gold_contexts": [
[
"4cc18724e62db32e748838080cbfd0-C001-23",
"4cc18724e62db32e748838080cbfd0-C001-27"
],
[
"4cc18724e62db32e748838080cbfd0-C001-103"
],
[
"4cc18724e62db32e748838080cbfd0-C001-113",
"4cc18724e62db32e748838080cbfd0-C001-114"
],
[
"4cc18724e62db32e748838080cbfd0-C001-123"
],
[
"4cc18724e62db32e748838080cbfd0-C001-139"
]
],
"cite_sentences": [
"4cc18724e62db32e748838080cbfd0-C001-27",
"4cc18724e62db32e748838080cbfd0-C001-103",
"4cc18724e62db32e748838080cbfd0-C001-113",
"4cc18724e62db32e748838080cbfd0-C001-123",
"4cc18724e62db32e748838080cbfd0-C001-139"
]
},
"@USE@": {
"gold_contexts": [
[
"4cc18724e62db32e748838080cbfd0-C001-94"
],
[
"4cc18724e62db32e748838080cbfd0-C001-113"
],
[
"4cc18724e62db32e748838080cbfd0-C001-117"
]
],
"cite_sentences": [
"4cc18724e62db32e748838080cbfd0-C001-94",
"4cc18724e62db32e748838080cbfd0-C001-113",
"4cc18724e62db32e748838080cbfd0-C001-117"
]
},
"@BACK@": {
"gold_contexts": [
[
"4cc18724e62db32e748838080cbfd0-C001-118"
]
],
"cite_sentences": [
"4cc18724e62db32e748838080cbfd0-C001-118"
]
}
}
},
"ABC_5fe12a1a43957faded5722f698eb41_9": {
"x": [
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-2",
"text": "We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-3",
"text": "Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the stateof-the-art on two standard datasets used for the task."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-4",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-5",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-6",
"text": "Lexical simplification is the task of automatically rewriting a text by substituting words or phrases with simpler variants, while retaining its meaning and grammaticality."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-7",
"text": "The goal is to make the text easier to understand for children, language learners, people with cognitive disabilities and even machines."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-8",
"text": "Approaches to lexical simplification generally follow a standard pipeline consisting of two main steps: generation and ranking."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-9",
"text": "In the generation step, a set of possible substitutions for the target word is commonly created by querying semantic databases such as Wordnet (Devlin and Tait, 1998) , learning substitution rules from sentence-aligned parallel corpora of complex-simple texts (Horn et al., 2014; Paetzold and Specia, 2017) , and learning word embeddings from a large corpora to obtain similar words of the complex word (Glava\u0161 and\u0160tajner, 2015; Kim et al., 2016; Specia, 2016a, 2017) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-10",
"text": "In the ranking step, features that discriminate a substitution candidate from other substitution candidates are leveraged and the candidates are ranked with respect to their simplicity and contextual fitness."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-11",
"text": "* This research was conducted while the first author was a Post Doctoral Fellow at the City University of Hong Kong."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-12",
"text": "The ranking step is challenging because the substitution candidates usually have similar meaning to the target word, and thus share similar context features."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-13",
"text": "State-of-the-art approaches to ranking in lexical simplification exploit supervised machine learning-based methods that rely mostly on surface features, such as word frequency, word length and n-gram probability, for training the model (Horn et al., 2014; Bingel and S\u00f8gaard, 2016; Specia, 2016a, 2017) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-14",
"text": "Moreover, deep architectures are not explored in these models."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-15",
"text": "Surface features and shallow architectures might not be able to explore the features at different levels of abstractions that capture nuances that discriminate the candidates."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-16",
"text": "In this paper, we propose to use a Deep Structured Similarity Model (DSSM) (Huang et al., 2013) to rank substitution candidates."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-17",
"text": "The DSSM exploits a deep architecture by using a deep neural network (DNN), that can effectively capture contextual features to perform semantic matching between two sentences."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-18",
"text": "It has been successfully applied to several natural language processing (NLP) tasks, such as machine translation , web search ranking (Huang et al., 2013; Shen et al., 2014; Liu et al., 2015) , question answering , and image captioning (Fang et al., 2015) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-19",
"text": "To the best of our knowledge, this is the first time this model is applied to lexical simplification."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-20",
"text": "We adapt the original DSSM architecture and objective function to our specific task."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-21",
"text": "Our evaluation on two standard datasets for lexical simplification shows that this method outperforms state-of-the-art approaches that use supervised machine learning-based methods."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-22",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-23",
"text": "**METHOD**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-24",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-25",
"text": "**TASK DEFINITION**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-26",
"text": "We focus on the ranking step of the standard lexical simplification pipeline."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-27",
"text": "Given a dataset of tar-get words, their sentential contexts and substitution candidates for the target words, the goal is to train a model that accurately ranks the candidates based on their simplicity and semantic matching."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-28",
"text": "For generating substitution candidates, we utilize the method proposed by Paetzold and Specia (2017) , which was recently shown to be the state-of-art method for generating substitution candidates."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-29",
"text": "They exploit a hybrid substitution generation approach where candidates are first extracted from 550,644 simple-complex aligned sentences from the Newsela corpus (Xu et al., 2015) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-30",
"text": "Then, these candidates are complemented with candidates generated with a retrofitted word embedding model."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-31",
"text": "The word embedding model is retrofitted over WordNet's synonym pairs (for details, please refer to Paetzold and Specia (2017) )."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-32",
"text": "For ranking substitution candidates, we use a DSSM, which we elaborate in the next section."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-34",
"text": "**DSSM FOR RANKING SUBSTITUTION**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-35",
"text": "Candidates"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-40",
"text": "Figure 1: Architecture of the Deep Structured Similarity Model (DSSM): The input X (either a target word or a substitute candidate and their sentential contexts, T and S, respectively) is first represented as a bag of words, then hashed into letter 3-grams H. Non-linear projection W t generates the semantic representation of T and nonlinear projection W s constructs the semantic representation of S. Finally, the cosine similarity is adopted to measure the relevance between the T and S. At last, the posterior probabilities over all candidates are computed."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-41",
"text": "Compared to other latent semantic models, such as Latent Semantic Analysis (Deerwester et al., 1990) and its extensions, Deep Structured Similarity Model (also called Deep Semantic Similarity Model) or DSSM (Huang et al., 2013) can capture fine-grained local and global contextual features more effectively."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-42",
"text": "The DSSM is trained by optimizing a similarity-driven objective, by projecting the whole sentence to a continuous semantic space."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-43",
"text": "In addition, it is is built upon characters (rather than words) for scalability and generalizability (He, 2016) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-44",
"text": "Figure 1 shows the architecture of the model used in this work."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-45",
"text": "It consists of a typical DNN with a word hashing layer, a nonlinear projection layer, and an output layer."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-46",
"text": "Each component is described in the following: Word Hashing Layer: the input is first mapped from a high-dimentional one-hot word vector into a low-dimentional letter-trigram space (with the dimentionality as low as 5k), a method called word hashing (Huang et al., 2013) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-47",
"text": "For example, the word cat is hashed as the bag of letter trigram #-ca, c-a-t, a-t-#, where # is a boundary symbol (Liu et al., 2015) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-48",
"text": "The word hashing helps the model generalize better for out-of-vocabulary words and for spelling variants of the same word (Liu et al., 2015) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-49",
"text": "Nonlinear Projection Layer: This layers maps the substitution candidate and the target word in their sentential contexts, S and T respectively, which are represented as letter tri-grams, into ddimension semantic representations, S Sq and T Sq respectively:"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-50",
"text": "where the nonlinear activation tanh is defined as:"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-51",
"text": "1+e \u22122z ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-52",
"text": "Output Layer: This layer computes the semantic relevance score between S and T as:"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-54",
"text": "**FEATURES FOR DSSM**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-55",
"text": "As baseline features, we use the same n-gram probability features as in Paetzold and Specia (2017) , who also employ a neural network to rank substitution candidates."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-56",
"text": "As in Paetzold and Specia (2017) , the features were extracted using the SubIMDB corpus (Paetzold and Specia, 2015) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-57",
"text": "We also experiment with additional features that have been reported as useful in this task."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-58",
"text": "For each target word and a substitution candidate word we also compute: cosine similarity, word length, and alignment probability in the sentence-aligned Normal-Simple Wikipedia corpus (Kauchak, 2013) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-59",
"text": "The cosine similarity feature is computed using the SubIMDB corpus."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-61",
"text": "**IMPLEMENTATION DETAILS AND TRAINING PROCEDURE OF THE DSSM**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-62",
"text": "Following previous works that used supervised machine learning for ranking in lexical simplification (Horn et al., 2014; Paetzold and Specia, 2017) , we train the DSSM using the LexMTurk dataset (Horn et al., 2014) , which contains 500 instances composed of a sentence, a target word and substitution candidates ranked by simplicity (Paetzold and Specia, 2017) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-63",
"text": "In order to learn the parameters W t and W s (Figure 1 ) of the DSSM, we use the standard backpropagation algorithm (Rumelhart et al., 1988) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-64",
"text": "The objective used in this paper follows the pair-wise learning-to-rank paradigm outlined in (Burges et al., 2005) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-65",
"text": "Given a target word and its sentential context T , we obtain a list of candidates L. We set different positive values to the candidates based on their simplicity rankings."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-66",
"text": "E.g., if the list of the candidates is ordered by simplificity as, L = {A + > B + > C + }, the labels are first constructed as L = {y A + = 3, y B + = 2, y C + = 1}. The values are then normalized by dividing by the maximum value in the list: L = {y A + = 1, y B + = 0.667, y C + = 0.333}. If the target word was not originally in L, we add it with label 0."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-67",
"text": "This enables the model to reflect the label information in the similarity scores."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-68",
"text": "We minimize the Bayesian expected loss as:"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-69",
"text": "Note that P (S l |T ) is computed as:"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-70",
"text": "here, \u03b3 is a tuning factor."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-71",
"text": "We used 5-cross validation approach to select hyper-parameters, such as number of hidden nodes."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-72",
"text": "We set the gamma factor as 10 as per Huang et al. (2013) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-73",
"text": "The selected hyperparameters were used to train the model in the whole LexMTurk dataset."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-74",
"text": "We employ earlystopping and select the model whose change of the average loss in each epoch was smaller than 1.0e-3."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-75",
"text": "Since the training data is small (only 500 samples) we use a smaller number of hidden nodes, d = 32, in the nonlinear projection layer and adopt a higher dropout rate (0.4)."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-76",
"text": "The model is optimized using Adam (Kingma and Ba, 2014) with the learning rate fixed at 0.001, and is trained for 30 epochs."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-77",
"text": "The mini-batch is set to 16 during training."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-79",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-81",
"text": "**DATASETS AND EVALUATION METRICS**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-82",
"text": "To evaluate the proposed model, we conduct experiments on two common datasets for lexical simplification: BenchLS (Paetzold and Specia, 2016b) , which contains 929 instances, and NNSEval (Paetzold and Specia, 2016a) , which contains 239 instances."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-83",
"text": "Each instance is composed of a sentence, a target word, and a set of gold candidates ranked by simplicity (Paetzold and Specia, 2017) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-84",
"text": "Since both datasets contain instances from the LexMturk dataset (Horn et al., 2014) , which we use for training the DNN, we remove the overlap instances between training and test datasets 1 ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-85",
"text": "We finally obtain 429 remaining instances in the BenchLS dataset, and 78 instances in the NNEval dataset, which are used in our evaluation."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-86",
"text": "We adopt the same evaluation metrics featured in Glava\u0161 and\u0160tajner (2015) and Horn et al. (2014) : 1) precision: ratio of correct simplifications out of all the simplifications made by the system; 2) accuracy: ratio of correct simplifications out of all words that should have been simplified; and 3) changed: ratio of target words changed by the system."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-88",
"text": "**BASELINES**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-89",
"text": "We compared the proposed model (DSSM Ranking) to two state-of-the-art approaches to ranking in lexical simplification that exploit supervised machine learning-based methods."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-90",
"text": "The first baseline is the Neural Substitution Ranking (NSR) approach described in (Paetzold and Specia, 2017) , which employs a multi-layer perceptron neural network."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-91",
"text": "We reimplement their model as part of the LEXenstein toolkit (Paetzold and Specia, 2015) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-92",
"text": "The network has 3 hidden layers with 8 nodes each."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-93",
"text": "Unlike the proposed model, they treat ranking in lexical simplification as a standard classification problem."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-94",
"text": "The second baseline is SVM rank (Joachims, 2006) Table 1 : Substitution candidates ranking results."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-95",
"text": "n-gram probs."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-96",
"text": "denotes the n-gram probability features described in Paetzold and Specia (2017) , and all denotes all features described in Section 2.3."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-97",
"text": "All values marked in bold are significantly higher compared to the best baseline, SVM rank , measured by t-test at p-value of 0.05."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-98",
"text": "with default parameters) for ranking substitution candidates, similar to the method described in (Horn et al., 2014) ."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-99",
"text": "All the three models employ the n-gram probability features extracted from the SubIMDB corpus (Paetzold and Specia, 2015) , as described in (Paetzold and Specia, 2017) , and are trained using the LexMTurk dataset."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-101",
"text": "**RESULTS**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-102",
"text": "The top part of table 1 (Substitution Candidates Ranking) summarizes the results of all three systems."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-103",
"text": "Overall, both SVM rank and DSSM Ranking outperform the NSR Baseline."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-104",
"text": "The DSSM Ranking performs comparably to SVM rank when using only n-gram probabilities as features, and consistently leverages all features described in Section 2.3, outperforming all systems in accuracy, precision and changed ratio."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-105",
"text": "We experimented with adding all features described in Section 2.3 to the baselines as well, however, we obtained no improvements compared to using only n-gram probability features."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-106",
"text": "We also tried running all ranking systems on selected candidates that best replace the target word in the input sentence."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-107",
"text": "We follow the Unsupervised Boundary Ranking Substitution Selection method described in Paetzold and Specia (2017) , which ranks candidates according to how well they fit the context of the target word, and discards 50% of the worst ranking candidates."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-108",
"text": "The bottom part of the table 1 (Selection Step + Substitution Candidates Ranking) summarizes the results of all ranking systems after performing the selection step on generated substitution candidates."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-109",
"text": "We obtain similar tendency in the results, with the DSSM Ranking outperforming both baselines."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-110",
"text": "The results indicate the advantage of using a deep architecture, and of building a semantic representation of the whole sentence on top of the characters."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-111",
"text": "To illustrate by examples, Table 2 lists the top candidate ranked by the systems for different input sentences."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-112",
"text": "In the examples, the DSSM Ranking correctly ranked a substitute for the target word, while the two baselines either left the target word unchanged, or ranked an incorrect substitute."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-113",
"text": "----------------------------------"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-114",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-115",
"text": "We presented an effective method for ranking in lexical simplification."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-116",
"text": "We explored the application of a DSSM that builds a semantic representation of the whole sentence on top of characters."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-117",
"text": "The DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming state-of-art approaches that use supervised machine learning to ranking in lexical simplification."
},
{
"sent_id": "5fe12a1a43957faded5722f698eb41-C001-118",
"text": "For future work, we plan to examine and incorporate a larger feature set to the DSSM, as well as try other DSSM architectures, such as the Convolutional DSSM (C-DSSM) (Shen et al., 2014) ."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5fe12a1a43957faded5722f698eb41-C001-9"
],
[
"5fe12a1a43957faded5722f698eb41-C001-31"
]
],
"cite_sentences": [
"5fe12a1a43957faded5722f698eb41-C001-9",
"5fe12a1a43957faded5722f698eb41-C001-31"
]
},
"@USE@": {
"gold_contexts": [
[
"5fe12a1a43957faded5722f698eb41-C001-28"
],
[
"5fe12a1a43957faded5722f698eb41-C001-31"
],
[
"5fe12a1a43957faded5722f698eb41-C001-55"
],
[
"5fe12a1a43957faded5722f698eb41-C001-56"
],
[
"5fe12a1a43957faded5722f698eb41-C001-62"
],
[
"5fe12a1a43957faded5722f698eb41-C001-82",
"5fe12a1a43957faded5722f698eb41-C001-83"
],
[
"5fe12a1a43957faded5722f698eb41-C001-89",
"5fe12a1a43957faded5722f698eb41-C001-90"
],
[
"5fe12a1a43957faded5722f698eb41-C001-99"
],
[
"5fe12a1a43957faded5722f698eb41-C001-107"
]
],
"cite_sentences": [
"5fe12a1a43957faded5722f698eb41-C001-28",
"5fe12a1a43957faded5722f698eb41-C001-31",
"5fe12a1a43957faded5722f698eb41-C001-55",
"5fe12a1a43957faded5722f698eb41-C001-56",
"5fe12a1a43957faded5722f698eb41-C001-62",
"5fe12a1a43957faded5722f698eb41-C001-83",
"5fe12a1a43957faded5722f698eb41-C001-90",
"5fe12a1a43957faded5722f698eb41-C001-99",
"5fe12a1a43957faded5722f698eb41-C001-107"
]
},
"@SIM@": {
"gold_contexts": [
[
"5fe12a1a43957faded5722f698eb41-C001-107",
"5fe12a1a43957faded5722f698eb41-C001-109"
]
],
"cite_sentences": [
"5fe12a1a43957faded5722f698eb41-C001-107"
]
},
"@DIF@": {
"gold_contexts": [
[
"5fe12a1a43957faded5722f698eb41-C001-107",
"5fe12a1a43957faded5722f698eb41-C001-109"
]
],
"cite_sentences": [
"5fe12a1a43957faded5722f698eb41-C001-107"
]
}
}
},
"ABC_0bd3236100730487986ade49af24b9_9": {
"x": [
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-26",
"text": "Stance detection on a corpus of student essays is considered in [5] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-25",
"text": "e employed features include agreement, cue words, denial, hedges, duration, polarity, and punctuation [10] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-2",
"text": "Stance detection is a classi cation problem in natural language processing where for a text and target pair, a class result from the set {Favor, Against, Neither} is expected."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-3",
"text": "It is similar to the sentiment analysis problem but instead of the sentiment of the text author, the stance expressed for a particular target is investigated in stance detection."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-4",
"text": "In this paper, we present a stance detection tweet data set for Turkish comprising stance annotations of these tweets for two popular sports clubs as targets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-5",
"text": "Additionally, we provide the evaluation results of SVM classi ers for each target on this data set, where the classi ers use unigram, bigram, and hashtag features."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-6",
"text": "is study is signi cant as it presents one of the initial stance detection data sets proposed so far and the rst one for Turkish language, to the best of our knowledge."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-7",
"text": "e data set and the evaluation results of the corresponding SVM-based approaches will form plausible baselines for the comparison of future studies on stance detection."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-10",
"text": "Stance detection (also called stance identi cation or stance classication) is one of the considerably recent research topics in natural language processing (NLP)."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-11",
"text": "It is usually de ned as a classi cation problem where for a text and target pair, the stance of the author of the text for that target is expected as a classi cation output from the set: {Favor, Against, Neither} [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-15",
"text": "SIDEWAYS'17, Prague, Czech Republic Stance detection is usually considered as a subtask of sentiment analysis (opinion mining) [13] topic in NLP."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-16",
"text": "Both are mostly performed on social media texts, particularly on tweets, hence both are important components of social media analysis."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-17",
"text": "Nevertheless, in sentiment analysis, the sentiment of the author of a piece of text usually as Positive, Negative, and Neutral is explored while in stance detection, the stance of the author of the text for a particular target (an entity, event, etc.) either explicitly or implicitly referred to in the text is considered."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-18",
"text": "Like sentiment analysis, stance detection systems can be valuable components of information retrieval and other text analysis systems [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-19",
"text": "Previous work on stance detection include [16] where a stance classi er based on sentiment and arguing features is proposed in addition to an arguing lexicon automatically compiled."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-20",
"text": "e ultimate approach performs be er than distribution-based and unigram-based baseline systems [16] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-21",
"text": "In [17] , the authors show that the use of dialogue structure improves stance detection in on-line debates."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-22",
"text": "In [7] , Hasan and Ng carry out stance detection experiments using di erent machine learning algorithms, training data sets, features, and inter-post constraints in on-line debates, and draw insightful conclusions based on these experiments."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-23",
"text": "For instance, they nd that sequence models like HMMs perform be er at stance detection when compared with non-sequence models like Naive Bayes (NB) [7] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-24",
"text": "In another related study [10] , the authors conclude that topic-independent features can be exploited for disagreement detection in on-line dialogues."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-27",
"text": "A er using linguistically-motivated feature sets together with multivalued NB and SVM as the learning models, the authors conclude that they outperform two baseline approaches [5] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-28",
"text": "In [4] , the author claims that Wikipedia can be used to determine stances about controversial topics based on their previous work regarding controversy extraction on the Web."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-29",
"text": "Among more recent related work, in [1] stance detection for unseen targets is studied and bidirectional conditional encoding is employed."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-30",
"text": "e authors state that their approach achieves stateof-the art performance rates [1] on SemEval 2016 Twi er Stance Detection corpus [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-31",
"text": "In [3] , a stance-community detection approach called SCIFNET is proposed."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-32",
"text": "SCIFNET creates networks of people who are stance targets, automatically from the related document collections [3] using stance expansion and re nement techniques to arrive at stance-coherent networks."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-33",
"text": "A tweet data set annotated with stance information regarding six prede ned targets is proposed in [11] where this data set is annotated through crowdsourcing."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-34",
"text": "e authors indicate that the data set is also annotated with sentiment information in addition to stance, so it can help SIDEWAYS'17, July 2017, Prague, Czech Republic D. K\u00fc\u00e7\u00fck reveal associations between stance and sentiment [11] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-35",
"text": "Lastly, in [12] , SemEval 2016's aforementioned shared task on Twi er Stance Detection is described."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-36",
"text": "Also provided are the results of the evaluations of 19 systems participating in two subtasks (one with training data set provided and the other without an annotated data set) of the shared task [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-37",
"text": "In this paper, we present a tweet data set in Turkish annotated with stance information, where the corresponding annotations are made publicly available."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-38",
"text": "e domain of the tweets comprises two popular football clubs which constitute the targets of the tweets included."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-39",
"text": "We also provide the evaluation results of SVM classi ers (for each target) on this data set using unigram, bigram, and hashtag features."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-40",
"text": "To the best of our knowledge, the current study is the rst one to target at stance detection in Turkish tweets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-41",
"text": "Together with the provided annotated data set and the corresponding evaluations with the aforementioned SVM classi ers which can be used as baseline systems, our study will hopefully help increase social media analysis studies on Turkish content."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-42",
"text": "e rest of the paper is organized as follows: In Section 2, we describe our tweet data set annotated with the target and stance information."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-43",
"text": "Section 3 includes the details of our SVM-based stance classi ers and their evaluation results with discussions."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-44",
"text": "Section 4 includes future research topics based on the current study, and nally Section 5 concludes the paper with a summary."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-46",
"text": "**A STANCE DETECTION DATA SET**"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-47",
"text": "We have decided to consider tweets about popular sports clubs as our domain for stance detection."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-48",
"text": "Considerable amounts of tweets are being published for sports-related events at every instant."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-49",
"text": "Hence we have determined our targets as Galatasaray (namely Target-1) and Fenerbah\u00e7e (namely, Target-2) which are two of the most popular football clubs in Turkey."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-73",
"text": "We have used the SVM implementation available in the Weka data mining application [6] where this particular implementation employs the SMO algorithm [14] to train a classi er with a linear kernel."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-50",
"text": "As is the case for the sentiment analysis tools, the outputs of the stance detection systems on a stream of tweets about these clubs can facilitate the use of the opinions of the football followers by these clubs."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-51",
"text": "In a previous study on the identi cation of public health-related tweets, two tweet data sets in Turkish (each set containing 1 million random tweets) have been compiled where these sets belong to two di erent periods of 20 consecutive days [9] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-52",
"text": "We have decided to use one of these sets (corresponding to the period between August 18 and September 6, 2015) and rstly ltered the tweets using the possible names used to refer to the target clubs."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-53",
"text": "en, we have annotated the stance information in the tweets for these targets as Favor or Against."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-54",
"text": "Within the course of this study, we have not considered those tweets in which the target is not explicitly mentioned, as our initial ltering process reveals."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-55",
"text": "For the purposes of the current study, we have not annotated any tweets with the Neither class."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-56",
"text": "is stance class and even nergrained classes can be considered in further annotation studies."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-57",
"text": "We should also note that in a few tweets, the target of the stance was the management of the club while in some others a particular footballer of the club is praised or criticised."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-58",
"text": "Still, we have considered the club as the target of the stance in all of the cases and carried out our annotations accordingly."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-59",
"text": "At the end of the annotation process, we have annotated 700 tweets, where 175 tweets are in favor of and 175 tweets are against Target-1, and similarly 175 tweets are in favor of and 175 are against Target-2."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-60",
"text": "Hence, our data set is a balanced one although it is currently limited in size."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-61",
"text": "e corresponding stance annotations are made publicly available at http://ceng.metu.edu.tr/\u223ce120329/ Turkish Stance Detection Tweet Dataset.csv in Comma Separated Values (CSV) format."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-62",
"text": "e le contains three columns with the corresponding headers."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-63",
"text": "e rst column is the tweet id of the corresponding tweet, the second column contains the name of the stance target, and the last column includes the stance of the tweet for the target as Favor or Against."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-64",
"text": "To the best of our knowledge, this is the rst publicly-available stance-annotated data set for Turkish."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-65",
"text": "Hence, it is a signi cant resource as there is a scarcity of annotated data sets, linguistic resources, and NLP tools available for Turkish."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-66",
"text": "Additionally, to the best of our knowledge, it is also signi cant for being the rst stance-annotated data set including sports-related tweets, as previous stance detection data sets mostly include on-line texts on political/ethical issues."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-68",
"text": "**STANCE DETECTION EXPERIMENTS USING SVM CLASSIFIERS**"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-69",
"text": "It is emphasized in the related literature that unigram-based methods are reliable for the stance detection task [16] and similarly unigram-based models have been used as baseline models in studies such as [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-70",
"text": "In order to be used as a baseline and reference system for further studies on stance detection in Turkish tweets, we have trained two SVM classi ers (one for each target) using unigrams as features."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-71",
"text": "Before the extraction of unigrams, we have employed automated preprocessing to lter out the stopwords in our annotated data set of 700 tweets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-72",
"text": "e stopword list used is the list presented in [8] which, in turn, is the slightly extended version of the stopword list provided in [2] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-74",
"text": "e 10-fold cross-validation results of the two classi ers are provided in Table 1 using the metrics of precision, recall, and F-Measure."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-75",
"text": "e evaluation results are quite favorable for both targets and particularly higher for Target-1, considering the fact that they are the initial experiments on the data set."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-76",
"text": "e performance of the classi ers is be er for the Favor class for both targets when compared with the performance results for the Against class."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-77",
"text": "is outcome may be due to the common use of some terms when expressing positive stance towards sports clubs in Turkish tweets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-78",
"text": "e same percentage of common terms may not have been observed in tweets during the expression of negative stances towards the targets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-79",
"text": "Yet, completely the opposite pa ern is observed in stance detection results of baseline systems given in [12] , i.e., be er FMeasure rates have been obtained for the Against class when compared with the Favor class [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-80",
"text": "Some of the baseline systems reported in [12] are SVM-based systems using unigrams and ngrams as features similar to our study, but their data sets include all three stance classes of Favor, Against, and Neither, while our data set comprises only tweets classi ed as belonging to Favor or Against classes."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-81",
"text": "Another di erence is that the data sets in [12] have been divided into training and test sets, while in our study we provide 10-fold cross-validation results on the whole data set."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-82",
"text": "On the other hand, we should also note that SVM-based sentiment analysis systems (such as those given in [15] ) have been reported to achieve be er F-Measure rates for the Positive sentiment class when compared with the results obtained for the Negative class."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-83",
"text": "erefore, our evaluation results for each stance class seem to be in line with such sentiment analysis systems."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-84",
"text": "Yet, further experiments on the extended versions of our data set should be conducted and the results should again be compared with the stance detection results given in the literature."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-85",
"text": "We have also evaluated SVM classi ers which use only bigrams as features, as ngram-based classi ers have been reported to perform be er for the stance detection problem [12] ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-86",
"text": "However, we have observed that using bigrams as the sole features of the SVM classi ers leads to quite poor results."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-87",
"text": "is observation may be due to the relatively limited size of the tweet data set employed."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-88",
"text": "Still, we can conclude that unigram-based features lead to superior results compared to the results obtained using bigrams as features, based on our experiments on our data set."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-89",
"text": "Yet, ngram-based features may be employed on the extended versions of the data set to verify this conclusion within the course of future work."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-90",
"text": "With an intention to exploit the contribution of hashtag use to stance detection, we have also used the existence of hashtags in tweets as an additional feature to unigrams."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-91",
"text": "e corresponding evaluation results of the SVM classi ers using unigrams together the existence of hashtags as features are provided in Table 2 ."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-92",
"text": "When the results given in Table 2 are compared with the results in Table 1 , a slight decrease in F-Measure (0.5%) for Target-1 is observed, while the overall F-Measure value for Target-2 has increased by 1.8%."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-93",
"text": "Although we could not derive sound conclusions mainly due to the relatively small size of our data set, the increase in the performance of the SVM classi er Target-2 is an encouraging evidence for the exploitation of hashtags in a stance detection system."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-94",
"text": "We leave other ways of exploiting hashtags for stance detection as a future work."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-95",
"text": "To sum up, our evaluation results are signi cant as reference results to be used for comparison purposes and provides evidence for the utility of unigram-based and hashtag-related features in SVM classi ers for the stance detection problem in Turkish tweets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-97",
"text": "**FUTURE PROSPECTS**"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-98",
"text": "Future work based on the current study includes the following:"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-99",
"text": "\u2022 e presented stance-annotated data set for Turkish has been created by one annotator only (the author of this study), yet, the data set should be er be revised and extended through crowdsourcing facilities."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-100",
"text": "When employing such a procedure, other stance classes like Neither can be considered as well."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-101",
"text": "e procedure will improve the quality the data set as well as the quality of prospective systems to be trained and tested on it."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-102",
"text": "\u2022 Other features like emoticons (as commonly used for sentiment analysis), features based on hashtags, and ngram features can also be used by the classi ers and these classiers can be tested on larger data sets."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-103",
"text": "Other classi cation approaches could also be implemented and tested against our baseline classi ers."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-104",
"text": "Particularly, related methods presented in recent studies such as [12] can be tested on our data set."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-105",
"text": "\u2022 Lastly, the SVM classi ers utilized in this study and their prospective versions utilizing other features can be tested on stance data sets in other languages (such as English) for comparison purposes."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-107",
"text": "**CONCLUSION**"
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-108",
"text": "Stance detection is a considerably new research area in natural language processing and is considered within the scope of the wellstudied topic of sentiment analysis."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-109",
"text": "It is the detection of stance within text towards a target which may be explicitly speci ed in the text or not."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-110",
"text": "In this study, we present a stance-annotated tweet data set in Turkish where the targets of the annotated stances are two popular sports clubs in Turkey."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-111",
"text": "e corresponding annotations are made publicly-available for research purposes."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-112",
"text": "To the best of our knowledge, this is the rst stance detection data set for the Turkish language and also the rst sports-related stanceannotated data set."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-113",
"text": "Also presented in this study are SVM classi ers (one for each target) utilizing unigram and bigram features in addition to using the existence of hashtags as another feature."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-114",
"text": "10-fold cross validation results of these classi ers are presented which can be used as reference results by prospective systems."
},
{
"sent_id": "0bd3236100730487986ade49af24b9-C001-115",
"text": "Both the annotated data set and the classi ers with evaluations are signi cant since they are the initial contributions to stance detection problem in Turkish tweets."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"0bd3236100730487986ade49af24b9-C001-11"
],
[
"0bd3236100730487986ade49af24b9-C001-18"
],
[
"0bd3236100730487986ade49af24b9-C001-30"
],
[
"0bd3236100730487986ade49af24b9-C001-35"
],
[
"0bd3236100730487986ade49af24b9-C001-69"
]
],
"cite_sentences": [
"0bd3236100730487986ade49af24b9-C001-11",
"0bd3236100730487986ade49af24b9-C001-18",
"0bd3236100730487986ade49af24b9-C001-30",
"0bd3236100730487986ade49af24b9-C001-35",
"0bd3236100730487986ade49af24b9-C001-69"
]
},
"@DIF@": {
"gold_contexts": [
[
"0bd3236100730487986ade49af24b9-C001-74",
"0bd3236100730487986ade49af24b9-C001-76",
"0bd3236100730487986ade49af24b9-C001-78",
"0bd3236100730487986ade49af24b9-C001-79"
],
[
"0bd3236100730487986ade49af24b9-C001-80"
],
[
"0bd3236100730487986ade49af24b9-C001-81"
],
[
"0bd3236100730487986ade49af24b9-C001-85",
"0bd3236100730487986ade49af24b9-C001-86"
]
],
"cite_sentences": [
"0bd3236100730487986ade49af24b9-C001-79",
"0bd3236100730487986ade49af24b9-C001-80",
"0bd3236100730487986ade49af24b9-C001-81",
"0bd3236100730487986ade49af24b9-C001-85"
]
},
"@USE@": {
"gold_contexts": [
[
"0bd3236100730487986ade49af24b9-C001-85"
],
[
"0bd3236100730487986ade49af24b9-C001-104"
]
],
"cite_sentences": [
"0bd3236100730487986ade49af24b9-C001-85",
"0bd3236100730487986ade49af24b9-C001-104"
]
},
"@FUT@": {
"gold_contexts": [
[
"0bd3236100730487986ade49af24b9-C001-104"
]
],
"cite_sentences": [
"0bd3236100730487986ade49af24b9-C001-104"
]
}
}
},
"ABC_250a88831a4911f76acca3c9d318de_9": {
"x": [
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-2",
"text": "We describe an algorithm for Japanese analysis that does both base phrase chunking and dependency parsing simultaneously in linear-time with a single scan of a sentence."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-3",
"text": "In this paper, we show a pseudo code of the algorithm and evaluate its performance empirically on the Kyoto University Corpus."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-4",
"text": "Experimental results show that the proposed algorithm with the voted perceptron yields reasonably good accuracy."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-7",
"text": "Single scan algorithms of parsing are important for interactive applications of NLP."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-8",
"text": "For instance, such algorithms would be more suitable for robots accepting speech inputs or chatbots handling natural language inputs which should respond quickly in some situations even when human inputs are not clearly ended."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-9",
"text": "Japanese sentence analysis typically consists of three major steps, namely morphological analysis, bunsetsu (base phrase) chunking, and dependency parsing."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-10",
"text": "In this paper, we describe a novel algorithm that combines the last two steps into a single scan process."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-11",
"text": "The algorithm, which is an extension of Sassano's (2004) , allows us to chunk morphemes into base phrases and decide dependency relations of the phrases in a strict left-toright manner."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-12",
"text": "We show a pseudo code of the algorithm and evaluate its performance empirically with the voted perceptron on the Kyoto University Corpus (Kurohashi and Nagao, 1998) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-13",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-14",
"text": "**JAPANESE SENTENCE STRUCTURE**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-15",
"text": "In Japanese NLP, it is often assumed that the structure of a sentence is given by dependency relations Meg-ga kare-ni ano pen-wo age-ta."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-16",
"text": "Meg-subj to him that pen-acc give-past."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-17",
"text": "among bunsetsus."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-18",
"text": "A bunsetsu is a base phrasal unit and consists of one or more content words followed by zero or more function words."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-19",
"text": "In addition, most of algorithms of Japanese dependency parsing, e.g., (Sekine et al., 2000; Sassano, 2004) , assume the three constraints below."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-20",
"text": "(1) Each bunsetsu has only one head except the rightmost one."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-21",
"text": "(2) Dependency links between bunsetsus go from left to right. (3) Dependency links do not cross one another."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-22",
"text": "In other words, dependencies are projective."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-23",
"text": "A sample sentence in Japanese is shown in Figure 1 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-24",
"text": "We can see all the constraints are satisfied."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-25",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-26",
"text": "**PREVIOUS WORK**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-27",
"text": "As far as we know, there is no dependency parser that does simultaneously both bunsetsu chunking and dependency parsing and, in addition, does them with a single scan."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-28",
"text": "Most of the modern dependency parsers for Japanese require bunsetsu chunking (base phrase chunking) before dependency parsing (Sekine et al., 2000; Kudo and Matsumoto, 2002; Sassano, 2004) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-29",
"text": "Although wordbased parsers are proposed in (Mori et al., 2000; Mori, 2002) , they do not build bunsetsus and are not compatible with other Japanese dependency parsers."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-30",
"text": "Multilingual parsers of participants in the CoNLL 2006 shared task (Buchholz and Marsi, 2006) can handle Japanese sentences. But they are basically word-based."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-31",
"text": "Meg ga kare ni ano pen wo age-ta."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-32",
"text": "Meg subj him to that pen acc give-past."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-34",
"text": "**ALGORITHM**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-36",
"text": "**DEPENDENCY REPRESENTATION**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-37",
"text": "In our proposed algorithm, we use a morphemebased dependency structure instead of a bunsetsubased one."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-38",
"text": "The morpheme-based representation is carefully designed to convey the same information on dependency structure of a sentence without the loss from the bunsetsu-based one."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-39",
"text": "The rightmost morpheme of the bunsetsu t should modify the rightmost morpheme of the bunsetsu u when the bunsetsu t modifies the bunsetsu u. Every morpheme except the rightmost one in a bunsetsu should modify its following one."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-40",
"text": "The sample sentence in Figure 1 is converted to the sentence with our proposed morpheme-based representation in Figure 2 . Take for instance, the head of the 0-th bunsetsu \"Meg-ga\" is the 4-th bunsetsu \"age-ta.\" in Figure 1."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-41",
"text": "This dependency relation is represented by that the head of the morpheme \"ga\" is \"age-ta.\" in Figure 2 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-42",
"text": "The morpheme-based representation above cannot explicitly state the boundaries of bunsetsus."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-43",
"text": "Thus we add the type to every dependency relation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-44",
"text": "A bunsetsu boundary is represented by the type associated with every dependency relation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-45",
"text": "The type \"D\" represents that this relation is a dependency of two bunsetsus, while the type \"B\" represents a sequence of morphemes inside of a given bunsetsu."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-46",
"text": "In addition, the type \"O\", which represents that two morphemes do not have a dependency relation, is used in implementations of our algorithm with a trainable classifier."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-47",
"text": "Following this encoding scheme of the type of dependency relations bunsetsu boundaries exist just after the morphemes that have the type \"D\"."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-48",
"text": "Inserting \"|\" after every morpheme with \"D\" of the sentence in Figure 2 results in Meg-ga | kare-ni | ano | pen-wo | age-ta."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-49",
"text": "This is identical to the sentence with the bunsetsu-based representation in Figure 1 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-51",
"text": "**PSEUDO CODE FOR THE PROPOSED ALGORITHM**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-52",
"text": "The algorithm that we propose is based on (Sassano, 2004) , which is considered to be a simple form of shift-reduce parsing."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-53",
"text": "The pseudo code of our algorithm is presented in Figure 3 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-54",
"text": "Important variables here are h j and t j where j is an index of morphemes."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-55",
"text": "The variable h j holds the head ID and the variable t j has the type of dependency relation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-56",
"text": "For example, the head and the dependency relation type of \"Meg\" in Figure 2 are represented as h 0 = 1 and t 0 = \"B\" respectively."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-57",
"text": "The flow of the algorithm, which has the same structure as Sassano's (2004) , is controlled with a stack that holds IDs for modifier morphemes."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-58",
"text": "Decision of the relation between two morphemes is made in Dep(), which uses a machine learning-based classifier that supports multiclass prediction."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-59",
"text": "The presented algorithm runs in a left-to-right manner and its upper bound of the time complexity is O(n)."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-60",
"text": "Due to space limitation, we do not discuss its complexity here."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-61",
"text": "See (Sassano, 2004) for further details."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-62",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-63",
"text": "**EXPERIMENTS AND DISCUSSION**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-65",
"text": "**EXPERIMENTAL SET-UP**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-66",
"text": "Corpus For evaluation, we used the Kyoto University Corpus Version 2 (Kurohashi and Nagao, 1998) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-67",
"text": "The split for training/test/development is the same as in other papers, e.g., (Uchimoto et al., 1999) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-69",
"text": "**SELECTION OF A CLASSIFIER AND ITS SETTING**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-70",
"text": "We implemented a parser with the voted perceptron (VP) (Freund and Schapire, 1999) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-71",
"text": "We used a polynomial kernel and set its degree to 3 because cubic kernels proved to be effective empirically for Japanese parsing (Kudo and Matsumoto, 2002) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-72",
"text": "The number of epoch T of VP was selected using the development test set."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-73",
"text": "For multiclass prediction, we used the pairwise method (Kre\u00dfel, 1999) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-75",
"text": "**FEATURES**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-76",
"text": "We have designed rather simple features based on the common feature set (Uchimoto et al., 1999; Kudo and Matsumoto, 2002; Sassano, 2004) for bunsetsu-based parsers."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-77",
"text": "We use the following features for each morpheme: Gap features between two morphemes are also used since they have proven to be very useful and contribute to the accuracy (Uchimoto et al., 1999; Kudo and Matsumoto, 2002) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-78",
"text": "They are represented as a binary feature and include distance (1, 2, 3, 4 -10, or 11 \u2264), particles, parentheses, and punctuation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-79",
"text": "In our proposed algorithm basically two morphemes are examined to estimate their dependency relation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-80",
"text": "Context information about the current morphemes to be estimated would be very useful and we can incorporate such information into our model."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-81",
"text": "We assume that we have the j-th morpheme and the i-th one in Figure 3 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-82",
"text": "We also use the j \u2212 n, ..., j \u2212 1, j + 1, ..., j + n morphemes and the i \u2212 n, ..., i \u2212 1, i + 1, ..., i + n ones, where n Table 2 : Dependency accuracy."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-83",
"text": "The system with the previous method employs the algorithm (Sassano, 2004 ) with the voted perceptron."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-84",
"text": "is the size of the context window."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-85",
"text": "We examined 0, 1, 2 and 3 for n."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-87",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-88",
"text": "Accuracy Performances of our parser on the test set is shown in Table 1 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-89",
"text": "The dependency accuracy is the percentage of the morphemes that have a correct head."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-90",
"text": "The dependency type accuracy is the percentage of the morphemes that have a correct dependency type, i.e., \"B\" or \"D\"."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-91",
"text": "The bottom line of Table 1 shows the percentage of the morphemes that have both a correct head and a correct dependency type."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-92",
"text": "In all these measures we excluded the last morpheme in a sentence, which does not have a head and its associated dependency type."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-93",
"text": "The accuracy of dependency type in Table 1 is interpreted to be accuracy of base phrase (bunsetsu) chunking."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-94",
"text": "Very accurate chunking is achieved."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-95",
"text": "Next we examine the dependency accuracy."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-96",
"text": "In order to recognize how accurate it is, we compared the performance of our parser with that of the parser that uses one of previous methods."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-97",
"text": "We implemented a parser that employs the algorithm of (Sassano, 2004) with the commonly used features and runs with VP instead of SVM, which Sassano (2004) originally used."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-98",
"text": "His parser, which cannot do bunsetsu chunking, accepts only a chunked sentence and then produces a bunsetsu-based dependency structure."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-99",
"text": "Thus we cannot directly compare results with ours."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-100",
"text": "To enable us to compare them we gave bunsetsu chunked sentences by our parser to the parser of (Sassano, 2004) in the Kyoto University Corpus. And then we received results from the parser of (Sassano, 2004) , which are bunsetsu-based dependency structures, and converted them to morpheme-based structures that follow the scheme we propose in this paper."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-101",
"text": "Finally we have got results that have the compatible format and show a comparison with them in Table 2 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-102",
"text": "Although the bunsetsu-based parser outperformed slightly our morpheme-based parser in this experiment, it is still notable that our method yields comparable performance with even a single scan of a sentence for dependency parsing in addition to bunsetsu chunking."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-103",
"text": "According to the results in Table 2 , we suppose that performance of our parser roughly corresponds to about 86-87% in terms of bunsetsu-based accuracy."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-104",
"text": "Context Window Size Performance change depending on the size of context window is shown in Table 3 ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-105",
"text": "Among them the best size is 2."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-106",
"text": "In this case, we use ten morphemes to determine whether or not given two morphemes have a dependency relation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-107",
"text": "That is, to decide the relation of morphemes j and i (j < i), we use morphemes j\u22122, j\u22121, j, j+1, j+2 and i\u22122, i\u22121, i, i+1, i+2."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-109",
"text": "**RUNNING TIME AND ASYMPTOTIC TIME COMPLEXITY**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-110",
"text": "We have observed that the running time is proportional to the sentence length (Figure 4) ."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-111",
"text": "The theoretical time complexity of the proposed algorithm is confirmed with this observation."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-113",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-114",
"text": "We have described a novel algorithm that combines Japanese base phrase chunking and dependency parsing into a single scan process."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-115",
"text": "The proposed algorithm runs in linear-time with a single scan of a sentence."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-116",
"text": "In future work we plan to combine morphological analysis or word segmentation into our proposed algorithm."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-117",
"text": "We also expect that structure analysis of compound nouns can be incorporated by extending the dependency relation types."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-118",
"text": "Furthermore, we believe it would be interesting to discuss linguistically and psycholinguistically the differences between Japanese and other European languages such as English."
},
{
"sent_id": "250a88831a4911f76acca3c9d318de-C001-119",
"text": "We would like to know what differences lead to easiness of analyzing a Japanese sentence."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"250a88831a4911f76acca3c9d318de-C001-11"
],
[
"250a88831a4911f76acca3c9d318de-C001-52"
],
[
"250a88831a4911f76acca3c9d318de-C001-57"
],
[
"250a88831a4911f76acca3c9d318de-C001-76"
],
[
"250a88831a4911f76acca3c9d318de-C001-83"
],
[
"250a88831a4911f76acca3c9d318de-C001-97"
],
[
"250a88831a4911f76acca3c9d318de-C001-100"
]
],
"cite_sentences": [
"250a88831a4911f76acca3c9d318de-C001-11",
"250a88831a4911f76acca3c9d318de-C001-52",
"250a88831a4911f76acca3c9d318de-C001-57",
"250a88831a4911f76acca3c9d318de-C001-76",
"250a88831a4911f76acca3c9d318de-C001-83",
"250a88831a4911f76acca3c9d318de-C001-97",
"250a88831a4911f76acca3c9d318de-C001-100"
]
},
"@BACK@": {
"gold_contexts": [
[
"250a88831a4911f76acca3c9d318de-C001-18",
"250a88831a4911f76acca3c9d318de-C001-19",
"250a88831a4911f76acca3c9d318de-C001-20",
"250a88831a4911f76acca3c9d318de-C001-21"
],
[
"250a88831a4911f76acca3c9d318de-C001-28"
],
[
"250a88831a4911f76acca3c9d318de-C001-61"
]
],
"cite_sentences": [
"250a88831a4911f76acca3c9d318de-C001-19",
"250a88831a4911f76acca3c9d318de-C001-28",
"250a88831a4911f76acca3c9d318de-C001-61"
]
},
"@EXT@": {
"gold_contexts": [
[
"250a88831a4911f76acca3c9d318de-C001-83"
],
[
"250a88831a4911f76acca3c9d318de-C001-100"
]
],
"cite_sentences": [
"250a88831a4911f76acca3c9d318de-C001-83",
"250a88831a4911f76acca3c9d318de-C001-100"
]
},
"@DIF@": {
"gold_contexts": [
[
"250a88831a4911f76acca3c9d318de-C001-97",
"250a88831a4911f76acca3c9d318de-C001-98",
"250a88831a4911f76acca3c9d318de-C001-99"
]
],
"cite_sentences": [
"250a88831a4911f76acca3c9d318de-C001-97"
]
},
"@MOT@": {
"gold_contexts": [
[
"250a88831a4911f76acca3c9d318de-C001-97",
"250a88831a4911f76acca3c9d318de-C001-98",
"250a88831a4911f76acca3c9d318de-C001-99"
]
],
"cite_sentences": [
"250a88831a4911f76acca3c9d318de-C001-97"
]
}
}
},
"ABC_1042e7b6ef7b73f29ad75b193f9e3b_9": {
"x": [
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-2",
"text": "We introduce DSDS: a cross-lingual neural part-of-speech tagger that learns from disparate sources of distant supervision, and realistically scales to hundreds of low-resource languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-3",
"text": "The model exploits annotation projection, instance selection, tag dictionaries, morphological lexicons, and distributed representations, all in a uniform framework."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-4",
"text": "The approach is simple, yet surprisingly effective, resulting in a new state of the art without access to any gold annotated data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-7",
"text": "Low-resource languages lack manually annotated data to learn even the most basic models such as part-of-speech (POS) taggers."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-8",
"text": "To compensate for the absence of direct supervision, work in crosslingual learning and distant supervision has discovered creative use for a number of alternative data sources to learn feasible models: -aligned parallel corpora to project POS annotations to target languages (Yarowsky et al., 2001; Agi\u0107 et al., 2015; Fang and Cohn, 2016) , -noisy tag dictionaries for type-level approximation of full supervision (Li et al., 2012) , -combination of projection and type constraints (Das and Petrov, 2011; T\u00e4ckstr\u00f6m et al., 2013) , -rapid annotation of seed training data ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-9",
"text": "However, only one or two compatible sources of distant supervision are typically employed."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-10",
"text": "In reality severely under-resourced languages may require a more pragmatic \"take what you can get\" viewpoint."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-11",
"text": "Our results suggest that combining supervision sources is the way to go about creating viable low-resource taggers."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-12",
"text": "We propose a method to strike a balance between model simplicity and the capacity to easily integrate heterogeneous learning signals."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-13",
"text": "Our system is a uniform neural model for POS tagging that learns from disparate sources of distant supervision (DSDS)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-14",
"text": "We use it to combine: i) multi-source annotation projection, ii) instance selection, iii) noisy tag dictionaries, and iv) distributed word and sub-word representations."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-15",
"text": "We examine how far we can get by exploiting only the wide-coverage resources that are currently readily available for more than 300 languages, which is the breadth of the parallel corpus we employ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-16",
"text": "DSDS yields a new state of the art by jointly leveraging disparate sources of distant supervision in an experiment with 25 languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-17",
"text": "We demonstrate: i) substantial gains in carefully selecting high-quality instances in annotation projection, ii) the usefulness of lexicon features for neural tagging, and iii) the importance of word embeddings initialization for faster convergence."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-18",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-19",
"text": "**METHOD**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-20",
"text": "DSDS is illustrated in Figure 1 ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-21",
"text": "The base model is a bidirectional long short-term memory network (bi-LSTM) (Graves and Schmidhuber, 2005; Hochreiter and Schmidhuber, 1997; Plank et al., 2016; Kiperwasser and Goldberg, 2016) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-22",
"text": "Let x 1:n be a given sequence of input vectors."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-23",
"text": "In our base model, the input sequence consists of word embeddings w and the two output states of a character-level bi-LSTM c. Given x 1:n and a desired index i, the function BiRN N \u03b8 (x 1:n , i) (here instantiated as LSTM) reads the input sequence in forward and reverse order, respectively, and uses the concatenated (\u2022) output states as input for tag prediction at position i. 1 Our model differs from prior work on the type of input vectors x 1:n and distant data sources, in particular, we extend the input with lexicon embeddings, all described next."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-24",
"text": "Annotation projection."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-25",
"text": "Ever since the seminal work of Yarowsky et al. (2001) , projecting sequential labels from source to target languages has been one of the most prevalent approaches to crosslingual learning."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-26",
"text": "Its only requirement is that parallel texts are available between the languages, and that the source side is annotated for POS."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-27",
"text": "We apply the approach by Agi\u0107 et al. (2016) , where labels are projected from multiple sources and then decoded through weighted majority voting with word alignment probabilities and source POS tagger confidences."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-28",
"text": "We exploit their widecoverage Watchtower corpus (WTC), in contrast to the typically used Europarl data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-29",
"text": "Europarl covers 21 languages of the EU with 400k-2M sentence pairs, while WTC spans 300+ widely diverse languages with only 10-100k pairs, in effect sacrificing depth for breadth, and introducing a more radical domain shift."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-30",
"text": "However, as our results show little projected data turns out to be the most beneficial, reinforcing breadth for depth."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-31",
"text": "While Agi\u0107 et al. (2016) selected 20k projected sentences at random to train taggers, we propose a novel alternative: selection by coverage."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-32",
"text": "We rank the target sentences by percentage of words covered by word alignment from 21 sources of Agi\u0107 et al. (2016) , and select the top k covered instances for training."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-33",
"text": "In specific, we employ the mean coverage ranking of target sentences, whereby each target sentence is coupled with the arithmetic mean of the 21 individual word alignment coverages for each of the 21 source-language sentences."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-34",
"text": "We show that this simple approach to instance selection offers substantial improvements: across all languages, we learn better taggers with significantly fewer training instances."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-35",
"text": "Dictionaries."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-36",
"text": "Dictionaries are a useful source for distant supervision (Li et al., 2012; T\u00e4ckstr\u00f6m et al., 2013) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-37",
"text": "There are several ways to exploit such information: i) as type constraints during encoding (T\u00e4ckstr\u00f6m et al., 2013) , ii) to guide unsupervised learning (Li et al., 2012) , or iii) as additional signal at training."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-38",
"text": "We focus on the latter and evaluate two ways to integrate lexical knowledge into neural models, while comparing to the former two: a) by representing lexicon properties as n-hot vector (e.g., if a word has two properties according to lexicon src, it results in a 2-hot vector, if the word is not present in src, a zero vector), with m the number of lexicon properties; b) by embedding the lexical features, i.e., e src is a lexicon src embedded into an l-dimensional space."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-39",
"text": "We represent e src as concatenation of all embedded m properties of length l, and a zero vector otherwise."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-40",
"text": "Tuning on the dev set, we found the second embedding approach to perform best, and simple concatenation outperformed mean vector representations."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-41",
"text": "We evaluate two dictionary sources, motivated by ease of accessibility to many languages: WIK-TIONARY, a word type dictionary that maps tokens to one of the 12 Universal POS tags (Li et al., 2012; Petrov et al., 2012) ; and UNIMORPH, a morphological dictionary that provides inflectional paradigms across 350 languages (Kirov et al., 2016) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-42",
"text": "For Wiktionary, we use the freely available dictionaries from Li et al. (2012) and ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-43",
"text": "The size of the dictionaries ranges from a few thousands (e.g., Hindi and Bulgarian) to 2M (Finnish UniMorph)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-44",
"text": "Sizes are provided in Table 1 , first columns."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-45",
"text": "UniMorph covers between 8-38 morphological properties (for English and Finnish, respectively)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-46",
"text": "Word embeddings."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-47",
"text": "Embeddings are available for many languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-48",
"text": "Pre-initialization of w offers consistent and considerable performance improvements in our distant supervision setup (Section 4)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-49",
"text": "We use off-the-shelf Polyglot embeddings (AlRfou et al., 2013) , which performed consistently better than FastText (Bojanowski et al., 2016) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-51",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-52",
"text": "Baselines."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-53",
"text": "We compare to the following weaklysupervised POS taggers: -AGIC: Multi-source annotation projection with Bible parallel data by Agi\u0107 et al. (2015) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-54",
"text": "-DAS: The label propagation approach by Das and Petrov (2011) over Europarl data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-55",
"text": "-GARRETTE: The approach by that works with projections, dictionaries, and unlabeled target text."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-56",
"text": "-LI: Wiktionary supervision (Li et al., 2012) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-57",
"text": "Data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-58",
"text": "Our set of 25 languages is motivated by accessibility to embeddings and dictionaries."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-59",
"text": "In all experiments we work with the 12 Universal POS tags (Petrov et al., 2012) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-60",
"text": "For development, we use 21 dev sets of the Universal Dependencies 2.1 (Nivre et al., 2017) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-61",
"text": "We employ UD test sets on additional languages as well as the test sets of Agi\u0107 et al. (2015) to facilitate comparisons."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-62",
"text": "Their test sets are a mixture of CoNLL (Buchholz and Marsi, 2006; Nivre et al., 2007) and HamleDT test data (Zeman et al., 2014) , and are more distant from the training and development data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-63",
"text": "Model and parameters."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-64",
"text": "We extend an off-theshelf state-of-the-art bi-LSTM tagger with lexicon information."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-65",
"text": "The code is available at: https:// github.com/bplank/bilstm-aux."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-66",
"text": "The parameter l=40 was set on dev data across all languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-67",
"text": "Besides using 10 epochs, word dropout rate (p=.25) and 40-dimensional lexicon embeddings, we use the parameters from Plank et al. (2016) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-68",
"text": "For all experiments, we average over 3 randomly seeded runs, and provide mean accuracy."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-69",
"text": "For the learning curve, we average over 5 random samples with 3 runs each."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-70",
"text": "Table 1 shows the tagging accuracy for individual languages, while the means over all languages are given in Figure 2 ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-71",
"text": "There are several take-aways."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-73",
"text": "**RESULTS**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-74",
"text": "Data selection."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-75",
"text": "The first take-away is that coverage-based instance selection yields substantially better training data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-76",
"text": "Most prior work on annotation projection resorts to arbitrary selection; informed selection clearly helps in this noisy data setup, as shown in Figure 2 (a)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-77",
"text": "Training on 5k instances results in a sweet spot; more data (10k) starts to decrease performance, at a cost of runtime."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-78",
"text": "Training on all WTC data (around 120k) is worse for most languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-79",
"text": "From now on we consider the 5k model trained with Polyglot as our baseline (Table 1 , column \"5k\"), obtaining a mean accuracy of 83.0 over 21 languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-80",
"text": "Embeddings initialization."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-81",
"text": "Polyglot initialization offers a large boost; on average +3.8% absolute improvement in accuracy for our 5k training scheme, as shown in Figure 2 (b) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-82",
"text": "The big gap in low-resource setups further shows their effectiveness, with up to 10% absolute increase in accuracy when training on only 500 instances."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-83",
"text": "Lexical information."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-84",
"text": "The main take-away is that lexical information helps neural tagging, and embedding it proves the most helpful."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-85",
"text": "Embedding Wiktionary tags reaches 83.7 accuracy on average, versus 83.4 for n-hot encoding, and 83.2 for type constraints."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-86",
"text": "Only on 4 out of 21 languages are type constraints better."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-87",
"text": "This is the case for only one language for n-hot encoding (French)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-88",
"text": "The best approach is to embed both Wiktionary and Unimorph, boosting performance further to 84.0, and resulting in our final model."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-89",
"text": "It helps the most on morphological rich languages such as Uralic."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-90",
"text": "On the test sets (Table 4 , right) DSDS reaches 87.2 over 8 test languages intersecting Li et al. (2012) and Agi\u0107 et al. (2016) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-91",
"text": "It reaches 86.2 over the more commonly used 8 languages of Das and Petrov (2011) , compared to their 83.4."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-92",
"text": "This shows that our novel \"soft\" inclusion of noisy dictionaries is superior to a hard decoding restriction, and including lexicons in neural taggers helps."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-93",
"text": "We did not assume any gold data to further enrich the lexicons, nor fix possible tagset divergences."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-95",
"text": "**DISCUSSION**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-96",
"text": "Analysis."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-97",
"text": "The inclusion of lexicons results in higher coverage and is part of the explanation for the improvement of DSDS; see correlation in Figure 3 (a) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-98",
"text": "What is more interesting is that our model benefits from the lexicon beyond its content: OOV accuracy for words not present in the lexicon overall improves, besides the expected improvement on known OOV, see Figure 3 (b)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-99",
"text": "Best result in boldface; in case of equal means, the one with lower std is boldfaced."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-100",
"text": "Averages over language families (with two or more languages in the sample, number of languages in parenthesis)."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-101",
"text": "More languages."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-102",
"text": "All data sources employed in our experiment are very high-coverage."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-103",
"text": "However, for true low-resource languages, we cannot safely assume the availability of all disparate information sources."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-104",
"text": "Table 2 presents results for four additional languages where some supervision sources are missing."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-105",
"text": "We observe that adding lexicon information always helps, even in cases where only 1k entries are available, and embedding it is usually the most beneficial way."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-106",
"text": "For closely-related languages such as Serbian and Croatian, using resources for one aids tagging the other, and modern resources are a better fit."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-107",
"text": "For example, using the Croatian WTC projections to train a model for Serbian is preferable over in-language Serbian Bible data where the OOV rate is much higher."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-108",
"text": "How much gold data?"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-109",
"text": "We assume not having access to any gold annotated data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-110",
"text": "It is thus interesting to ask how much gold data is needed to reach our performance."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-111",
"text": "This is a tricky question, as training within the same corpus naturally favors the same corpus data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-112",
"text": "We test both in-corpus (UD) and out-of-corpus data (our test sets) and notice an important gap: while in-corpus only 50 sentences are sufficient, outside the corpus one would need over 200 sentences."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-113",
"text": "This experiment was done for a subset of 18 languages with both in-and out-ofcorpus test data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-114",
"text": "Further comparison."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-115",
"text": "In Table 1 we directly report the accuracies from the original contributions by DAS, LI, GARRETTE, and AGIC over the same test data."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-116",
"text": "We additionally attempted to reach the scores of LI by running their tagger over the Table 1 data setup."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-117",
"text": "The results are depicted in Figure 4 as mean accuracies over EM iterations until convergence."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-118",
"text": "We show: i) LI peaks at 10 iterations for their test languages, and at 35 iterations for all the rest."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-119",
"text": "This is in slight contrast to 50 iterations that Li et al. (2012) recommend, although selecting 50 does not dramatically hurt the scores; ii) Our replication falls \u223c5 points short of their 84.9 accuracy."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-120",
"text": "There is a large 33-point accuracy gap between the scores of Li et al. (2012) , where the dictionaries are large, and the other languages in Figure 4 , with smaller dictionaries."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-121",
"text": "Compared to DAS, our tagger clearly benefits from pre-trained word embeddings, while theirs relies on label propagation through Europarl, a much cleaner corpus that lacks the coverage of the noisier WTC."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-122",
"text": "Similar applies to T\u00e4ckstr\u00f6m et al. (2013) , as they use 1-5M near-perfect parallel sentences."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-123",
"text": "Even if we use much smaller and noisier data sources, DSDS is almost on par: 86.2 vs. 87.3 for the 8 languages from Das and Petrov (2011) , and we even outperform theirs on four languages: Czech, French, Italian, and Spanish."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-125",
"text": "**RELATED WORK**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-126",
"text": "Most successful work on low-resource POS tagging is based on projection (Yarowsky et al., 2001) , tag dictionaries (Li et al., 2012) , annotation of seed training data or even more recently some combination of these, e.g., via multi-task learning (Fang and not Li et al. (2012) Figure 4: The performance of LI with our dictionary data over EM iterations, separate for the languages from Li et al. (2012) and all the remaining languages in Table 1 . Cohn, 2016; Kann et al., 2018) ."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-127",
"text": "Our paper contributes to this literature by leveraging a range of prior directions in a unified, neural test bed."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-128",
"text": "Most prior work on neural sequence prediction follows the commonly perceived wisdom that hand-crafted features are unnecessary for deep learning methods."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-129",
"text": "They rely on end-to-end training without resorting to additional linguistic resources."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-130",
"text": "Our study shows that this is not the case."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-131",
"text": "Only few prior studies investigate such sources, e.g., for MT (Sennrich and Haddow, 2016; Chen et al., 2017; Li et al., 2017; Passban et al., 2018) and Sagot and Mart\u00ednez Alonso (2017) for POS tagging use lexicons, but only as n-hot features and without examining the cross-lingual aspect."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-132",
"text": "----------------------------------"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-133",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-134",
"text": "We show that our approach of distant supervision from disparate sources (DSDS) is simple yet surprisingly effective for low-resource POS tagging."
},
{
"sent_id": "1042e7b6ef7b73f29ad75b193f9e3b-C001-135",
"text": "Only 5k instances of projected data paired with off-the-shelf embeddings and lexical information integrated into a neural tagger are sufficient to reach a new state of the art, and both data selection and embeddings are essential components to boost neural tagging performance."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-8"
],
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-36"
],
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-120"
],
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-126"
]
],
"cite_sentences": [
"1042e7b6ef7b73f29ad75b193f9e3b-C001-8",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-36",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-120",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-126"
]
},
"@MOT@": {
"gold_contexts": [
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-8",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-9"
]
],
"cite_sentences": [
"1042e7b6ef7b73f29ad75b193f9e3b-C001-8"
]
},
"@USE@": {
"gold_contexts": [
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-41"
],
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-42"
],
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-53",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-56"
]
],
"cite_sentences": [
"1042e7b6ef7b73f29ad75b193f9e3b-C001-41",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-42",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-56"
]
},
"@SIM@": {
"gold_contexts": [
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-90"
]
],
"cite_sentences": [
"1042e7b6ef7b73f29ad75b193f9e3b-C001-90"
]
},
"@DIF@": {
"gold_contexts": [
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-119"
],
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-126",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-127"
]
],
"cite_sentences": [
"1042e7b6ef7b73f29ad75b193f9e3b-C001-119",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-126"
]
},
"@EXT@": {
"gold_contexts": [
[
"1042e7b6ef7b73f29ad75b193f9e3b-C001-126",
"1042e7b6ef7b73f29ad75b193f9e3b-C001-127"
]
],
"cite_sentences": [
"1042e7b6ef7b73f29ad75b193f9e3b-C001-126"
]
}
}
},
"ABC_26743b7d006e485be1b850a4424a5f_9": {
"x": [
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-2",
"text": "We describe an approach for machine learning-based empty category detection that is based on the phrase structure analysis of Japanese."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-3",
"text": "The problem is formalized as tree node classification, and we find that the path feature, the sequence of node labels from the current node to the root, is highly effective."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-4",
"text": "We also find that the set of dot products between the word embeddings for a verb and those for case particles can be used as a substitution for case frames."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-5",
"text": "Experiments show that the proposed method outperforms the previous state-of the art method by 68.6% to 73.2% in terms of F-measure."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-8",
"text": "Empty categories are phonetically null elements that are used for representing dropped pronouns (\"pro\" or \"small pro\"), controlled elements (\"PRO\" or \"big pro\") and traces of movement (\"T\" or \"trace\"), such as WH-questions and relative clauses."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-9",
"text": "They are important for pro-drop languages such as Japanese, in particular, for the machine translation from pro-drop languages to nonpro-drop languages such as English."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-10",
"text": "Chung and Gildea (2010) reported their recover of empty categories improved the accuracy of machine translation both in Korean and in Chinese."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-11",
"text": "Kudo et al. (2014) showed that generating zero subjects in Japanese improved the accuracy of preorderingbased translation."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-12",
"text": "State-of-the-art statistical syntactic parsers had typically ignored empty categories."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-13",
"text": "Although Penn Treebank (Marcus et al., 1993) has annotations on PRO and trace, they provide only labeled bracketing."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-14",
"text": "Johnson (2002) proposed a statistical pattern-matching algorithm for post-processing the results of syntactic parsing based on minimal unlexicalized tree fragments from empty node to its antecedent."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-15",
"text": "Dienes and Dubey (2003) proposed a machine learning-based \"trace tagger\" as a preprocess of parsing."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-16",
"text": "Campbell (2004) proposed a rule-based post-processing method based on linguistically motivated rules."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-17",
"text": "Gabbard et al. (2006) replaced the rules with machine learning-based classifiers."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-18",
"text": "Schmid (2006) and Cai et al. (2011) integrated empty category detection with the syntactic parsing."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-19",
"text": "Empty category detection for pro (dropped pronouns or zero pronoun) has begun to receive attention as the Chinese Penn Treebank (Xue et al., 2005) has annotations for pro as well as PRO and trace."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-20",
"text": "Xue and Yang (2013) formalized the problem as classifying each pair of the location of empty category and its head word in the dependency structure."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-21",
"text": "Wang et al. (2015) proposed a joint embedding of empty categories and their contexts on dependency structure."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-22",
"text": "Xiang et al. (2013) formalized the problem as classifying each IP node (roughly corresponds to S and SBAR in Penn Treebank) in the phrase structure."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-23",
"text": "In this paper, we propose a novel method for empty category detection for Japanese that uses conjunction features on phrase structure and word embeddings."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-24",
"text": "We use the Keyaki Treebank (Butler et al., 2012) , which is a recent development."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-25",
"text": "As it has annotations for pro and trace, we show our method has substantial improvements over the state-of-the-art machine learning-based method (Xiang et al., 2013) for Chinese empty category detection as well as linguistically-motivated manually written rule-based method similar to (Campbell, 2004 )."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-27",
"text": "**BASELINE SYSTEMS**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-28",
"text": "The Keyaki Treebank annotates the phrase structure with functional information for Japanese sentences following a scheme adapted from the Annotation manual for the Penn Historical Corpora and Figure 1: An annotation example of (*pro* brought back a daughter who ran away from home.) in Keyaki Treebank."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-29",
"text": "(The left tree is the original tree and the right tree is a converted tree based on Xiang et al.'s (2013) formalism) the PCEEC (Santorini, 2010) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-30",
"text": "There are some major changes: the VP level of structure is typically absent, function is marked on all clausal nodes (such as IP-REL and CP-THT) and all NPs that are clause level constituents (such as NP-SBJ)."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-31",
"text": "Disambiguation tags are also used for clarifying the functions of its immediately preceding node, such as NP-OBJ * *(wo) for PP, however, we removed them in our experiment."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-32",
"text": "Keyaki Treebank has annotation for trace markers of relative clauses (*T*) and dropped pronouns (*pro*), however, it deliberately has no annotation for control dependencies (PRO) (Butler et al., 2015) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-33",
"text": "It has also fine grained empty categories of *pro* such as *speaker* and *hearer*, but we unified them into *pro* in our experiment."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-34",
"text": "HARUNIWA (Fang et al., 2014 ) is a Japanese phrase structure parser trained on the treebank."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-35",
"text": "It has a rule-based post-processor for adding empty categories, which is similar to (Campbell, 2004) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-36",
"text": "We call it RULE in later sections and use it as one of two baselines."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-37",
"text": "We also use Xiang et al's (2013) model as another baseline."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-38",
"text": "It formulates empty category detection as the classification of IP nodes."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-39",
"text": "For example, in Figure 1 , empty nodes in the left tree are removed and encoded as additional labels with its position information to IP nodes in the right tree."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-40",
"text": "As we can uniquely decode them from the extended IP labels, the problem is to predict the labels for the input tree that has no empty nodes."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-41",
"text": "Let T = t 1 t 2 \u00b7 \u00b7 \u00b7 t n be the sequence of nodes produced by the post-order traversal from root node, and e i be the empty category tag associated with t i ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-42",
"text": "The probability model of (Xiang et al., 2013) is formulated as MaxEnt model:"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-43",
"text": "where \u03c6 is a feature vector, \u03b8 is a weight vector to \u03c6 and Z is normalization factor:"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-44",
"text": "where E represents the set of all empty category types to be detected."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-87",
"text": "distributed case frame (DCF), we used an existing case frame lexicon (Kawahara and Kurohashi, 2006) and tested three different ways of encoding the case frame information: BIN encodes each case as binary features."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-45",
"text": "Xiang et al. (2013) grouped their features into four types: tree label features, lexical features, empty category features and conjunction features as shown in Table 1 ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-46",
"text": "As the features for (Xiang et al., 2013) were developed for Chinese Penn Treebank, we modify their features for Keyaki Treebank: First, the traversal order is changed from post-order (bottom-up) to pre-order (top-down)."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-47",
"text": "As PROs are implicit in Keyaki Treebank, the decisions on IPs in lower levels depend on those on higher levels in the tree."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-48",
"text": "Second, empty category features are extracted from ancestor IP nodes, not from descendant IP nodes, in accordance with the first change."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-49",
"text": "Table 2 shows the accuracies of Japanese empty category detection, using the original and our modification of the (Xiang et al., 2013) with ablation test."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-50",
"text": "We find that the conjunction features left-sibling label or POS tag (up to two siblings) 10 right-sibling label or POS tag (up to two siblings) Lexical features 11 left-most word under the current node 12 right-most word under the current node 13 word immediately left to the span of the current node 14 word immediately right to the span of the current node 15 head word of the current node 16 head word of the parent node 17 is the current node head child of its parent?"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-51",
"text": "(binary) Empty category features 18 predicted empty categories of the left sibling 19* the set of detected empty categories of ancestor nodes Conjunction features 20 current node label with parent node label 21* current node label with features computed from ancestor nodes 22 current node label with features computed from leftsibling nodes 23 current node label with lexical features (Xiang et al., 2013) 68.2 \u22120.40 modified (Xiang et al., 2013) 68.6 -\u2212 Tree label 68.6 \u22120.00 \u2212 Empty category 68.3 \u22120.30 \u2212 Lexicon 68.6 \u22120.00 \u2212 Conjunction 58.5 \u221210.1 Table 2 : Ablation result of (Xiang et al., 2013) are highly effective compared to the three other features."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-52",
"text": "This observation leads to the model proposed in the next section."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-54",
"text": "**PROPOSED MODEL**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-55",
"text": "In the proposed model, we use combinations of path features and three other features, namely head word feature, child feature and empty category feature."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-56",
"text": "Path feature (PATH) is a sequence of nonterminal labels from the current node to the ancestor nodes up to either the root node or the nearest CP node."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-57",
"text": "For example, in Figure 1 , if the current node is IP-REL, four paths are extracted; IP-REL, IP-REL \u2192 NP, IP-REL \u2192 NP \u2192 PP and IP-REL \u2192 NP \u2192 PP \u2192 IP-MAT."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-58",
"text": "Head word feature (HEAD) is the surface form of the lexical head of the current node."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-59",
"text": "Child feature (CHILD) is the set of labels for the children of the current node."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-60",
"text": "The label is augmented with the surface form of the rightmost terminal node if it is a function word."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-61",
"text": "In the example of Figure 1 , if the current node is IP-MAT, HEAD is (tsure) and CHILD includes: PP-(wo), VB, VB2, AXD-(ta) and PU-."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-62",
"text": "Empty category feature (EC) is a set of empty categories detected in the ancestor IP nodes."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-63",
"text": "For example in Figure 1 , if the current node is IP-REL, EC is *pro*."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-64",
"text": "We then combine the PATH with others."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-65",
"text": "If the current node is the IP-MAT node in right-half of Figure 1 , the combination of PATH and HEAD is:IP-MAT\u00d7 (tsure) and the combinations of PATH and CHILD are: IP-MAT\u00d7PP-(wo), IP-MAT\u00d7VB, IP-MAT\u00d7VB2, IP-MAT\u00d7AXD-(ta) and IP-MAT\u00d7PU-."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-67",
"text": "**USING WORD EMBEDDING TO APPROXIMATE CASE FRAME LEXICON**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-68",
"text": "A case frame lexicon would be obviously useful for empty category detection because it provides information on the type of argument the verb in question takes."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-69",
"text": "The problem is that case frame lexicon is not usually readily available."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-70",
"text": "We propose a novel method to approximate case frame lexicon for languages with explicit case marking such as Japanese using word embeddings."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-71",
"text": "According to (Pennington et al., 2014) , they designed their embedding model GloVe so that the dot product of two word embeddings approximates the logarithm of their co-occurrence counts."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-72",
"text": "Using this characteristic, we can easily make a feature that approximate the case frame of a verb."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-73",
"text": "Given a set of word embeddings for case particles q 1 , q 2 , \u00b7 \u00b7 \u00b7 , q N \u2208 Q, the distributed case frame feature (DCF) for a verb w i is defined as:"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-74",
"text": "In our experiment, we used a set of high frequency case particles (ga), (ha), (mo), (no), (wo), (ni), (he) and (kara) as Q. Table 3 ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-75",
"text": "We used GloVe as word embedding, Wikipedia articles in Japanese as of January 18, 2015, are used for training, which amounted to 660 million words and 23.4 million sentences."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-76",
"text": "By using the development set, we set the dimension of word embedding and the window size for co-occurrence counts as 200 and 10, respectively."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-78",
"text": "**RESULT AND DISCUSSION**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-79",
"text": "We tested in two conditions: gold parse and system parse."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-80",
"text": "In gold parse condition, we used the trees of Keyaki Treebank without empty categories as input to the systems."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-81",
"text": "In system parse condition, we used the output of the Berkeley Parser model of HARUNIWA before rule-based empty category detection 1 ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-82",
"text": "We evaluated them using the word-position-level identification metrics described in (Xiang et al., 2013) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-83",
"text": "It projects the predicted empty category tags to the surface level."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-84",
"text": "An empty node is regarded as correctly predicted surface position in the sentence, type (T or pro) and function (SBJ, OB1 and so on) are matched with the reference."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-85",
"text": "To evaluate the effectiveness of the proposed 1 There are two models available in HARUNIWA, namely the BitPar model (Schmid, 2004) and Berkeley Parser binary branching model (Petrov and Klein, 2007) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-86",
"text": "The output of the later is first flattened, then added disambiguation tags and empty categories using tsurgeon script (Levy and Andrew, 2006) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-88",
"text": "SET encodes each combination of required cases as a binary feature."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-89",
"text": "DIST is a vector of co-occurrence counts for each case particle, which can be thought of an unsmoothed version of our DCF."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-90",
"text": "Table 4 shows the accuracies of various empty category detection methods, for both gold parse and system parse."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-91",
"text": "In the gold parse condition, the two baselines, the rule-based method (RULE) and the modified (Xiang et al., 2013) method, achieved the F-measure of 62.6% and 68.6% respectively."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-92",
"text": "We also implemented the third baseline based on (Johnson, 2002) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-93",
"text": "Minimal unlexicalized tree fragments from empty node to its antecedent were extracted as pattern rules based on corpus statistics."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-94",
"text": "For *pro*, which has no antecedent, we used the statistics from empty node to the root."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-95",
"text": "Although the precision of the method is high, the recall is very low, which results in the F-measure of 38.1%."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-96",
"text": "Among the proposed models, the combination of path feature and child feature (PATH \u00d7 CHILD) even outperformed the baselines."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-97",
"text": "It reached 73.2% with all features."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-98",
"text": "As for the result of systemparse condition, the F-measure dropped considerably from 73.2% to 54.7% mostly due to the parsing errors on the IP nodes and its function."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-99",
"text": "We find that there are no significant differences among the different encodings of the case frame lexicon, and the improvement brought by the proposed distributed case frame is comparable to the existing case frame lexicon."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-101",
"text": "**CONCLUSION**"
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-102",
"text": "In this paper, we proposed a novel model for empty category detection in Japanese using path features and the distributed case frames."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-103",
"text": "Although it achieved fairly high accuracy for the gold parse, there is much room for improvement when applied to the output of a syntactic parser."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-104",
"text": "Since the accuracy of the empty category detection implemented as a post-process highly depends on that of the underlying parser, we want to explore models that can solve them jointly, such as the lattice parsing approach of (Cai et al., 2011) ."
},
{
"sent_id": "26743b7d006e485be1b850a4424a5f-C001-105",
"text": "We would like to report the results in the future version of this paper."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"26743b7d006e485be1b850a4424a5f-C001-22"
],
[
"26743b7d006e485be1b850a4424a5f-C001-42",
"26743b7d006e485be1b850a4424a5f-C001-43",
"26743b7d006e485be1b850a4424a5f-C001-44"
],
[
"26743b7d006e485be1b850a4424a5f-C001-45"
]
],
"cite_sentences": [
"26743b7d006e485be1b850a4424a5f-C001-22",
"26743b7d006e485be1b850a4424a5f-C001-42",
"26743b7d006e485be1b850a4424a5f-C001-45"
]
},
"@DIF@": {
"gold_contexts": [
[
"26743b7d006e485be1b850a4424a5f-C001-25"
],
[
"26743b7d006e485be1b850a4424a5f-C001-91",
"26743b7d006e485be1b850a4424a5f-C001-96"
]
],
"cite_sentences": [
"26743b7d006e485be1b850a4424a5f-C001-25",
"26743b7d006e485be1b850a4424a5f-C001-91"
]
},
"@USE@": {
"gold_contexts": [
[
"26743b7d006e485be1b850a4424a5f-C001-37"
],
[
"26743b7d006e485be1b850a4424a5f-C001-49"
],
[
"26743b7d006e485be1b850a4424a5f-C001-82"
],
[
"26743b7d006e485be1b850a4424a5f-C001-91"
]
],
"cite_sentences": [
"26743b7d006e485be1b850a4424a5f-C001-37",
"26743b7d006e485be1b850a4424a5f-C001-49",
"26743b7d006e485be1b850a4424a5f-C001-82",
"26743b7d006e485be1b850a4424a5f-C001-91"
]
},
"@EXT@": {
"gold_contexts": [
[
"26743b7d006e485be1b850a4424a5f-C001-46"
]
],
"cite_sentences": [
"26743b7d006e485be1b850a4424a5f-C001-46"
]
}
}
},
"ABC_7d80c3cc15453ddeaea72dcb9c04f9_9": {
"x": [
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-2",
"text": "Matrix factorization of knowledge bases in universal schema has facilitated accurate distantlysupervised relation extraction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-3",
"text": "This factorization encodes dependencies between textual patterns and structured relations using lowdimensional vectors defined for each entity pair; although these factors are effective at combining evidence for an entity pair, they are inaccurate on rare pairs, or for relations that depend crucially on the entity types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-4",
"text": "On the other hand, tensor factorization is able to overcome these shortcomings when applied to link prediction by maintaining entity-wise factors."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-5",
"text": "However these models have been unsuitable for universal schema."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-6",
"text": "In this paper we first present an illustration on synthetic data that explains the unsuitability of tensor factorization to relation extraction with universal schemas."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-7",
"text": "Since the benefits of tensor and matrix factorization are complementary, we then investigate two hybrid methods that combine the benefits of the two paradigms."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-8",
"text": "We show that the combination can be fruitful: we handle ambiguously phrased relations, achieve gains in accuracy on real-world relations, and demonstrate that entity embeddings encode entity types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-11",
"text": "Distantly-supervised relation extraction has gained prominence as it utilizes automatically aligned data to train accurate extractors."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-12",
"text": "Universal schema, in particular, has found impressive accuracy gains by (1) treating the distant-supervision as a knowledge-base (KB) containing both structured relations such as bornIn * First two authors contributed equally to the paper. and surface form relations such as \"was born in\" extracted from text, and (2) by completing the entries in such a KB using joint and compact encoding of the dependencies between the relations (Riedel et al., 2013; Fan et al., 2014; ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-13",
"text": "Matrix factorization is at the core of this completion: Riedel et al. (2013) convert the KB into a binary matrix with entity-pairs forming the rows and relations forming the columns."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-14",
"text": "Factorization of this matrix results in low-dimensional factors for entity-pairs and relations, which are able to effectively combine multiple evidence for each entity pair to predict unseen relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-15",
"text": "An important shortcoming of this matrix factorization model for universal schema is that no information is shared between the rows that contain the same entity."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-16",
"text": "This can significantly impact accuracy on pairs of entities that are not mentioned together frequently, and for relations that depend crucially on fine-grained entity types, such as schoolAttended, nationality, and bookAuthor."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-17",
"text": "On the other hand, tensor factorization for knowledge-base completion maintains perentity factors that combine evidence from all the relations an entity participates in, to predict its relations to other entities -a task known as link prediction (Nickel et al., 2012; Bordes et al., 2013) ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-18",
"text": "These entity factors, as opposed to pairwise factors in matrix factorization, can be quite effective in identifying the latent, fine-grained entity types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-19",
"text": "Thus, in the light of the above problems of matrix factorization, the use of tensor factorization for universal schema is tempting."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-20",
"text": "However, directly applying tensor factorization to universal schema has not been successful."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-21",
"text": "Strong results were obtained only through a combination with matrix factorization predictions, and the use of predefined type information ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-22",
"text": "In this paper, we explore the application of matrix and tensor factorization for universal schema data."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-23",
"text": "On simple, synthetic relations, we contrast the representational capabilities of these methods (in \u00a7 3.1) and investigate their benefits and shortcomings."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-24",
"text": "We then propose two hybrid tensor and matrix factorization approaches that, by combining their complementary advantages, is able to overcome the shortcomings on synthetic data."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-25",
"text": "We also present improved accuracy on real-world relation extraction data, and demonstrate that the entity embeddings are effective at encoding entity types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-27",
"text": "**MATRIX AND TENSOR FACTORIZATION**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-28",
"text": "In this section we introduce universal schemas and various factorization models that can be used to complete knowledge bases of such schemas."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-30",
"text": "**UNIVERSAL SCHEMA**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-31",
"text": "A universal schema is defined as the union of all OpenIE-like surface form patterns found in text and fixed canonical relations that exist in a knowledge base (Riedel et al., 2013) ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-32",
"text": "The task here is to complete this schema by jointly reasoning over surface form patterns and relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-33",
"text": "A successful approach to this joint reasoning is to embed both kinds of relations into the same low-dimensional embedding space, which can be achieved by matrix or tensor factorization methods."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-34",
"text": "We will study such representations for universal schema in this paper."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-36",
"text": "**MATRIX FACTORIZATION WITH FACTORS OVER ENTITY-PAIRS**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-37",
"text": "In matrix factorization for universal schema, Riedel et al. (2013) construct a sparse binary matrix of size |P| \u00d7 |R| whose rows are indexed by entity-pairs (a, b) \u2208 P and columns by surface form and Freebase relations s \u2208 R. Subsequently, generalized PCA (Collins et al., 2001 ) is used to find a rank-k factorization, i.e., with relation factors r \u2208 R |R|\u00d7k and entity-pair factors p \u2208 R |P|\u00d7k , the probability of a relation s and two entities a and b is:"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-38",
"text": "where \u03c3 is the sigmoid function."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-39",
"text": "Using this factorization, similar entity-pairs and relations are embedded close to each other in a k-dimensional vector space."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-40",
"text": "Since this model uses embeddings for pairs of entities, as opposed to per-entity embeddings, we refer to such models as pairwise models."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-41",
"text": "Pairwise embeddings are especially suitable when working with universal schema data, since they can represent correlations between surface pattern relations and structured relations compactly."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-42",
"text": "Furthermore, they combine multiple evidences specific to an entity-pair to predict a relation between them."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-43",
"text": "Since the observed data matrix contains only true entries, the parameters are learned using Bayesian personalized Ranking (Rendle et al., 2009 ) that supports implicit feedback."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-44",
"text": "Riedel et al. (2013) explore a number of variants of this factorization, including a neighborhood model that learns local classifiers, and an entity model that includes entity representations (we revisit this formulation in Section 2.3.4)."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-45",
"text": "In the rest of this paper we will only use the basic factorization model (referred to as Model F) as the primary pairwise embedding model, however the ideas apply directly to these variants as well."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-46",
"text": "There are a few shortcomings of models that rely solely on pairwise embeddings."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-47",
"text": "To learn an appropriate representation of an entity-pair, the two entities need to be mentioned together frequently, which is not the case for many entity-pairs of interest."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-48",
"text": "Since predicting relations often relies on the entity types, this lack of ample relational evidence for an entity pair can result in poor estimation of their types, and hence, of their relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-49",
"text": "Further, a large number of pairwise relation instances (relative to the number of entities) results in a large number of model parameters, leading to scalability concerns."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-51",
"text": "**TENSOR FACTORIZATION WITH ENTITY FACTORS**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-52",
"text": "Instead of using a matrix, it can be natural to represent the binary relations in universal schema as a mode-3 tensor."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-53",
"text": "Here we allocate one mode for relations, one for entities appearing as first argument of relations, and the last mode for entities as second argument."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-54",
"text": "This formulation allows the use of tensor factorization approaches that we will describe here."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-55",
"text": "We use e a \u2208 R k to refer to the embedding of an entity a. In cases where the position of the entity requires different embeddings, we use e a,1 and e a,2 to represent its occurrence as first and second argument, respectively."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-57",
"text": "**CANDECOMP/PARAFAC-DECOMPOSITION**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-58",
"text": "In CANDECOMP/PARAFAC-decomposition (Harshman, 1970 ) the data tensor is approximated using a finite sum of rank one tensors, i.e.,"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-59",
"text": "This decomposition was originally introduced without the logistic function, i.e., in its linear form."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-60",
"text": "However since the additional non-linearity is beneficial for factorizing for binary data (Collins et al., 2001; Bouchard et al., 2015) , we use the version above for our relational data."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-61",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-62",
"text": "**TUCKER2 DECOMPOSITION AND RESCAL**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-63",
"text": "CP-decomposition is quite restrictive since it does not take advantage of correlations between multiple entities and relations (Nickel et al., 2012) ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-64",
"text": "A more expressive factorization is Tucker decomposition (Tucker, 1966) , where in its standard formulation, a mode-3 tensor is decomposed into a core tensor and three matrices."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-65",
"text": "However, it is computationally expensive to estimate the core tensor, thus in practice the data tensor is often factorized only along two (instead of three) modes, which is referred to as Tucker2 decomposition."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-66",
"text": "A natural choice for relational data is to keep the relational mode fixed, and thus represent each relation as a k \u00d7 k matrix (e.g. R s for relation s) and entities as k-vectors:"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-67",
"text": "Like PARAFAC, the Tucker2 model was originally introduced in the linear form, however we use the logistic version here."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-68",
"text": "A variant of Tucker2 decomposition that has been applied very successfully in knowledge base completion is RESCAL (Nickel et al., 2012) , where each entity in has a single shared embedding irrespective of its argument position."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-69",
"text": "Although a logistic version of RESCAL has also been introduced by Nickel and Tresp (2013) , we use the linear form since an open-source implementation of the logistic version is not available."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-71",
"text": "**TRANSE**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-72",
"text": "Another formulation that is based on entity representations is the translating embeddings model by Bordes et al. (2013) ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-73",
"text": "The idea is that if a relation s between two entities a and b holds, that relation's vector representation r s should translate the representation e a to the second argument e b , i.e.,"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-74",
"text": "In this work we use a variant of TransE in which different embeddings are learned for an entity for each argument position."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-76",
"text": "**MODEL E**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-77",
"text": "Furthermore, we isolate the entity factorization in Riedel et al. (2013) by viewing it as tensor factorization."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-78",
"text": "In this model, each relation is assigned an embedding for each of its two arguments, i.e.,"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-79",
"text": "Although not explored in isolation by Riedel et al. (2013) , model E can be used on its own to predict relations between entities, even if they have not been observed to be in a relation."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-81",
"text": "**COMBINED TENSOR AND MATRIX FACTORIZATION FOR UNIVERSAL SCHEMA**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-82",
"text": "In the previous section, we provided background on matrix factorization with pairwise factors, followed by a tensor factorization based formulation of universal schema."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-83",
"text": "Although matrix factorization performs well for universal schema (Riedel et al., 2013) , it is not robust to sparse data and does not capture latent entity types that can be crucial for accurate relation extraction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-84",
"text": "On the other hand, although tensor factorization models are able to compactly represent entity types using unary embeddings, they are unable to adequately represent the pair-specific information that is necessary for modeling relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-85",
"text": "It is worth noting that tensor factorization for universal schema has been proposed by , who also observed that tensor factorization by itself performs poorly (even with additional type constraints), and the predictions need to be combined with matrix factorization to be accurate."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-86",
"text": "In this section we will present the fundamental differences between matrix and tensor factorization, and examine a few hybrid models that can address these concerns."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-87",
"text": "Black is a sparsely observed relation between any pair of entities."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-88",
"text": "Red relations correspond to each black edge, and a model that learns this implication can generalize to test instances (red dotted edge)."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-89",
"text": "Green relation exists between white and gray entities (we omit many of these edges for clarity), requiring the model to learn latent entity types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-90",
"text": "Finally, Blue relations exist for pairs where both a black and green relation is observed."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-91",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-92",
"text": "**ILLUSTRATION USING SYNTHETIC RELATIONS**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-93",
"text": "As an illustration of the limitations, we present experiments on a simple, synthetic relation extraction task."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-94",
"text": "The generated data consists of entities that belong to one of two types, and the following four types of relations (see Figure 1 for an example): (a) Black relations that are observed randomly between any two entities (with probability 0.5), (b) Red relations that exist between all pairs for which a Black relation exists, similar to a bornIn relation corresponding to each observed \"X was born in Y\" surface pattern, (c) Green relations that appear between all pairs of entities of different types, and (d) Blue relations that appear between entity pairs that are of different types and a Black relation was observed between them."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-95",
"text": "These Blue relation instances represent the relations that often occur in real-data: an ambiguous surface pattern such as \"X went to Y\" corresponds to schoolAttended relation only if the arguments are of certain types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-96",
"text": "We create such a dataset over 100 entities, and with 5 different sets of such relations (thus 20 total relations, and each entity is assigned 5 of 10 types), and hold out a random 10% of the Red, Green, and Blue relations for evaluation."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-97",
"text": "These relations target the strengths of the factorization representations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-98",
"text": "Red relations, as they directly correlate with observed Black instances, should be trivial for matrix factorization."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-99",
"text": "1 Similarly, Green rela- tions are based on, and clearly define, the latent types of the entities, and thus tensor factorization with entity embeddings should be able to near-perfectly generalize these relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-100",
"text": "The converse is more difficult to anticipate; it is unclear how matrix factorization can represent the types needed for Green relations, or whether tensor factorization can encode the BlackRed correspondence."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-101",
"text": "Further, it is not easy to see how any of these approaches will generalize to the Blue relation."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-102",
"text": "We show the average precision curves on held-out relations for a pairwise embedding approach (matrix factorization F from \u00a72.2) and many of the unary embeddings methods from \u00a72.3, with rank 6 in Figure 2 ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-103",
"text": "As expected, matrix factorization (F) is able to capture the Red relation accurately, however unary embeddings are not able to generalize to it."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-104",
"text": "On the other hand, unary embeddings are able to learn the Green relation which the pairwise approach fail to predict accurately."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-105",
"text": "Blue relations, which most closely model many kinds of relations that occur in text, unfortunately, are not represented well by these approaches that use either unary or pairwise embeddings."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-107",
"text": "**HYBRID FACTORIZATION MODELS**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-108",
"text": "Since matrix and tensor factorization techniques are quite limited in their representations even on the simple, synthetic data, we now turn to hybrid matrix and Figure 3 : Overview of the Models: Some of the models explored in this work, showing pairwise (F) and unary (E) models, along with their combinations (FE and RFE), for computing P (s(a, b) )."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-109",
"text": "tensor factorization models that represent entity types for universal schema."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-110",
"text": "We describe two possible combinations, models FE and RFE, summarized in Figure 3 ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-111",
"text": "Note that these approaches are distinct from collective factorization (Singh and Gordon, 2008) that can be used when extra entity information is available as unary relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-113",
"text": "**COMBINED MODEL (FE)**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-114",
"text": "As the direct combination of a pairwise model (Eq. 1) with an entity model (Eq. 5), we consider the FE model from Riedel et al. (2013) , i.e., the additive combination of the two: P (s(a, b)) = \u03c3(r s \u00b7 e ab + r s,1 \u00b7 e a + r s,2 \u00b7 e b ) (6) Both the matrix factorization model F and entity model E can de defined as special cases of this model, by setting r s,1/2 or r s to zero, respectively."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-116",
"text": "**RECTIFIER MODEL (RFE)**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-117",
"text": "A problem with combining the two models additively, as in FE, is that one model can easily override the other."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-118",
"text": "For instance, even if the type constraints of a relation are violated, a high score by the pairwise model score might still yield a high prediction for that triplet."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-119",
"text": "To alleviate this shortcoming, we experimented with rectifier units (Nair and Hinton, 2010) so that a score of model F or model E first needs to reach a certain threshold to influence the overall prediction for a triplet."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-120",
"text": "Specifically, we use the smooth approximation of a rectifier \u2295(x) = log(1 + e x ) and define the probability for a triplet as follows:"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-122",
"text": "**PARAMETER ESTIMATION**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-123",
"text": "As by Riedel et al. (2013) , we use a Bayesian personalized ranking objective (Rendle et al., 2009 ) to estimate parameters, i.e., for each observed training fact, we sample an unobserved fact for the same relation, and maximize their relative ranking using AdaGrad."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-124",
"text": "For all models we use k = 100 as dimension of latent representations, an initial learning rate of 0.1, and 2 -regularization of all parameters with a weight of 0.01."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-125",
"text": "For CANDECOMP/PARAFAC and RESCAL we use the open-source scikit-tensor 2 package with default hyper-parameters."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-127",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-128",
"text": "In order to evaluate whether the hybrid models are able to effectively combine the benefits of matrix and tensor factorization, we first present experiments on synthetic data in Section 4.1."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-129",
"text": "For a more realworld evaluation, we also experiment with universal schema for distantly-supervised relation extraction in Section 4.2."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-131",
"text": "**SYNTHETIC RGB RELATIONS**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-132",
"text": "In Section 3.1 we described a simple synthetic data set consisting of multiple Red, Green, and Blue relations constructed in order to illustrate the restrictions in the representation capabilities of matrix and tensor factorization models."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-133",
"text": "Here we revisit the dataset using the proposed combined tensor and matrix factorization approaches to evaluate whether these hybrid models are able to compete with tensor and matrix factorization on the relations they are good at (Green and Red, respectively), but more importantly, whether the combined approaches can represent the Blue rela- Figure 4: Hybrid Methods on RGB Data: Average precision as the rank is varied."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-134",
"text": "FE and RFE perform as well (or better than) tensor factorization on Green and matrix factorization on Red, but importantly, are able to encode the Blue relations that matrix or tensor factorization fail to model."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-135",
"text": "tions that matrix and tensor factorization approaches fail to generalize to."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-136",
"text": "In Figure 4 we present the average precision on the held-out data as the rank is varied for a number of approaches (we omit the remaining tensor factorization approaches for clarity since they perform similar to RESCAL and Model E)."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-137",
"text": "On the Red relation (Figure 4a ), tensor factorization is close to random, while combined factorization approaches (FE and RFE) are competitive to, and often outperform, matrix factorization (F)."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-138",
"text": "Similarly, on the Green relation (Figure 4b ), the combined approaches perform as well as tensor factorization, while matrix factorization is not much better than random."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-139",
"text": "Finally, on the Blue relation on which matrix and tensor factorization fare poorly, the combined approaches are able to obtain high accuracy, in particular achieve close to 90% average precision with only a rank of 5."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-142",
"text": "Although the same rank corresponds to different numbers of parameters for each method, the trend clearly indicates these results do not depend significantly on the number of parameters."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-144",
"text": "**UNIVERSAL SCHEMA RELATION EXTRACTION**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-145",
"text": "With the promising results shown on synthetic data, we now turn to evaluation on real-world information extraction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-146",
"text": "In particular, we evaluate the models on universal schema for distantly-supervised relation extraction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-147",
"text": "Following the experiment setup of Riedel et al. (2013) , we instantiate the universal schema matrix over entity pairs and text/Freebase relations for New York Times data, and compare the performance using average precision of the presented models."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-148",
"text": "Table 1 summarizes the performance of our models, as compared to existing approaches (see Riedel et al. (2013) for an overview)."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-149",
"text": "In particular, TR-R13 takes the output predictions of matrix factorization, and combines it with an entity-type aware RESCAL model ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-150",
"text": "3 Tensor factorization approaches perform poorly on this data."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-151",
"text": "We present results for Model E, but other formulations such as PARAFAC, TransE, RESCAL, and Tucker2 achieved even lower accuracy; this is consistent with the results in ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-152",
"text": "Models that use the matrix factorization (F, FE, R13-F and RFE) are significantly better, but more importantly, the hybrid appraoch FE achieves the highest accuracy."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-153",
"text": "It is unclear why RFE fails to provide similar gains, in particular, performing slightly worse than matrix factorization."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-154",
"text": "Note that we are not introducing a new state-of-art here, the neighborhood model (NF) that achieves a higher accuracy is omitted for clarity."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-155",
"text": "Table 2 : Nearest-Neighbors for a few randomlyselected entities based on their embeddings, demonstrating that similar entities are close to each other."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-157",
"text": "**ENTITY EMBEDDINGS AND TYPES**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-158",
"text": "Although the focus of this work is relation extraction, and the models are trained primarily for finding relations, in this section we explore the learned entity embeddings."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-159",
"text": "The low-dimensional entity embeddings have been trained to predict the binary relations that the entity participates in, and thus we expect entities that participate in similar relations to have similar embeddings."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-160",
"text": "To investigate whether the embeddings capture this intuition, we compute similarities of a few randomly selected entities with every other entity using the cosine distance of the FE entity embeddings, and show the 10 nearest neighbors in Table 2 ."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-161",
"text": "The nearest neighbors definitely capture the entity types, for example all the neighbors of \"La Stampa\" are newspapers in other parts of the world, which is quite impressive considering no explicit type information was available during training."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-162",
"text": "However, the granularity of the types depends on the textual patterns and relations in the schema; for \"LG Electronics\", the neighbors are mostly generic commercial institutions, perhaps because the observed surface patterns are similar across these types of organizations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-163",
"text": "Since the embeddings enable us to compute the similarity between any two entities, we also present a 2D visualization of the entities in the data using the t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) technique for dimensionality reduction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-164",
"text": "Further, to investigate whether the embeddings represent correct entity types, we perform an automatic, error-prone alignment of the entity strings to Freebase by finding a prominent entity that has the string as its name, and extract its types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-165",
"text": "Figure 5 shows the projection for 10 000 randomly selected entities, colored as per their type."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-166",
"text": "We see that the entity embeddings are able to separate most of the coarse level types, as locations are clustered quite separately from the organizations and people, but further, even fine-grained person types occur as distinct collections, for example politicians and sportsmen."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-167",
"text": "There is some cluster overlap as well, especially between the different person types such as authors, actors/musicians, and politicians; it is unclear whether this arises due to incorrect entity linking, inexact two-dimensional projection, entities that belong to multiple types, or from inaccurate embeddings caused by insufficient data."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-169",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-170",
"text": "Although tensor factorization has been widely used for knowledge-base completion for structured data, it performs poorly on universal schema for relation extraction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-171",
"text": "Matrix factorization, on the other hand, is appropriate for the task as it is able to compactly represent the correlations between surface pattern and structured KB relations, however learning pairwise factors is not effective for entity pairs with sparse observations or for identifying latent entity types."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-172",
"text": "We illustrate the differences between these matrix and tensor factorization using simple relations, and further, construct an additional relation that none of these approaches are able to model."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-173",
"text": "Motivated by this need for combining their complementary benefits, we explore two hybrid matrix and tensor factorization approaches."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-174",
"text": "Along with being able to model our constructed relations, these approaches also provided improvements on real-world relation extraction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-175",
"text": "We further provide qualitative exploration of the entity embedding vectors, showing that the embeddings learn fine-grained entity types from relational data."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-176",
"text": "Our investigations suggest a number of possible avenues for future work."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-177",
"text": "Foremost, we would like to investigate why the hybrid models, which perform significantly better on synthetic data, fail to achieve similar gains on real-world relations."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-178",
"text": "Second, including tensor factorization in the universal schema model enables us to augment the model with external entity information such as observed unary patterns and Freebase types, in order to aid both relation extraction and entity type prediction."
},
{
"sent_id": "7d80c3cc15453ddeaea72dcb9c04f9-C001-179",
"text": "Lastly, these hybrid approaches also enable extension of universal schema directly to n-ary relations, allowing a variety of models based on the choice of matrix or tensor representation for each relation."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-12"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-31"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-37"
]
],
"cite_sentences": [
"7d80c3cc15453ddeaea72dcb9c04f9-C001-12",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-31",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-37"
]
},
"@EXT@": {
"gold_contexts": [
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-77"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-79"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-114",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-117",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-119"
]
],
"cite_sentences": [
"7d80c3cc15453ddeaea72dcb9c04f9-C001-77",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-79",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-114"
]
},
"@MOT@": {
"gold_contexts": [
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-83"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-114",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-117"
]
],
"cite_sentences": [
"7d80c3cc15453ddeaea72dcb9c04f9-C001-83",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-114"
]
},
"@USE@": {
"gold_contexts": [
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-114"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-123"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-147"
],
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-148"
]
],
"cite_sentences": [
"7d80c3cc15453ddeaea72dcb9c04f9-C001-114",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-123",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-147",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-148"
]
},
"@DIF@": {
"gold_contexts": [
[
"7d80c3cc15453ddeaea72dcb9c04f9-C001-149",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-150",
"7d80c3cc15453ddeaea72dcb9c04f9-C001-151"
]
],
"cite_sentences": []
}
}
},
"ABC_a557739447131395f7a76d87a4cd19_9": {
"x": [
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-2",
"text": "Stance detection is a classification problem in natural language processing where for a text and target pair, a class result from the set {Favor, Against, Neither} is expected."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-3",
"text": "It is similar to the sentiment analysis problem but instead of the sentiment of the text author, the stance expressed for a particular target is investigated in stance detection."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-4",
"text": "In this paper, we present a stance detection tweet data set for Turkish comprising stance annotations of these tweets for two popular sports clubs as targets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-5",
"text": "Additionally, we provide the evaluation results of SVM classifiers for each target on this data set, where the classifiers use unigram, bigram, and hashtag features."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-6",
"text": "This study is significant as it presents one of the initial stance detection data sets proposed so far and the first one for Turkish language, to the best of our knowledge."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-7",
"text": "The data set and the evaluation results of the corresponding SVM-based approaches will form plausible baselines for the comparison of future studies on stance detection."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-10",
"text": "Stance detection (also called stance identification or stance classification) is one of the considerably recent research topics in natural language processing (NLP)."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-11",
"text": "It is usually defined as a classification problem where for a text and target pair, the stance of the author of the text for that target is expected as a classification output from the set: {Favor, Against, Neither} [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-12",
"text": "SIDEWAYS '17, Prague, Czech Republic Copyright held by the author(s)."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-13",
"text": "Stance detection is usually considered as a subtask of sentiment analysis (opinion mining) [13] topic in NLP."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-14",
"text": "Both are mostly performed on social media texts, particularly on tweets, hence both are important components of social media analysis."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-15",
"text": "Nevertheless, in sentiment analysis, the sentiment of the author of a piece of text usually as Positive, Negative, and Neutral is explored while in stance detection, the stance of the author of the text for a particular target (an entity, event, etc.) either explicitly or implicitly referred to in the text is considered."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-16",
"text": "Like sentiment analysis, stance detection systems can be valuable components of information retrieval and other text analysis systems [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-17",
"text": "Previous work on stance detection include [16] where a stance classifier based on sentiment and arguing features is proposed in addition to an arguing lexicon automatically compiled."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-18",
"text": "The ultimate approach performs better than distribution-based and unigram-based baseline systems [16] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-19",
"text": "In [17] , the authors show that the use of dialogue structure improves stance detection in on-line debates."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-20",
"text": "In [7] , Hasan and Ng carry out stance detection experiments using different machine learning algorithms, training data sets, features, and inter-post constraints in on-line debates, and draw insightful conclusions based on these experiments."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-21",
"text": "For instance, they find that sequence models like HMMs perform better at stance detection when compared with non-sequence models like Naive Bayes (NB) [7] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-22",
"text": "In another related study [10] , the authors conclude that topic-independent features can be exploited for disagreement detection in on-line dialogues."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-23",
"text": "The employed features include agreement, cue words, denial, hedges, duration, polarity, and punctuation [10] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-24",
"text": "Stance detection on a corpus of student essays is considered in [5] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-25",
"text": "After using linguistically-motivated feature sets together with multivalued NB and SVM as the learning models, the authors conclude that they outperform two baseline approaches [5] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-26",
"text": "In [4] , the author claims that Wikipedia can be used to determine stances about controversial topics based on their previous work regarding controversy extraction on the Web."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-27",
"text": "Among more recent related work, in [1] stance detection for unseen targets is studied and bidirectional conditional encoding is employed."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-28",
"text": "The authors state that their approach achieves state-ofthe art performance rates [1] on SemEval 2016 Twitter Stance Detection corpus [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-29",
"text": "In [3] , a stance-community detection approach called SCIFNET is proposed."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-30",
"text": "SCIFNET creates networks of people who are stance targets, automatically from the related document collections [3] using stance expansion and refinement techniques to arrive at stance-coherent networks."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-31",
"text": "A tweet data set annotated with stance information regarding six predefined targets is proposed in [11] where this data set is annotated through crowdsourcing."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-32",
"text": "The authors indicate that the data set is also annotated with sentiment information in addition to stance, so it can help reveal SIDEWAYS'17, July 2017, Prague, Czech Republic D. K\u00fc\u00e7\u00fck associations between stance and sentiment [11] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-33",
"text": "Lastly, in [12] , Se-mEval 2016's aforementioned shared task on Twitter Stance Detection is described."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-34",
"text": "Also provided are the results of the evaluations of 19 systems participating in two subtasks (one with training data set provided and the other without an annotated data set) of the shared task [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-35",
"text": "In this paper, we present a tweet data set in Turkish annotated with stance information, where the corresponding annotations are made publicly available."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-36",
"text": "The domain of the tweets comprises two popular football clubs which constitute the targets of the tweets included."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-37",
"text": "We also provide the evaluation results of SVM classifiers (for each target) on this data set using unigram, bigram, and hashtag features."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-38",
"text": "To the best of our knowledge, the current study is the first one to target at stance detection in Turkish tweets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-39",
"text": "Together with the provided annotated data set and the corresponding evaluations with the aforementioned SVM classifiers which can be used as baseline systems, our study will hopefully help increase social media analysis studies on Turkish content."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-40",
"text": "The rest of the paper is organized as follows: In Section 2, we describe our tweet data set annotated with the target and stance information."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-41",
"text": "Section 3 includes the details of our SVM-based stance classifiers and their evaluation results with discussions."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-42",
"text": "Section 4 includes future research topics based on the current study, and finally Section 5 concludes the paper with a summary."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-44",
"text": "**A STANCE DETECTION DATA SET**"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-45",
"text": "We have decided to consider tweets about popular sports clubs as our domain for stance detection."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-46",
"text": "Considerable amounts of tweets are being published for sports-related events at every instant."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-47",
"text": "Hence we have determined our targets as Galatasaray (namely Target-1) and Fenerbah\u00e7e (namely, Target-2) which are two of the most popular football clubs in Turkey."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-48",
"text": "As is the case for the sentiment analysis tools, the outputs of the stance detection systems on a stream of tweets about these clubs can facilitate the use of the opinions of the football followers by these clubs."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-49",
"text": "In a previous study on the identification of public health-related tweets, two tweet data sets in Turkish (each set containing 1 million random tweets) have been compiled where these sets belong to two different periods of 20 consecutive days [9] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-50",
"text": "We have decided to use one of these sets (corresponding to the period between August 18 and September 6, 2015) and firstly filtered the tweets using the possible names used to refer to the target clubs."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-51",
"text": "Then, we have annotated the stance information in the tweets for these targets as Favor or Against."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-52",
"text": "Within the course of this study, we have not considered those tweets in which the target is not explicitly mentioned, as our initial filtering process reveals."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-53",
"text": "For the purposes of the current study, we have not annotated any tweets with the Neither class."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-54",
"text": "This stance class and even finergrained classes can be considered in further annotation studies."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-55",
"text": "We should also note that in a few tweets, the target of the stance was the management of the club while in some others a particular footballer of the club is praised or criticised."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-56",
"text": "Still, we have considered the club as the target of the stance in all of the cases and carried out our annotations accordingly."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-57",
"text": "At the end of the annotation process, we have annotated 700 tweets, where 175 tweets are in favor of and 175 tweets are against Target-1, and similarly 175 tweets are in favor of and 175 are against Target-2."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-58",
"text": "Hence, our data set is a balanced one although it is currently limited in size."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-59",
"text": "The corresponding stance annotations are made publicly available at http://ceng.metu.edu.tr/\u223ce120329/ Turkish_Stance_Detection_Tweet_Dataset.csv in Comma Separated Values (CSV) format."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-60",
"text": "The file contains three columns with the corresponding headers."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-61",
"text": "The first column is the tweet id of the corresponding tweet, the second column contains the name of the stance target, and the last column includes the stance of the tweet for the target as Favor or Against."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-62",
"text": "To the best of our knowledge, this is the first publicly-available stance-annotated data set for Turkish."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-63",
"text": "Hence, it is a significant resource as there is a scarcity of annotated data sets, linguistic resources, and NLP tools available for Turkish."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-64",
"text": "Additionally, to the best of our knowledge, it is also significant for being the first stance-annotated data set including sports-related tweets, as previous stance detection data sets mostly include on-line texts on political/ethical issues."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-66",
"text": "**STANCE DETECTION EXPERIMENTS USING SVM CLASSIFIERS**"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-67",
"text": "It is emphasized in the related literature that unigram-based methods are reliable for the stance detection task [16] and similarly unigram-based models have been used as baseline models in studies such as [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-68",
"text": "In order to be used as a baseline and reference system for further studies on stance detection in Turkish tweets, we have trained two SVM classifiers (one for each target) using unigrams as features."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-69",
"text": "Before the extraction of unigrams, we have employed automated preprocessing to filter out the stopwords in our annotated data set of 700 tweets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-70",
"text": "The stopword list used is the list presented in [8] which, in turn, is the slightly extended version of the stopword list provided in [2] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-71",
"text": "We have used the SVM implementation available in the Weka data mining application [6] where this particular implementation employs the SMO algorithm [14] to train a classifier with a linear kernel."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-72",
"text": "The 10-fold cross-validation results of the two classifiers are provided in Table 1 using the metrics of precision, recall, and F-Measure."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-73",
"text": "The evaluation results are quite favorable for both targets and particularly higher for Target-1, considering the fact that they are the initial experiments on the data set."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-74",
"text": "The performance of the classifiers is better for the Favor class for both targets when compared with the performance results for the Against class."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-75",
"text": "This outcome may be due to the common use of some terms when expressing positive stance towards sports clubs in Turkish tweets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-76",
"text": "The same percentage of common terms may not have been observed in tweets during the expression of negative stances towards the targets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-77",
"text": "Yet, completely the opposite pattern is observed in stance detection results of baseline systems given in [12] , i.e., better F-Measure rates have been obtained for the Against class when compared with the Favor class [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-78",
"text": "Some of the baseline systems reported in [12] are SVM-based systems using unigrams and ngrams as features similar to our study, but their data sets include all three stance classes of Favor, Against, and Neither, while our data set comprises only tweets classified as belonging to Favor or Against classes."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-79",
"text": "Another difference is that the data sets in [12] have been divided into training and test sets, while in our study we provide 10-fold cross-validation results on the whole data set."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-80",
"text": "On the other hand, we should also note that SVM-based sentiment analysis systems (such as those given in [15] ) have been reported to achieve better F-Measure rates for the Positive sentiment class when compared with the results obtained for the Negative class."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-81",
"text": "Therefore, our evaluation results for each stance class seem to be in line with such sentiment analysis systems."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-82",
"text": "Yet, further experiments on the extended versions of our data set should be conducted and the results should again be compared with the stance detection results given in the literature."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-83",
"text": "We have also evaluated SVM classifiers which use only bigrams as features, as ngram-based classifiers have been reported to perform better for the stance detection problem [12] ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-84",
"text": "However, we have observed that using bigrams as the sole features of the SVM classifiers leads to quite poor results."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-85",
"text": "This observation may be due to the relatively limited size of the tweet data set employed."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-86",
"text": "Still, we can conclude that unigram-based features lead to superior results compared to the results obtained using bigrams as features, based on our experiments on our data set."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-87",
"text": "Yet, ngram-based features may be employed on the extended versions of the data set to verify this conclusion within the course of future work."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-88",
"text": "With an intention to exploit the contribution of hashtag use to stance detection, we have also used the existence of hashtags in tweets as an additional feature to unigrams."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-89",
"text": "The corresponding evaluation results of the SVM classifiers using unigrams together the existence of hashtags as features are provided in Table 2 ."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-90",
"text": "When the results given in Table 2 are compared with the results in Table 1 , a slight decrease in F-Measure (0.5%) for Target-1 is observed, while the overall F-Measure value for Target-2 has increased by 1.8%."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-91",
"text": "Although we could not derive sound conclusions mainly due to the relatively small size of our data set, the increase in the performance of the SVM classifier Target-2 is an encouraging evidence for the exploitation of hashtags in a stance detection system."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-92",
"text": "We leave other ways of exploiting hashtags for stance detection as a future work."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-93",
"text": "To sum up, our evaluation results are significant as reference results to be used for comparison purposes and provides evidence for the utility of unigram-based and hashtag-related features in SVM classifiers for the stance detection problem in Turkish tweets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-95",
"text": "**FUTURE PROSPECTS**"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-96",
"text": "Future work based on the current study includes the following:"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-97",
"text": "\u2022 The presented stance-annotated data set for Turkish has been created by one annotator only (the author of this study), yet, the data set should better be revised and extended through crowdsourcing facilities."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-98",
"text": "When employing such a procedure, other stance classes like Neither can be considered as well."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-99",
"text": "The procedure will improve the quality the data set as well as the quality of prospective systems to be trained and tested on it."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-100",
"text": "\u2022 Other features like emoticons (as commonly used for sentiment analysis), features based on hashtags, and ngram features can also be used by the classifiers and these classifiers can be tested on larger data sets."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-101",
"text": "Other classification approaches could also be implemented and tested against our baseline classifiers."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-102",
"text": "Particularly, related methods presented in recent studies such as [12] can be tested on our data set."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-103",
"text": "\u2022 Lastly, the SVM classifiers utilized in this study and their prospective versions utilizing other features can be tested on stance data sets in other languages (such as English) for comparison purposes."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-104",
"text": "----------------------------------"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-105",
"text": "**CONCLUSION**"
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-106",
"text": "Stance detection is a considerably new research area in natural language processing and is considered within the scope of the wellstudied topic of sentiment analysis."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-107",
"text": "It is the detection of stance within text towards a target which may be explicitly specified in the text or not."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-108",
"text": "In this study, we present a stance-annotated tweet data set in Turkish where the targets of the annotated stances are two popular sports clubs in Turkey."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-109",
"text": "The corresponding annotations are made publicly-available for research purposes."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-110",
"text": "To the best of our knowledge, this is the first stance detection data set for the Turkish language and also the first sports-related stance-annotated data set."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-111",
"text": "Also presented in this study are SVM classifiers (one for each target) utilizing unigram and bigram features in addition to using the existence of hashtags as another feature."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-112",
"text": "10-fold cross validation results of these classifiers are presented which can be used as reference results by prospective systems."
},
{
"sent_id": "a557739447131395f7a76d87a4cd19-C001-113",
"text": "Both the annotated data set and the classifiers with evaluations are significant since they are the initial contributions to stance detection problem in Turkish tweets."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"a557739447131395f7a76d87a4cd19-C001-10",
"a557739447131395f7a76d87a4cd19-C001-11"
],
[
"a557739447131395f7a76d87a4cd19-C001-16"
],
[
"a557739447131395f7a76d87a4cd19-C001-28"
],
[
"a557739447131395f7a76d87a4cd19-C001-33"
],
[
"a557739447131395f7a76d87a4cd19-C001-34"
],
[
"a557739447131395f7a76d87a4cd19-C001-67"
],
[
"a557739447131395f7a76d87a4cd19-C001-83"
]
],
"cite_sentences": [
"a557739447131395f7a76d87a4cd19-C001-11",
"a557739447131395f7a76d87a4cd19-C001-16",
"a557739447131395f7a76d87a4cd19-C001-28",
"a557739447131395f7a76d87a4cd19-C001-33",
"a557739447131395f7a76d87a4cd19-C001-34",
"a557739447131395f7a76d87a4cd19-C001-67",
"a557739447131395f7a76d87a4cd19-C001-83"
]
},
"@DIF@": {
"gold_contexts": [
[
"a557739447131395f7a76d87a4cd19-C001-72",
"a557739447131395f7a76d87a4cd19-C001-73",
"a557739447131395f7a76d87a4cd19-C001-77"
],
[
"a557739447131395f7a76d87a4cd19-C001-78"
],
[
"a557739447131395f7a76d87a4cd19-C001-79"
]
],
"cite_sentences": [
"a557739447131395f7a76d87a4cd19-C001-77",
"a557739447131395f7a76d87a4cd19-C001-78",
"a557739447131395f7a76d87a4cd19-C001-79"
]
},
"@FUT@": {
"gold_contexts": [
[
"a557739447131395f7a76d87a4cd19-C001-102"
]
],
"cite_sentences": [
"a557739447131395f7a76d87a4cd19-C001-102"
]
}
}
},
"ABC_b8244f9337456f1f90a576b2398680_9": {
"x": [
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-76",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-77",
"text": "We compared each of the patterns models described in Section 2 using an unsupervised IE experiment similar to one described by Sudo et al. (2003) ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-102",
"text": "**PATTERN GENERATION**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-78",
"text": "Let D be a corpus of documents and R a set of documents which are relevant to a particular extraction task."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-79",
"text": "In this context \"relevant\" means that the document contains the information we are interested in identifying."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-103",
"text": "The texts used for these experiments were parsed using the Stanford dependency parser (Klein and Manning, 2002) ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-80",
"text": "D and R are such that D = R \u222aR and R\u2229R = \u2205. As assumption behind this approach is that useful patterns will be far more likely to occur in R than D overall."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-2",
"text": "Several recent approaches to Information Extraction (IE) have used dependency trees as the basis for an extraction pattern representation."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-3",
"text": "These approaches have used a variety of pattern models (schemes which define the parts of the dependency tree which can be used to form extraction patterns)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-4",
"text": "Previous comparisons of these pattern models are limited by the fact that they have used indirect tasks to evaluate each model."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-5",
"text": "This limitation is addressed here in an experiment which compares four pattern models using an unsupervised learning algorithm and a standard IE scenario."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-6",
"text": "It is found that there is a wide variation between the models' performance and suggests that one model is the most useful for IE."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-9",
"text": "A common approach to Information Extraction (IE) is to (manually or automatically) create a set of patterns which match against text to identify information of interest."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-82",
"text": "**RANKING PATTERNS**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-83",
"text": "Patterns for each model are ranked using a technique inspired by the tf-idf scoring commonly used in Information Retrieval (Manning and Sch\u00fctze, 1999) ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-84",
"text": "The score for each pattern, p, is given by:"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-85",
"text": "where tf p is the number of times pattern p appears in relevant documents, N is the total number of documents in the corpus and df p the number of documents in the collection containing the pattern p."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-86",
"text": "Equation 1 combines two factors: the term frequency (in relevant documents) and inverse document frequency (across the corpus)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-87",
"text": "Patterns which occur frequently in relevant documents without being too prevalent in the corpus are preferred."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-88",
"text": "Sudo et al. (2003) found that it was important to find the appropriate balance between these two factors."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-89",
"text": "They introduced the \u03b2 parameter as a way of controlling the relative contribution of the inverse document frequency."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-90",
"text": "\u03b2 is tuned for each extraction task and pattern model combination."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-91",
"text": "Although simple, this approach has the advantage that it can be applied to each of the four pattern models to provide a direct comparison."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-93",
"text": "**EXTRACTION SCENARIO**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-94",
"text": "The ranking process was applied to the IE scenario used for the sixth Message Understanding conference (MUC-6)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-95",
"text": "The aim of this task was to identify management succession events from a corpus of newswire texts."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-96",
"text": "Relevant information describes an executive entering or leaving a position within a company, for example \"Last month Smith resigned as CEO of Rooter Ltd.\"."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-97",
"text": "This sentence described as event involving three items: a person (Smith), position (CEO) and company (Rooter Ltd)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-98",
"text": "We made use of a version of the MUC-6 corpus described by Soderland (1999) which consists of 598 documents."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-99",
"text": "For these experiments relevant documents were identified using annotations in the corpus."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-100",
"text": "However, this is not necessary since Sudo et al. (2003) showed that adequate knowledge about document relevance could be obtained automatically using an IR system."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-10",
"text": "Muslea (1999) reviewed the approaches which were used at the time and found that the most common techniques relied on lexicosyntactic patterns being applied to text which has undergone relatively shallow linguistic processing."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-11",
"text": "For example, the extraction rules used by Soderland (1999) and Riloff (1996) match text in which syntactic chunks have been identified."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-12",
"text": "More recently researchers have begun to employ deeper syntactic analysis, such as dependency parsing (Yangarber et al., 2000; Sudo et al., 2001; Sudo et al., 2003; Yangarber, 2003) ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-13",
"text": "In these approaches extraction patterns are essentially parts of the dependency tree."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-14",
"text": "To perform extraction they are compared against the dependency analysis of a sentence to determine whether it contains the pattern."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-15",
"text": "Each of these approaches relies on a pattern model to define which parts of the dependency tree can be used to form the extraction patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-16",
"text": "A variety of pattern models have been proposed."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-17",
"text": "For example the patterns used by Yangarber et al. (2000) are the subject-verb-object tuples from the dependency tree (the remainder of the dependency parse is discarded) while Sudo et al. (2003) allow any subtree within the dependency parse to act as an extraction pattern."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-18",
"text": "Stevenson and Greenwood (2006) showed that the choice of pattern model has important implications for IE algorithms including significant differences between the various models in terms of their ability to identify information of interest in text."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-19",
"text": "However, there has been little comparison between the various pattern models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-20",
"text": "Those which have been carried out have been limited by the fact that they used indirect tasks to evaluate the various models and did not compare them in an IE scenario."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-21",
"text": "We address this limitation here by presenting a direct comparison of four previously described pattern models using an unsupervised learning method applied to a commonly used IE scenario."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-22",
"text": "The remainder of the paper is organised as follows."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-23",
"text": "The next section presents four pattern models which have been previously introduced in the literature."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-24",
"text": "Section 3 describes two previous studies which compared these models and their limitations."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-25",
"text": "Section 4 describes an experiment which compares the four models on an IE task, the results of which are described in Section 5."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-26",
"text": "Finally, Section 6 discusses the conclusions which may be drawn from this work."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-28",
"text": "**IE PATTERN MODELS**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-29",
"text": "In dependency analysis (Mel'\u010duk, 1987 ) the syntax of a sentence is represented by a set of directed binary links between a word (the head) and one of its modifiers."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-30",
"text": "These links may be labelled to indicate the relation between the head and modifier (e.g. subject, object)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-31",
"text": "An example dependency analysis for the sentence \"Acme hired Smith as their new CEO, replacing Bloggs.\" is shown in Figure 1. The remainder of this section outlines four models for representing extraction patterns which can be derived from dependency trees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-32",
"text": "Predicate-Argument Model (SVO): A simple approach, used by Yangarber et al. (2000) and Yangarber (2003), is to use subject-verb-object tuples from the dependency parse as extraction patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-33",
"text": "These consist of a verb and its subject and/or direct object."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-34",
"text": "Figure 2 shows the two SVO patterns 1 which are produced for the dependency tree shown in Figure 1 ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-35",
"text": "This model can identify information which is expressed using simple predicate-argument constructions such as the relation between Acme and Smith in the dependency tree shown in Figure 1."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-36",
"text": "(Footnote 1: The formalism used for representing dependency patterns is similar to the one introduced by Sudo et al. (2003)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-37",
"text": "Each node in the tree is represented in the format a[b/c] (e.g. subj[N/Acme]) where c is the lexical item (Acme), b its grammatical tag (N) and a the dependency relation between this node and its parent (subj)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-38",
"text": "The relationship between nodes is represented as X(A+B+C), which indicates that nodes A, B and C are direct descendants of node X.)"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-39",
"text": "However, the SVO model cannot represent information described using other linguistic constructions such as nominalisations or prepositional phrases."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-40",
"text": "For example the SVO model would not be able to recognise that Smith's new job title is CEO since these patterns ignore the part of the dependency tree containing that information."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-41",
"text": "Chains: A pattern is defined as a path between a verb node and any other node in the dependency tree passing through zero or more intermediate nodes (Sudo et al., 2001) ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-42",
"text": "Figure 2 shows examples of the chains which can be extracted from the tree in Figure 1 ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-43",
"text": "Chains provide a mechanism for encoding information beyond the direct arguments of predicates and include areas of the dependency tree ignored by the SVO model."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-44",
"text": "For example, they can represent information expressed as a nominalisation or within a prepositional phrase, e.g. \"The resignation of Smith from the board of Acme ...\" However, a potential shortcoming of this model is that it cannot represent the link between arguments of a verb."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-45",
"text": "Patterns in the chain model format are unable to represent even the simplest of sentences containing a transitive verb, e.g. \"Smith left Acme\"."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-46",
"text": "Linked Chains: The linked chains model represents extraction patterns as a pair of chains which share the same verb but no direct descendants."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-47",
"text": "Example linked chains are shown in Figure 2 ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-48",
"text": "This pattern representation encodes most of the information in the sentence with the advantage of being able to link together event participants which neither of the SVO or chain model can, for example the relation between \"Smith\" and \"Bloggs\" in Figure 1 ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-49",
"text": "Subtrees: The final model to be considered is the subtree model (Sudo et al., 2003) ."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-50",
"text": "In this model any subtree of a dependency tree can be used as an extraction pattern, where a subtree is any set of nodes in the tree which are connected to one another."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-51",
"text": "Single nodes are not considered to be subtrees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-52",
"text": "The subtree model is a richer representation than those discussed so far and can represent any part of a dependency tree."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-53",
"text": "Each of the previous models forms a proper subset of the subtree model."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-54",
"text": "By choosing an appropriate subtree it is possible to link together any pair of nodes in a tree and consequently this model can represent the relation between any pair of items in a sentence."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-56",
"text": "**PREVIOUS COMPARISONS**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-57",
"text": "There have been few direct comparisons of the various pattern models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-58",
"text": "Sudo et al. (2003) compared three models (SVO, chains and subtrees) on two IE scenarios using an entity extraction task."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-59",
"text": "Models were evaluated in terms of their ability to identify entities taking part in events and distinguish them from those which did not."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-60",
"text": "They found the SVO model performed poorly in comparison with the other two models and that the performance of the subtree model was generally the same as, or better than, the chain model."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-61",
"text": "However, they did not attempt to determine whether the models could identify the relations between these entities, simply whether they could identify the entities participating in relevant events."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-62",
"text": "Stevenson and Greenwood (2006) compared the four pattern models described in Section 2 in terms of their complexity and ability to represent relations found in text."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-63",
"text": "The complexity of each model was analysed in terms of the number of patterns which would be generated from a given dependency parse."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-64",
"text": "This is important since several of the algorithms which have been proposed to make use of dependency-based IE patterns use iterative learning (e.g. Yangarber et al., 2000; Yangarber, 2003) and are unlikely to cope with very large sets of candidate patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-65",
"text": "The number of patterns generated therefore has an effect on how practical computations using that model may be."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-66",
"text": "It was found that the number of patterns generated for the SVO model is a linear function of the size of the dependency tree."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-67",
"text": "The number of chains and linked chains is a polynomial function while the number of subtrees is exponential."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-68",
"text": "Stevenson and Greenwood (2006) also analysed the representational power of each model by measuring how many of the relations found in a standard IE corpus they are expressive enough to represent."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-69",
"text": "(The documents used were taken from newswire texts and biomedical journal articles.) They found that the SVO and chain model could only represent a small proportion of the relations in the corpora."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-70",
"text": "The subtree model could represent more of the relations than any other model, but there was no statistically significant difference between its coverage and that of the linked chain model."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-71",
"text": "They concluded that the linked chain model was optimal since it is expressive enough to represent the information of interest without introducing a potentially unwieldy number of patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-72",
"text": "There is some agreement between these two studies, for example that the SVO model performs poorly in comparison with other models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-73",
"text": "However, Stevenson and Greenwood (2006) also found that the coverage of the chain model was significantly worse than the subtree model, although Sudo et al. (2003) found that in some cases their performance could not be distinguished."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-74",
"text": "In addition to these disagreements, these studies are also limited by the fact that they are indirect; they do not evaluate the various pattern models on an IE task."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-104",
"text": "The dependency trees were processed to replace the names of entities belonging to specific semantic classes with a general token."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-105",
"text": "Three of these classes were used for the management succession domain (PERSON, ORGANISATION and POST)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-106",
"text": "For example, in the dependency analysis of \"Smith will become CEO next year\", \"Smith\" is replaced by PERSON and \"CEO\" by POST."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-107",
"text": "This process allows more general patterns to be extracted from the dependency trees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-108",
"text": "For example, [V/become](subj[N/PERSON]+obj[N/POST])."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-111",
"text": "In the MUC-6 corpus items belonging to the relevant semantic classes are already identified."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-112",
"text": "Patterns for each of the four models were extracted from the processed dependency trees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-113",
"text": "For the SVO, chain and linked chain models this was achieved using depth-first search."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-114",
"text": "However, the enumeration of all subtrees is less straightforward and has been shown to be a #P-complete problem (Goldberg and Jerrum, 2000)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-115",
"text": "We made use of the rightmost extension algorithm (Abe et al., 2002; Zaki, 2002) which is an efficient way of enumerating all subtrees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-116",
"text": "This approach constructs subtrees iteratively by combining together subtrees which have already been observed."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-117",
"text": "The algorithm starts with a set of trees, each of which consists of a single node."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-118",
"text": "At each stage the known trees are extended by the addition of a single node."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-119",
"text": "In order to avoid duplication the extension is restricted to allowing nodes only to be added to the nodes on the rightmost path of the tree."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-120",
"text": "Applying the process recursively creates a search space in which all subtrees are enumerated with minimal duplication."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-121",
"text": "The rightmost extension algorithm is most suited to finding subtrees which occur multiple times and, even using this efficient approach, we were unable to generate subtrees which occurred fewer than four times in the MUC-6 texts in a reasonable time."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-122",
"text": "Similar restrictions have been encountered within other approaches which have relied on the generation of a comprehensive set of subtrees from a parse forest."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-123",
"text": "For example, Kudo et al. (2005) used subtrees for parse ranking but could only generate subtrees which appear at least ten times in a 40,000 sentence corpus."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-124",
"text": "They comment that the size of their data set meant that it would have been difficult to complete the experiments with less restrictive parameters."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-125",
"text": "In addition, Sudo et al. (2003) only generated subtrees which appeared in at least three documents."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-126",
"text": "Kudo et al. (2005) and Sudo et al. (2003) both used the rightmost extension algorithm to generate subtrees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-149",
"text": "The recall is the proportion of relations in the gold standard data which are identified by the set of patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-127",
"text": "To provide a direct comparison of the pattern models we also produced versions of the sets of patterns extracted for the SVO, chain and linked chain models in which patterns which occurred fewer than four times were removed."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-128",
"text": "Table 1 shows the number of patterns generated for each of the four models when the patterns are both filtered and unfiltered. (Although the set of unfiltered subtree patterns was not generated, it is possible to determine the number of patterns which would be generated using a process described by Stevenson and Greenwood (2006).) [Table 1: Number of patterns generated by each model.] It can be seen that the various pattern models generate vastly different numbers of patterns and that the number of subtrees is significantly greater than for the other three models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-129",
"text": "Previous analysis (see Section 3) suggested that the number of subtrees which would be generated from a corpus could be difficult to process computationally and this is supported by our findings here."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-131",
"text": "**PARAMETER TUNING**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-132",
"text": "The value of \u03b2 in equation 1 was set using a corpus separate from the one from which the patterns were generated, a methodology suggested by Sudo et al. (2003)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-133",
"text": "To generate this additional text we used the Reuters Corpus (Rose et al., 2002 ) which consists of a year's worth of newswire output."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-134",
"text": "Each document in the Reuters corpus has been manually annotated with topic codes indicating its general subject area(s)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-135",
"text": "One of these topic codes (C411) refers to management succession events and was used to identify documents which are relevant to the MUC6 IE scenario."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-136",
"text": "A corpus consisting of 348 documents annotated with code C411 and 250 documents without that code, representing irrelevant documents, was taken from the Reuters corpus to create a corpus with the same distribution of relevant and irrelevant documents as found in the MUC-6 corpus."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-137",
"text": "Unlike the MUC-6 corpus, items belonging to the required semantic classes are not annotated in the Reuters Corpus."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-138",
"text": "They were identified automatically using a named entity identifier."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-139",
"text": "The patterns generated from the MUC-6 texts were ranked using formula 1 with a variety of values of \u03b2."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-140",
"text": "These sets of ranked patterns were then used to carry out a document filtering task on the Reuters corpus, the aim of which is to differentiate documents based on whether or not they contain a relation of interest."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-141",
"text": "The various values for \u03b2 were compared by computing the area under the curve."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-142",
"text": "It was found that the optimal value for \u03b2 was 2 for all pattern models and this setting was used for the experiments."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-144",
"text": "**EVALUATION**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-145",
"text": "Evaluation was carried out by comparing the ranked lists of patterns against the dependency trees for the MUC-6 texts."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-146",
"text": "When a pattern is found to match against a tree the items which match any semantic classes in the pattern are extracted."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-147",
"text": "These items are considered to be related and compared against the gold standard data in the corpus to determine whether they are in fact related."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-148",
"text": "The precision of a set of patterns is computed as the proportion of the relations which were identified that are listed in the gold standard data."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-150",
"text": "The ranked set of patterns is evaluated incrementally, with the precision and recall of the first (highest ranked) pattern computed first."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-151",
"text": "The next pattern is then added and the relations extracted by both are evaluated."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-152",
"text": "This process continues until all patterns are exhausted."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-153",
"text": "Figure 3 shows the results when the four filtered pattern models, ranked using equation 1, are compared."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-155",
"text": "**RESULTS**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-156",
"text": "A first observation is that the chain model performs poorly in comparison to the other three models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-157",
"text": "The highest precision achieved by this model is 19.9% and recall never increases beyond 9%."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-158",
"text": "In comparison the SVO model includes patterns with extremely high precision but the maximum recall achieved by this model is low."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-159",
"text": "Analysis showed that the first three SVO patterns had very high precision."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-160",
"text": "These three patterns have precision of 90.1%, 80.8% and 78.9% respectively."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-161",
"text": "If these high precision patterns are removed the maximum precision of the SVO model is around 32%, which is comparable with the linked chain and subtree models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-162",
"text": "This suggests that, while the SVO model includes very useful extraction patterns, the format is restrictive and is unable to represent much of the information in this corpus."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-163",
"text": "The remaining two pattern models, linked chains and subtrees, have very similar performance and each achieves higher recall than the SVO model, albeit with lower precision."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-164",
"text": "The maximum recall obtained by the linked chain model is slightly lower than the subtree model but it does maintain higher precision at higher recall levels."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-165",
"text": "The maximum recall achieved by all four models is very low in this evaluation and part of the reason for this is the fact that the patterns have been filtered to allow direct comparison with the subtree model."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-166",
"text": "Figure 4 shows the results when the unfiltered SVO, chain and linked chain patterns are used."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-167",
"text": "(Performance of the filtered subtrees is also included in this graph for comparison.) This result shows that the addition of extra patterns for each model improves recall without affecting the maximum precision achieved."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-168",
"text": "The chain model also performs badly in this experiment."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-169",
"text": "Precision of the SVO model is still high (again due to the same three highly accurate patterns); however, the maximum recall achieved by this model is not particularly increased by the addition of the unfiltered patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-170",
"text": "The linked chain model benefits most from the unfiltered patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-171",
"text": "The extra patterns lead to a maximum recall which is more than double any of the other models without overly degrading precision."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-172",
"text": "The fact that the linked chain model is able to achieve such a high recall shows that it is able to represent the relations found in the MUC-6 text, unlike the SVO and chain models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-173",
"text": "It is likely that the subtrees model would also produce a set of patterns with high recall but the number of potential patterns which are allowable within this model makes this impractical."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-175",
"text": "**DISCUSSION AND CONCLUSIONS**"
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-176",
"text": "Some of the results reported for each model in these experiments are low."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-177",
"text": "Precision levels are generally below 40% (with the exception of the SVO model which achieves high precision using a small number of patterns)."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-178",
"text": "One reason for this is that the patterns were ranked using a simple unsupervised learning algorithm which allowed direct comparison of the four pattern models."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-179",
"text": "This approach only made use of information about the distribution of patterns in the corpus and it is likely that results could be improved for a particular pattern model by employing more sophisticated approaches which make use of additional information, for example the structure of the patterns."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-180",
"text": "The results presented here provide insight into the usefulness of the various pattern models by evaluating them on an actual IE task."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-181",
"text": "It is found that SVO patterns are capable of high precision but that the restricted set of possible patterns leads to low recall."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-182",
"text": "The chain model was found to perform badly with low recall and precision regardless of whether the patterns were filtered."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-183",
"text": "Performance of the linked chain and subtree models was similar when the patterns were filtered, but unfiltered linked chains were capable of achieving far higher recall than the filtered subtrees."
},
{
"sent_id": "b8244f9337456f1f90a576b2398680-C001-184",
"text": "These experiments suggest that the linked chain model is a useful one for IE since it is simple enough for an unfiltered set of patterns to be extracted and is able to represent a wider range of information than the SVO and chain models."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"b8244f9337456f1f90a576b2398680-C001-12"
],
[
"b8244f9337456f1f90a576b2398680-C001-17"
],
[
"b8244f9337456f1f90a576b2398680-C001-32",
"b8244f9337456f1f90a576b2398680-C001-35"
],
[
"b8244f9337456f1f90a576b2398680-C001-49"
],
[
"b8244f9337456f1f90a576b2398680-C001-58"
],
[
"b8244f9337456f1f90a576b2398680-C001-73"
],
[
"b8244f9337456f1f90a576b2398680-C001-87",
"b8244f9337456f1f90a576b2398680-C001-88"
],
[
"b8244f9337456f1f90a576b2398680-C001-125"
],
[
"b8244f9337456f1f90a576b2398680-C001-126"
]
],
"cite_sentences": [
"b8244f9337456f1f90a576b2398680-C001-12",
"b8244f9337456f1f90a576b2398680-C001-17",
"b8244f9337456f1f90a576b2398680-C001-35",
"b8244f9337456f1f90a576b2398680-C001-49",
"b8244f9337456f1f90a576b2398680-C001-58",
"b8244f9337456f1f90a576b2398680-C001-73",
"b8244f9337456f1f90a576b2398680-C001-88",
"b8244f9337456f1f90a576b2398680-C001-125",
"b8244f9337456f1f90a576b2398680-C001-126"
]
},
"@USE@": {
"gold_contexts": [
[
"b8244f9337456f1f90a576b2398680-C001-35"
],
[
"b8244f9337456f1f90a576b2398680-C001-77"
],
[
"b8244f9337456f1f90a576b2398680-C001-132"
]
],
"cite_sentences": [
"b8244f9337456f1f90a576b2398680-C001-35",
"b8244f9337456f1f90a576b2398680-C001-77",
"b8244f9337456f1f90a576b2398680-C001-132"
]
},
"@DIF@": {
"gold_contexts": [
[
"b8244f9337456f1f90a576b2398680-C001-100",
"b8244f9337456f1f90a576b2398680-C001-99"
]
],
"cite_sentences": [
"b8244f9337456f1f90a576b2398680-C001-100"
]
}
}
},
"ABC_6580dc2f7316cea4e0933ff515a704_9": {
"x": [
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-116",
"text": "**DEPENDENCY PARSING EXPERIMENTS**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-89",
"text": "j \u2190 |x| \u2212 1 3."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-2",
"text": "Efficiency is a prime concern in syntactic MT decoding, yet significant developments in statistical parsing with respect to asymptotic efficiency have not yet been explored in MT."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-3",
"text": "Recently, McDonald et al. (2005b) formalized dependency parsing as a maximum spanning tree (MST) problem, which can be solved in quadratic time relative to the length of the sentence."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-4",
"text": "They show that MST parsing is almost as accurate as cubic-time dependency parsing in the case of English, and that it is more accurate with free word order languages."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-5",
"text": "This paper applies MST parsing to MT, and describes how it can be integrated into a phrase-based decoder to compute dependency language model scores."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-6",
"text": "Our results show that augmenting a state-ofthe-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-9",
"text": "Hierarchical approaches to machine translation have proven increasingly successful in recent years (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008) , and often outperform phrase-based systems (Och and Ney, 2004; Koehn et al., 2003) on target-language fluency and adequacy."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-10",
"text": "However, their benefits generally come with high computational costs, particularly when chart parsing, such as CKY, is integrated with language models of high orders (Wu, 1996) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-11",
"text": "Indeed, synchronous CFG parsing with m-grams runs in O(n 3m ) time, where n is the length of the sentence."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-12",
"text": "1 Furthermore, synchronous CFG approaches often only marginally outperform the most com- 1 The algorithmic complexity of (Wu, 1996) is O(n 3+4(m\u22121) ), though Huang et al. (2005) present a more efficient factorization inspired by (Eisner and Satta, 1999) that yields an overall complexity of O(n 3+3(m\u22121) ), i.e., O(n 3m )."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-13",
"text": "In comparison, phrase-based decoding can run in linear time if a distortion limit is imposed."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-14",
"text": "Of course, this comparison holds only for approximate algorithms."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-15",
"text": "Since exact MT decoding is NP complete (Knight, 1999) , there is no exact search algorithm for either phrase-based or syntactic MT that runs in polynomial time (unless P = NP)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-16",
"text": "petitive phrase-based systems in large-scale experiments such as NIST evaluations."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-17",
"text": "2 This lack of significant difference may not be completely surprising."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-18",
"text": "Indeed, researchers have shown that gigantic language models are key to state-ofthe-art performance (Brants et al., 2007) , and the ability of phrase-based decoders to handle large-size, high-order language models with no consequence on asymptotic running time during decoding presents a compelling advantage over CKY decoders, whose time complexity grows prohibitively large with higher-order language models."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-19",
"text": "While context-free decoding algorithms (CKY, Earley, etc.) may sometimes appear too computationally expensive for high-end statistical machine translation, there are many alternative parsing algorithms that have seldom been explored in the machine translation literature."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-20",
"text": "The parsing literature presents faster alternatives for both phrasestructure and dependency trees, e.g., O(n) shiftreduce parsers and variants ( (Ratnaparkhi, 1997; Nivre, 2003) , inter alia)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-21",
"text": "While deterministic parsers are often deemed inadequate for dealing with ambiguities of natural language, highly accurate O(n 2 ) algorithms exist in the case of dependency parsing."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-22",
"text": "Building upon the theoretical work of (Chu and Liu, 1965; Edmonds, 1967) , McDonald et al. (2005b) present a quadratic-time dependency parsing algorithm that is just 0.7% less accurate than \"full-fledged\" chart parsing (which, in the case of dependency parsing, runs in time O(n 3 ) (Eisner, 1996) )."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-23",
"text": "In this paper, we show how to exploit syntactic dependency structure for better machine translation, under the constraint that the depen-dency structure is built as a by-product of phrasebased decoding, without reliance on a dynamicprogramming or chart parsing algorithm such as CKY or Earley."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-24",
"text": "Adapting the approach of McDonald et al. (2005b) for machine translation, we incrementally build dependency structure left-toright in time O(n 2 ) during decoding."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-25",
"text": "Most interestingly, the time complexity of non-projective dependency parsing remains quadratic as the order of the language model increases."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-26",
"text": "This provides a compelling advantage over previous dependency language models for MT (Shen et al., 2008) , which use a 5-gram LM only during reranking."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-27",
"text": "In our experiments, we build a competitive baseline incorporating a 5-gram LM trained on a large part of Gigaword and show that our dependency language model provides improvements on five different test sets, with an overall gain of 0.92 in TER and 0.45 in BLEU scores."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-28",
"text": "These results are found to be statistically very significant (p \u2264 .01)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-30",
"text": "**DEPENDENCY PARSING FOR MACHINE TRANSLATION**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-31",
"text": "In this section, we review dependency parsing formulated as a maximum spanning tree problem (McDonald et al., 2005b) , which can be solved in quadratic time, and then present its adaptation and novel application to phrase-based decoding."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-32",
"text": "Dependency models have recently gained considerable interest in many NLP applications, including machine translation (Ding and Palmer, 2005; Quirk et al., 2005; Shen et al., 2008) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-33",
"text": "Dependency structure provides several compelling advantages compared to other syntactic representations."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-34",
"text": "First, dependency links are close to the semantic relationships, which are more likely to be consistent across languages."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-35",
"text": "Indeed, Fox (2002) found inter-lingual phrasal cohesion to be greater than for a CFG when using a dependency representation, for which she found only 12.6% of head crossings and 9.2% modifier crossings."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-36",
"text": "Second, dependency trees contain exactly one node per word, which contributes to cutting down the search space during parsing: indeed, the task of the parser is merely to connect existing nodes rather than hypothesizing new ones."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-37",
"text": "Finally, dependency models are more flexible and account for (non-projective) head-modifier relations that CFG models fail to represent adequately, which is problematic with certain types of grammatical constructions and with free word order languages, Figure 1: A dependency tree with directed edges going from heads to modifiers."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-38",
"text": "The edge between who and hired causes this tree to be non-projective."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-39",
"text": "Such a head-modifier relationship is difficult to represent with a CFG, since all words directly or indirectly headed by hired (i.e., who, think, they, and hired) do not constitute a contiguous sequence of words."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-40",
"text": "as we will see later in this section."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-41",
"text": "The most standardly used algorithm for parsing with dependency grammars is presented in (Eisner, 1996; Eisner and Satta, 1999) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-42",
"text": "It runs in time O(n 3 ), where n is the length of the sentence."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-43",
"text": "Their algorithm exploits the special properties of dependency trees to reduce the worst-case complexity of bilexical parsing, which otherwise requires O(n 4 ) for bilexical constituency-based parsing."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-44",
"text": "While it seems difficult to improve the asymptotic running time of the Eisner algorithm beyond what is presented in (Eisner and Satta, 1999) , McDonald et al. (2005b) show O(n 2 )-time parsing is possible if trees are not required to be projective."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-45",
"text": "This relaxation entails that dependencies may cross each other rather than being required to be nested, as shown in Fig. 1 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-46",
"text": "More formally, a non-projective tree is any tree that does not satisfy the following definition of a projective tree:"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-47",
"text": "Definition. Let x = x 1 \u00b7 \u00b7 \u00b7 x n be an input sentence, and let y be a rooted tree represented as a set in which each element (i, j) \u2208 y is an ordered pair of word indices of x that defines a dependency relation between a head x i and a modifier x j ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-48",
"text": "By definition, the tree y is said to be projective if each dependency (i, j) satisfies the following property: each word in"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-49",
"text": "This relaxation is key to computational efficiency, since the parser does not need to keep track of whether dependencies assemble into contiguous spans."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-50",
"text": "It is also linguistically desirable in the case of free word order languages such as Czech, Dutch, and German."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-51",
"text": "Non-projective dependency structures are sometimes even needed for languages like English, e.g., in the case of the wh-movement shown in Fig. 1 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-52",
"text": "For languages with relatively rigid word order such as English, there may be some concern that searching the space of non-projective dependency trees, which is considerably larger than the space of projective dependency trees, would yield poor performance."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-53",
"text": "That is not the case: dependency accuracy for nonprojective parsing is 90.2% for English (McDonald et al., 2005b) , only 0.7% lower than a projective parser (McDonald et al., 2005a ) that uses the same set of features and learning algorithm."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-54",
"text": "In the case of dependency parsing for Czech, (McDonald et al., 2005b) even outperforms projective parsing, and was one of the top systems in the CoNLL-06 shared task in multilingual dependency parsing."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-56",
"text": "**O(N 2 )-TIME DEPENDENCY PARSING FOR MT**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-57",
"text": "We now formalize weighted non-projective dependency parsing similarly to (McDonald et al., 2005b) and then describe a modified and more efficient version that can be integrated into a phrasebased decoder."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-58",
"text": "Given the single-head constraint, parsing an input sentence x = (x 0 , x 1 , \u00b7 \u00b7 \u00b7 , x n ) is reduced to labeling each word x j with an index i identifying its head word x i ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-59",
"text": "We include the dummy root symbol x 0 = root so that each word can be a modifier."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-60",
"text": "We score each dependency relation using a standard linear model"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-61",
"text": "whose weight vector \u03bb is trained using MIRA (Crammer and Singer, 2003) to optimize dependency parsing accuracy (McDonald et al., 2005a) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-62",
"text": "As is commonly the case in statistical parsing, the score of the full tree is decomposed as the sum of the score of all edges:"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-63",
"text": "When there is no need to ensure projectivity, one can independently select the highest scoring edge (i, j) for each modifier x j , yet we generally want to ensure that the resulting structure is a tree, i.e., that it does not contain any circular dependencies."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-64",
"text": "This optimization problem is a known instance of the maximum spanning tree (MST) problem."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-65",
"text": "In our case, the graph is directed-indeed, the equality s(i, j) = s( j, i) is generally not true and would be linguistically aberrant-so the problem constitutes an instance of the less-known MST problem for directed graphs."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-66",
"text": "This problem is solved with the Chu-Liu-Edmonds (CLE) algorithm (Chu and Liu, 1965; Edmonds, 1967) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-67",
"text": "Formally, we represent the graph G = (V, E) with a vertex set V = x = {x 0 , \u00b7 \u00b7 \u00b7 , x n } and a set of directed edges E = [0, n] \u00d7 [1, n], in which each edge (i, j), representing the dependency x i \u2192 x j , is assigned a score s(i, j)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-68",
"text": "Finding the spanning tree y \u2282 E rooted at x 0 that maximizes s(x, y) as defined in Equation 2 has a straightforward solution in O(n 2 log(n)) time for dense graphs such as G, though Tarjan (1977) shows that the problem can be solved in O(n 2 )."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-69",
"text": "Hence, non-projective dependency parsing is solved in quadratic time."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-70",
"text": "The main idea behind the CLE algorithm is to first greedily select for each word x j the incoming edge (i, j) with highest score, then to successively repeat the following two steps: (a) identify a loop in the graph, and if there is none, halt; (b) contract the loop into a single vertex, and update scores for edges coming in and out of the loop."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-71",
"text": "Once all loops have been eliminated, the algorithm maps back the maximum spanning tree of the contracted graph onto the original graph G, and it can be shown that this yields a spanning tree that is optimal with respect to G and s (Georgiadis, 2003) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-72",
"text": "The greedy approach of selecting the highest scoring edge (i, j) for each modifier x j can easily be applied left-to-right during phrase-based decoding, which proceeds in the same order."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-73",
"text": "For each hypothesis expansion, our decoder generates the following information for the new hypothesis h:"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-74",
"text": "\u2022 a partial translation x; \u2022 a coverage set of input words c; \u2022 a translation score \u03c3 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-75",
"text": "In the case of non-projective dependency parsing, we need to maintain additional information for each word x j of the partial translation x:"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-76",
"text": "\u2022 a predicted POS tag t j ; \u2022 a dependency score s j ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-77",
"text": "Dependency scores s j are initialized to \u2212\u221e. Each time a new word is added to a partial hypothesis, the decoder executes the routine shown in Table 1 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-78",
"text": "To avoid cluttering the pseudo-code, we make here the simplifying assumption that each hypothesis expansion adds exactly one word, though the real implementation supports the case of phrases of any length."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-79",
"text": "Line 3 determines whether the translation hypothesis is complete, in which case it explicitly builds the graph G and Decoding: hypothesis expansion step."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-80",
"text": "finds the maximum spanning tree."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-81",
"text": "Note that it is impractical to identify loops each time a new word is added to a translation hypothesis, since this requires explicitly storing the dense graph G, which would require an O(n 2 ) copy operation during each hypothesis expansion; this would of course increase time and space complexity (the max operation in lines 8 and 9 only keeps the current best scoring edges)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-82",
"text": "If there is any loop, the dependency score is adjusted in the last hypothesis expansion."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-83",
"text": "In practice, we delay the computation of dependency scores involving word x j until tag t j+1 is generated, since dependency parsing accuracy is particularly low (\u22120.8%) when the next tag is unknown."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-84",
"text": "We found that dependency scores with or without loop elimination are generally close and highly correlated, and that MT performance without final loop removal was about the same (generally less than 0.2% BLEU)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-85",
"text": "While it seems that loopy graphs are undesirable when the goal is to obtain a syntactic analysis, that is not necessarily the case when one just needs a language modeling score."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-87",
"text": "**INFERER GENERATES NEW HYPOTHESIS**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-88",
"text": "h = (x, c, \u03c3 ) 2."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-90",
"text": "t j \u2190 tagger(x j\u22123 , \u00b7 \u00b7 \u00b7 , x j ) 4. if complete(c) 5."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-91",
"text": "Chu-Liu-Edmonds(h) 6."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-92",
"text": "else 7. for i = 1 to j 8."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-93",
"text": "s j = max(s j , s(i, j)) 9."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-94",
"text": "s i = max(s i , s( j, i))"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-96",
"text": "**FEATURES FOR DEPENDENCY PARSING**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-97",
"text": "In our experiments, we use sets of features that are similar to the ones used in the McDonald parser, though we make a key modification that yields an asymptotic speedup that ensures a genuine O(n 2 ) running time."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-98",
"text": "The three feature sets that were used in our experiments are shown in Table 2 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-99",
"text": "We write h-word, h-pos, m-word, m-pos to refer to head and modifier words and POS tags, and append a numerical value to shift the word offset either to the left or to the right (e.g., h-pos+1 is the POS to the right of the head word)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-100",
"text": "We use the symbol \u2227 to represent feature conjunctions."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-101",
"text": "Each feature in the table has a distinct identifier, so that, e.g., the POS features In-between POS features: h-pos are all distinct from m-pos features."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-102",
"text": "3 The primary difference between our feature sets and the ones of McDonald et al. is that their set of \"in between POS features\" includes the set of all tags appearing between each pair of words."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-103",
"text": "Extracting all these tags takes time O(n) for any arbitrary pair (i, j)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-104",
"text": "Since i and j are both free variables, feature computation in (McDonald et al., 2005b) takes time O(n 3 ), even though parsing itself takes O(n 2 ) time."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-105",
"text": "To make our parser genuinely O(n 2 ), we modified the set of in-between POS features in two ways."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-106",
"text": "First, we restrict extraction of in-between POS tags to those words that appear within a window of five words relative to either the head or the modifier."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-107",
"text": "While this change alone ensures that feature extraction is now O(1) for each word pair, this causes a fairly high drop of performance (dependency accuracy Table 4 : Dependency parsing experiments on test sentences of any length."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-108",
"text": "The projective parsing algorithm is the one implemented as in (McDonald et al., 2005a) , which is known as one of the top performing dependency parsers for English."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-109",
"text": "The O(n 3 ) non-projective parser of (McDonald et al., 2005b ) is slightly more accurate than our version, though ours runs in O(n 2 ) time. \"Local classifier\" refers to non-projective dependency parsing without removing loops as a post-processing step."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-110",
"text": "The result marked with (*) identifies the parser used for our MT experiments, which is only about 1% less accurate than a state-of-the-art dependency parser (**)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-111",
"text": "on our test was down 0.9%)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-112",
"text": "To make our genuinely O(n 2 ) parser almost as accurate as the nonprojective parser of McDonald et al., we conjoin each in-between POS with its position relative to (i, j)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-113",
"text": "This relatively simple change reduces the drop in accuracy to only 0.34%."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-117",
"text": "In this section, we compare the performance of our parsing model to the ones of McDonald et al. Since our MT test sets include newswire, web, and audio, we trained our parser on different genres."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-118",
"text": "Our training data includes newswire from the English translation treebank (LDC2007T02) and the English-Arabic Treebank (LDC2006T10), which are respectively translations of sections of the Chinese treebank (CTB) and Arabic treebank (ATB)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-119",
"text": "We also trained the parser on the broadcastnews treebank available in the OntoNotes corpus (LDC2008T04), and added sections 02-21 of the WSJ Penn treebank."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-120",
"text": "Documents 001-040 of the English CTB data were set aside to constitute a test set for newswire texts."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-121",
"text": "Our other test set is the standard Section 23 of the Penn treebank."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-122",
"text": "The splits and amounts of data used for training are displayed in Table 3 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-123",
"text": "Parsing experiments are shown in Table 4 ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-124",
"text": "We 4 We need to mention some practical considerations that make feature computation fast enough for MT."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-125",
"text": "Most features are precomputed before actual decoding."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-126",
"text": "All target-language words to appear during beam search can be determined in advance, and all their unigram feature scores are precomputed."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-127",
"text": "For features conditioned on both head and modifier, scores are cached whenever possible."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-128",
"text": "The only features that are not cached are the ones that include contextual POS tags, since their miss rate is relatively high. distinguish two experimental conditions: Parsing and MT."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-129",
"text": "For Parsing, sentences are cased and tokenization abides to the PTB segmentation as used in the Penn treebank version 3."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-130",
"text": "For the MT setting, texts are all lower case, and tokenization was changed to improve machine translation (e.g., most hyphenated words were split)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-131",
"text": "For this setting, we also had to harmonize the four treebanks."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-132",
"text": "The most crucial modification was to add NP internal bracketing to the WSJ (Vadas and Curran, 2007) , since the three other treebanks contain that information."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-133",
"text": "Treebanks were also transformed to be consistent with MT tokenization."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-134",
"text": "We evaluate MT parsing models on CTB rather than on WSJ, since CTB contains newswire and is thus more representative of MT evaluation conditions."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-135",
"text": "To obtain part-of-speech tags, we use a state-of-the-art maximum-entropy (CMM) tagger (Toutanova et al., 2003) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-136",
"text": "In the Parsing setting, we use its best configuration, which reaches a tagging accuracy of 97.25% on standard WSJ test data."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-137",
"text": "In the MT setting, we need to use a less effective tagger, since we cannot afford to perform Viterbi inference as a by-product of phrase-based decoding."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-138",
"text": "Hence, we use a simpler tagging model that assigns tag t i to word x i by only using features of words x i\u22123 \u00b7 \u00b7 \u00b7 x i , and that does not condition any decision based on any preceding or next tags (t i\u22121 , etc.)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-139",
"text": "Its performance is 95.02% on the WSJ, and 95.30% on the English CTB."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-140",
"text": "Additional experiments reveal two main contributing factors to this drop on WSJ: tagging uncased texts reduces tagging accuracy by about 1%, and using only wordbased features further reduces it by 0.6%."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-141",
"text": "Table 4 shows that the accuracy of our truly O(n 2 ) parser is only .25% to .34% worse than the O(n 3 ) implementation of (McDonald et al., 2005b) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-142",
"text": "5 Compared to the state-of-the-art projective parser as implemented in (McDonald et al., 2005a) , performance is 1.28% lower on WSJ, but only 0.95% when training on all our available data and using the MT setting."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-143",
"text": "Overall, we believe that the drop of performance is a reasonable price to pay considering the computational constraints imposed by integrating the dependency parser into an MT decoder."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-144",
"text": "The table also shows a gain of more than 1% in dependency accuracy by adding ATB, OntoNotes, and WSJ to the English CTB training set."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-145",
"text": "The four sources were assigned non-uniform weights: we set the weight of the CTB data to be 10 times larger than the other corpora, which seems to work best in our parsing experiments."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-146",
"text": "While this improvement of 1% may seem relatively small considering that the amount of training data is more than 20 times larger in the latter case, it is quite consistent with previous findings in domain adaptation, which is known to be a difficult task."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-147",
"text": "For example, (Daume III, 2007) shows that training a learning algorithm on the weighted union of different data sets (which is basically what we did) performs almost as well as more involved domain adaptation approaches."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-148",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-149",
"text": "**MACHINE TRANSLATION EXPERIMENTS**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-150",
"text": "In our experiments, we use a re-implementation of the Moses phrase-based decoder ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-151",
"text": "We use the standard features implemented almost exactly as in Moses: four translation features (phrase-based translation probabilities and lexically-weighted probabilities), word penalty, phrase penalty, linear distortion, and language model score."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-152",
"text": "We also incorporated the lexicalized reordering features of Moses, in order to experiment with a baseline that is stronger than the default Moses configuration."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-153",
"text": "The language pair for our experiments is Chinese-to-English."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-154",
"text": "The training data consists of about 28 million English words and 23.3 million 5 Note that our results on WSJ are not exactly the same as those reported in (McDonald et al., 2005b ), since we used slightly different head finding rules."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-155",
"text": "To extract dependencies from treebanks, we used the LTH Penn Converter (http:// nlp.cs.lth.se/pennconverter/), which extracts dependencies that are almost identical to those used for the CoNLL-2008 Shared Task."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-156",
"text": "We constrain the converter not to use functional tags found in the treebanks, in order to make it possible to use automatically parsed texts (i.e., perform self-training) in future work."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-157",
"text": "The Chinese words were drawn from various news parallel corpora distributed by the Linguistic Data Consortium (LDC)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-158",
"text": "In order to provide experiments comparable to previous work, we used the same corpora as (Wang et al., 2007) : LDC2002E18, LDC2003E07, LDC2003E14, LDC2005E83, LDC2005T06, LDC2006E26, LDC2006E8, and LDC2006G05."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-159",
"text": "Chinese words were automatically segmented with a conditional random field (CRF) classifier (Chang et al., 2008) that conforms to the Chinese Treebank (CTB) standard."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-160",
"text": "In order to train a competitive baseline given our computational resources, we built a large 5-gram language model using the Xinhua and AFP sections of the Gigaword corpus (LDC2007T40) in addition to the target side of the parallel data."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-161",
"text": "This data represents a total of about 700 million words."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-162",
"text": "We manually removed documents of Gigaword that were released during periods that overlap with those of our development and test sets."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-163",
"text": "The language model was smoothed with the modified Kneser-Ney algorithm as implemented in (Stolcke, 2002), and we only kept 4-grams and 5-grams that occurred at least three times in the training data."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-164",
"text": "For tuning and testing, we use the official NIST MT evaluation data for Chinese from 2002 to 2008 (MT02 to MT08), which all have four English references for each input sentence."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-165",
"text": "We used the 1082 sentences of MT05 for tuning and all other sets for testing."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-166",
"text": "Parameter tuning was done with minimum error rate training (MERT; Och, 2003), which was used to maximize BLEU (Papineni et al., 2001)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-167",
"text": "Since MERT is prone to search errors, especially with large numbers of parameters, we ran each tuning experiment three times with different initial conditions."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-168",
"text": "We used n-best lists of size 200 and a beam size of 200."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-169",
"text": "In the final evaluations, we report results using both TER (Snover et al., 2006) and the original BLEU metric as described in (Papineni et al., 2001)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-170",
"text": "All our evaluations are performed on uncased texts."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-171",
"text": "The results for our translation experiments are shown in Table 5."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-172",
"text": "We compared two systems: the first uses the set of features described earlier in this section."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-173",
"text": "The second system incorporates one additional feature: the dependency language model. (Table 5: MT experiments with and without a dependency language model.)"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-174",
"text": "We use randomization tests (Riezler and Maxwell, 2005) to determine significance: differences marked with a (*) are significant at the p \u2264 .05 level, and those marked as (**) are significant at the p \u2264 .01 level."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-175",
"text": "The score of this model is computed with the dependency parsing algorithm described in Section 2."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-176",
"text": "We used the dependency model trained on the English CTB and ATB treebanks, WSJ, and OntoNotes."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-177",
"text": "We see that the Moses decoder with integrated dependency language model systematically outperforms the Moses baseline."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-178",
"text": "For BLEU evaluations, differences are significant in four out of six cases, and in the case of TER, all differences are significant."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-179",
"text": "Regarding the small difference in BLEU scores on MT08, we would like to point out that tuning on MT05 and testing on MT08 had a rather adverse effect with respect to translation length: while the two systems are relatively close in terms of BLEU scores (24.83 and 24.91, respectively), the dependency LM provides a much bigger gain when evaluated with BLEU precision (27.73 vs. 28.79), i.e., by ignoring the brevity penalty."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-180",
"text": "On the other hand, the difference on MT08 is significant in terms of TER."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-181",
"text": "Table 6 provides experimental results on the NIST test data (excluding the tuning set MT05) for each of the three genres: newswire, web data, and speech (broadcast news and conversation)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-182",
"text": "The last column displays results for all test sets combined."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-183",
"text": "Results do not suggest any noticeable difference between genres, and the dependency language model provides significant gains on all genres, despite the fact that this model was primarily trained on news data."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-184",
"text": "We wish to emphasize that our positive results are particularly noteworthy because they are achieved over a baseline incorporating a competitive 5-gram language model."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-185",
"text": "As is widely acknowledged in the speech community, it can be difficult to outperform high-order n-gram models in large-scale experiments."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-186",
"text": "Finally, we quantified the effective running time of our phrase-based decoder with and without our dependency language model using MT05 (Fig. 2)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-187",
"text": "In both settings, we selected the best tuned model, which yields the performance shown in the first column of Table 5."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-188",
"text": "Our decoder was run on an AMD Opteron Processor 2216 with 16GB of memory, and without resorting to any rescoring method such as cube pruning."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-189",
"text": "In the case of English translations of 40 words and shorter, the baseline system took 6.5 seconds per sentence, whereas the dependency LM system spent 15.6 seconds per sentence, i.e., 2.4 times the baseline running time."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-190",
"text": "In the case of translations longer than 40 words, average speeds were respectively 17.5 and 59.5 seconds per sentence, i.e., the dependency LM system was only 3.4 times slower."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-193",
"text": "**RELATED WORK**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-194",
"text": "Perhaps due to the high computational cost of synchronous CFG decoding, there have been various attempts to exploit syntactic knowledge and hierarchical structure in other machine translation experiments that do not require chart parsing."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-195",
"text": "Using a reranking framework, previous work found that various types of syntactic features provided only minor gains in performance, suggesting that phrase-based systems (Och and Ney, 2004) should exploit such information during rather than after decoding."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-196",
"text": "Wang et al. (2007) sidestep the need to operate large-scale word order changes during decoding (and thus lessening the need for syntactic decoding) by rearranging input words in the training data to match the syntactic structure of the target language."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-197",
"text": "Finally, factored phrase-based translation models have been exploited to associate each word with a supertag, which contains most of the information needed to build a full parse."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-198",
"text": "When combined with a supertag n-gram language model, it helps enforce grammatical constraints on the target side."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-199",
"text": "There have been various attempts to reduce the computational expense of syntactic decoding, including multi-pass decoding approaches (Zhang and Gildea, 2008; Petrov et al., 2008) and rescoring approaches (Huang and Chiang, 2007) ."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-200",
"text": "In the latter paper, Huang and Chiang introduce rescoring methods named \"cube pruning\" and \"cube growing\", which first use a baseline decoder (either synchronous CFG or a phrase-based system) with no LM to generate a hypergraph, and then rescore this hypergraph with a language model."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-201",
"text": "Huang and Chiang show significant speed increases with little impact on translation quality."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-202",
"text": "We believe that their approach is orthogonal (and possibly complementary) to our work, since our paper proposes a new model for fully-integrated decoding that increases MT performance, and does not rely on rescoring."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-203",
"text": "----------------------------------"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-204",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-205",
"text": "In this paper, we presented a non-projective dependency parser whose time complexity of O(n^2) improves upon the cubic-time implementation of (McDonald et al., 2005b), and does so with little loss in dependency accuracy (.25% to .34%)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-206",
"text": "Since this parser does not need to enforce projectivity constraints, it can easily be integrated into a phrase-based decoder during search (rather than during rescoring)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-207",
"text": "We used dependency scores as an extra feature in our MT experiments, and found that our dependency model provides significant gains over a competitive baseline that incorporates a large 5-gram language model (0.92% TER and 0.45% BLEU absolute improvements)."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-208",
"text": "We plan to pursue other research directions using dependency models discussed in this paper."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-209",
"text": "While we use a dependency language model to exemplify the use of hierarchical structure within phrase-based decoders, we could extend this work to incorporate dependency features of both the source and target side."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-210",
"text": "Since parsing of the source is relatively inexpensive compared to the target side, it would be relatively easy to condition head-modifier dependencies not only on the two target words, but also on their corresponding Chinese words and their relative positions in the Chinese tree."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-211",
"text": "This would enable the decoder to capture syntactic reordering without requiring trees to be isomorphic or even projective."
},
{
"sent_id": "6580dc2f7316cea4e0933ff515a704-C001-212",
"text": "It would also be interesting to apply these models to target languages that have free word order, which would presumably benefit more from the flexibility of non-projective dependency models."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"6580dc2f7316cea4e0933ff515a704-C001-3"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-22"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-31"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-44"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-52",
"6580dc2f7316cea4e0933ff515a704-C001-53"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-54"
]
],
"cite_sentences": [
"6580dc2f7316cea4e0933ff515a704-C001-3",
"6580dc2f7316cea4e0933ff515a704-C001-22",
"6580dc2f7316cea4e0933ff515a704-C001-31",
"6580dc2f7316cea4e0933ff515a704-C001-44",
"6580dc2f7316cea4e0933ff515a704-C001-53",
"6580dc2f7316cea4e0933ff515a704-C001-54"
]
},
"@EXT@": {
"gold_contexts": [
[
"6580dc2f7316cea4e0933ff515a704-C001-24"
]
],
"cite_sentences": [
"6580dc2f7316cea4e0933ff515a704-C001-24"
]
},
"@USE@": {
"gold_contexts": [
[
"6580dc2f7316cea4e0933ff515a704-C001-57"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-205"
]
],
"cite_sentences": [
"6580dc2f7316cea4e0933ff515a704-C001-57",
"6580dc2f7316cea4e0933ff515a704-C001-205"
]
},
"@MOT@": {
"gold_contexts": [
[
"6580dc2f7316cea4e0933ff515a704-C001-104"
]
],
"cite_sentences": [
"6580dc2f7316cea4e0933ff515a704-C001-104"
]
},
"@DIF@": {
"gold_contexts": [
[
"6580dc2f7316cea4e0933ff515a704-C001-109"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-141"
],
[
"6580dc2f7316cea4e0933ff515a704-C001-154"
]
],
"cite_sentences": [
"6580dc2f7316cea4e0933ff515a704-C001-109",
"6580dc2f7316cea4e0933ff515a704-C001-141",
"6580dc2f7316cea4e0933ff515a704-C001-154"
]
}
}
},
"ABC_e41a1adcb5c9d91f2130bd249ed598_9": {
"x": [
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-60",
"text": "We use 12 MFCCs and the log energy feature and add the first and second derivatives resulting in 39-dimensional feature vectors."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-61",
"text": "We compute the MFCCs using 25 ms analysis windows with a 5 ms shift."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-62",
"text": "The MBN features are created using a pre-trained DNN made available by [21] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-24",
"text": "In previous work [8] we used image-caption retrieval, where given a written caption the model must return the matching image and vice versa."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-63",
"text": "In short, the network is trained on multilingual speech data (11 languages, no English) to classify phoneme states."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-25",
"text": "We trained deep neural networks (DNNs) to create sentence embeddings without the use of prior knowledge of lexical semantics (see [7, 9, 10] for other studies on this task)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-26",
"text": "The visually grounded sentence embeddings that arose capture semantic information about the sentence as measured by the Semantic Textual Similarity task (see [11] ), performing comparably to text-only methods that require word embeddings."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-27",
"text": "In the current study we present an image-caption retrieval model that extends our previous work to spoken input."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-28",
"text": "In [12, 13] , the authors adapted text based caption-image retrieval (e.g. [9] ) and showed that it is possible to perform speech-image retrieval using convolutional neural networks on spectral features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-29",
"text": "Our work is most closely related to the models presented in [12, 13, 14, 15] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-30",
"text": "In the current study we improve upon these previous approaches to visual grounding of speech and present state-of-the-art image-caption retrieval results."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-31",
"text": "The work by [12, 13, 14, 15] and the results presented here are a step towards more cognitively plausible models of language learning as it is more natural to learn language without prior assumptions about the lexical level."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-32",
"text": "For instance, research indicates that the adult lexicon contains many relatively fixed multi-word expressions (e.g., 'how-are-you-doing') [16] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-33",
"text": "Furthermore, early during language acquisition the lexicon consists of entire utterances before a child's language use becomes more adult-like [16, 17, 18, 19] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-34",
"text": "Image to spoken-caption retrieval models do not know a priori which constituents of the input are important and have no prior knowledge of lexical level semantics."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-35",
"text": "We probe the resulting model to investigate whether it learns to recognise lexical units in the input without being explicitly trained to do so."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-36",
"text": "We test two types of acoustic features: Mel Frequency Cepstral Coefficients (MFCCs) and Multilingual Bottleneck (MBN) features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-2",
"text": "Humans learn language by interaction with their environment and listening to other humans."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-3",
"text": "It should also be possible for computational models to learn language directly from speech but so far most approaches require text."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-4",
"text": "We improve on existing neural network approaches to create visually grounded embeddings for spoken utterances."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-5",
"text": "Using a combination of a multi-layer GRU, importance sampling, cyclic learning rates, ensembling and vectorial self-attention, our results show a remarkable increase in image-caption retrieval performance over previous work."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-6",
"text": "Furthermore, we investigate which layers in the model learn to recognise words in the input."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-7",
"text": "We find that deeper network layers are better at encoding word presence, although the final layer has slightly lower performance."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-8",
"text": "This shows that our visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-11",
"text": "Most computational models of natural language processing (NLP) are based on written language; machine translation, sentence meaning representation and language modelling to name a few (e.g. [1, 2] )."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-12",
"text": "Even if the task inherently involves speech, such as in automatic speech recognition, models require large amounts of transcribed speech [3] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-13",
"text": "Yet, humans are capable of learning language from raw sensory input, and furthermore children learn to communicate long before they are able to read."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-14",
"text": "In fact, many languages have no orthography at all and there are also languages of which the writing system is not widely used by its speakers."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-15",
"text": "Text-based models cannot be used for these languages and applications like search engines and automated translators cannot serve these populations."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-16",
"text": "There has been increasing interest in learning language from more natural input, such as directly from the speech signal, or multi-modal input (e.g. speech and vision)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-17",
"text": "This has several advantages such as removing the need for expensive annotation of speech, being applicable to low resource languages and being more plausible as a model of human language learning."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-18",
"text": "An important challenge in learning language from spoken input is the fact that the input is not presented in neatly segmented tokens."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-19",
"text": "An auditory signal does not contain neat breaks in between words like the spaces in text."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-20",
"text": "Furthermore, no two realisations of the same spoken word are ever exactly the same."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-21",
"text": "As such, spoken input cannot be represented by conventional word embeddings (e.g. word2vec [4] , GloVe [5] )."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-22",
"text": "These text-based embeddings are trained to encode word-level semantic knowledge and have become a mainstay in work on sentence representations (e.g. [6, 7])."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-23",
"text": "When we want to learn language directly from speech, we will have to do so in a more end-to-end fashion, without prior lexical level knowledge in terms of both form and semantics."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-59",
"text": "The MFCCs were created using 40 Mel-spaced filterbanks."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-37",
"text": "MFCCs are features that can be computed for any speech signal without needing any other data, while the MBN features are 'learned' features that result from training a network on top of MFCCs in order to recognise phoneme states."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-38",
"text": "While MBN features have been shown to be useful in several speech recognition tasks (e.g. [20, 21] ), learned audio features face the same issue as word embeddings, as humans learn to extract useful features from the audio signal as a result of learning to understand language and not as a separate process."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-39",
"text": "However, the MBN features can still be useful where system performance is more important than cognitive plausibility, for instance in a low resource setting."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-40",
"text": "Furthermore, these features could provide a clue as to what performance would be possible if we had more sophisticated models or more data to improve the feature extraction from the MFCCs in an end-to-end fashion."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-41",
"text": "In summary, we improve on previous spoken-caption to image retrieval models and investigate whether our model learns to recognise words in the speech signal."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-42",
"text": "We show that our model achieves state-of-the-art results on the Flickr8k database, outperforming previous models by a large margin using both MFCCs and MBN features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-43",
"text": "We find that our model learns to recognise words in the input signal and show that the deeper layers are better at encoding this information."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-44",
"text": "Recognition performance drops a little in the last two layers as the network abstracts away from the detection of specific words in the input and learns to map the utterances to the joint embedding space."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-45",
"text": "We released the code for this project on github: https://github.com/DannyMerkx/speech2image/tree/Interspeech19."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-47",
"text": "**IMAGE TO SPOKEN-CAPTION RETRIEVAL: MATERIALS**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-48",
"text": "Our model is trained on the Flickr8k database [22] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-49",
"text": "Flickr8k contains 8,000 images taken from online photo sharing application Flickr.com, for which five English captions per image are available."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-50",
"text": "Annotators were asked to write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-51",
"text": "Spoken captions for Flickr8k were collected by [12] by having Amazon Mechanical Turk workers pronounce the original written captions."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-52",
"text": "We used the data split provided by [9] , with 6,000 images for training and a development and test set both of 1,000 images."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-54",
"text": "**IMAGE AND ACOUSTIC FEATURES**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-55",
"text": "To extract image features, all images are resized such that the smallest side is 256 pixels while keeping the aspect ratio intact."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-56",
"text": "We take ten 224 by 224 crops of the image: one from each corner, one from the middle and the same five crops for the mirrored image."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-57",
"text": "We use ResNet-152 [23] pretrained on ImageNet to extract visual features from these ten crops and then average the features of the ten crops into a single vector with 2,048 features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-58",
"text": "We test two types of acoustic features: Mel Frequency Cepstral Coefficients (MFCCs) and Multilingual Bottleneck (MBN) features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-64",
"text": "The MBN features consist of the outputs of intermediate network layers where the network is compressed from 1500 features to 30 features (see [21] for the full details of the network and training)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-66",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-67",
"text": "Our multimodal encoder maps images and their corresponding captions to a common embedding space."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-68",
"text": "The idea is to make matching images and captions lie close together and mismatched images and captions lie far apart in the embedding space."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-69",
"text": "Our model consists of two parts: an image encoder and a sentence encoder, as depicted in Figure 1."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-70",
"text": "The approach is based on our own text-based model described in [8] and on the speech-based models described in [13, 15] and we refer to those studies for more details."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-71",
"text": "Here, we focus on the differences with previous work."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-72",
"text": "For the image encoder we use a single-layer linear projection on top of the pretrained image recognition model, and normalise the result to have unit L2 norm."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-73",
"text": "The image encoder has 2048 input units and 2048 output units."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-74",
"text": "Our caption encoder consists of three main components."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-75",
"text": "First we apply a 1-dimensional convolutional layer to the acoustic input features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-76",
"text": "The convolution has a stride of size 2, kernel size 6 and 64 output channels."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-77",
"text": "This is the only layer where the model differs from the text-based model, which features a character embedding layer instead of a convolutional layer."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-78",
"text": "The resulting features are then fed into a bi-directional Gated Recurrent Unit (GRU), followed by a self-attention layer, and are lastly normalised to have unit L2 norm."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-79",
"text": "We use a 3-layer bi-directional GRU which allows the network to capture long-range dependencies in the acoustic signal (see [24] for a more detailed description of the GRU)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-80",
"text": "Furthermore, by making the layer bi-directional we let the network process the output of the convolutional layer from left to right and vice versa, allowing the model to capture dependencies in both directions."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-81",
"text": "We use a GRU with 1024 units, and concatenate the bidirectional representations resulting in hidden states of size 2048."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-82",
"text": "Finally, the self-attention layer computes a weighted sum over all the hidden GRU states:"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-83",
"text": "where a_t is the attention vector for hidden state h_t, and W, V, b_w, and b_v denote the weights and biases."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-84",
"text": "The applied attention is then the sum over the Hadamard product between all hidden states (h_1, ..., h_T) and their attention vectors."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-85",
"text": "We use 128 units for W and 2048 units for V."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-87",
"text": "**TRAINING**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-88",
"text": "Following [8] , the model is trained to embed the images and captions such that the cosine similarity between image and caption pairs is larger (by a certain margin) than the similarity between mismatching pairs."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-89",
"text": "This so-called hinge loss L as a function of the network parameters \u03b8 is given by:"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-90",
"text": "where the other caption-image pairs in the batch serve to create mismatched pairs (c, i \u2032 ) and (c \u2032 , i)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-91",
"text": "We take the cosine similarity cos(x, y) and subtract the similarity of the mismatched pairs from the matching pairs such that the loss is only zero when the matching pair is more similar than the mismatched pairs by a margin \u03b1."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-92",
"text": "We use importance sampling to select the mismatched pairs; rather than using all the other samples in the mini-batch as mismatched pairs (as done in [8, 15] ), we calculate the loss using only the hardest examples (i.e. mismatched pairs with high cosine similarity)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-93",
"text": "While [10] used only the single hardest example in the batch for text-captions, we found that this did not work for the spoken captions."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-94",
"text": "Instead, we found that using the hardest 25 percent worked well."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-95",
"text": "The networks are trained using Adam [25] with a cyclic learning rate schedule based on [26] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-96",
"text": "The learning rate schedule varies the learning rate smoothly between a minimum and maximum bound, which were set to 10^-6 and 2 \u00d7 10^-4 respectively."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-97",
"text": "The learning rate schedule causes the network to visit several local minima during training, allowing us to use snapshot ensembling [27] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-98",
"text": "By saving the network parameters at each local minimum, we can ensemble the embeddings of multiple networks at no extra cost."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-99",
"text": "We use a margin \u03b1 = 0.2 for the loss function."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-100",
"text": "We train the networks for 32 epochs and take a snapshot for ensembling at every fourth epoch."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-101",
"text": "For ensembling we use the two snapshots with the highest performance on the development data and simply sum their embeddings."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-102",
"text": "The main differences with the approaches described in [13, 15] are the use of multi-layered GRUs, importance sampling, the cyclic learning rate, snapshot ensembling and the use of vectorial rather than scalar attention."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-104",
"text": "**WORD PRESENCE DETECTION**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-105",
"text": "While our model is not explicitly trained to recognise words or segment the speech signal, previous work has shown that such information can be extracted by visual grounding models [15, 28] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-106",
"text": "[15] use a binary decision task: given a word and a sentence embedding, decide if the word occurs in the sentence."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-107",
"text": "Our approach is similar to the spoken-bag-of-words prediction task described in [28] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-108",
"text": "Given a sentence embedding created by our model, a classifier has to decide which of the words in its vocabulary occur in the sentence."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-109",
"text": "Based on the original written captions, our database contains 7,374 unique words with a combined occurrence frequency of 324,480."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-110",
"text": "From these we select words that occur between 50 and 1,000 times and are over 3 characters long, so that there are enough examples in the data for the model to learn to recognise them, and to filter out punctuation, spelling mistakes, numerals and most function words."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-111",
"text": "This leaves 460 unique words, mostly verbs and nouns, with a combined occurrence frequency of 87,020 in our data."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-112",
"text": "We construct a vector for each sentence in Flickr8k indicating which of these words is present."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-113",
"text": "We do not encode multiple occurrences of the same word in one sentence."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-114",
"text": "The words described above are used as targets for a neural network classifier consisting of a single feed forward layer with 460 units."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-115",
"text": "This layer simply takes an embedding vector as input and maps it to the 460 target words."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-116",
"text": "We then apply the standard logistic function and calculate the Binary Cross Entropy loss to train the network."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-117",
"text": "We train five word detection networks for both the MFCC and the MBN based caption encoders, in order to see how word presence is encoded in the different neural network layers."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-118",
"text": "We train networks for the final output layer, the three intermediate layers of the GRU and the acoustic features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-119",
"text": "For the final layer we simply use the output embedding as input to the word detection network."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-120",
"text": "We apply some post-processing to the acoustic features and the intermediate layer outputs to ensure that our word detection inputs are all of the same size."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-121",
"text": "As the intermediate GRU layers produce 2048 features for each time step in the signal, we use average-pooling along the temporal dimension to create a single input vector and normalise the result to have unit L2 norm."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-122",
"text": "The acoustic features consist of 30 (MBN) or 39 (MFCC) features for each time step, so we apply the convolutional layer followed by an untrained GRU layer to the input features, use average-pooling and normalise the result to have unit L2 norm."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-123",
"text": "The word detection networks are trained for 32 epochs using Adam [25] with a constant learning rate of 0.001."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-124",
"text": "We use the same data split that was used for training the multi-modal encoder, so that we test word presence detection on data that was not seen by either the encoder or the decoder."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-125",
"text": "Table 1 shows the performance of our models on the image-caption retrieval task."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-126",
"text": "The caption embeddings are ranked by cosine distance to the image and vice versa, where R@N is the percentage of test items for which the correct image or caption was in the top N results."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-127",
"text": "We compare our models to [12] and [15] , and also include our own character-based model."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-128",
"text": "[12] is a convolutional approach, whereas [15] is an approach using recurrent highway networks with scalar attention."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-129",
"text": "The character-based model is similar to the model we use here and was trained on the original Flickr8k text captions (see [8] for a full description)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-130",
"text": "Both our MFCC and MBN based models significantly outperform previous spoken-caption-to-image methods on the Flickr8k dataset."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-131",
"text": "The largest improvement comes from the MBN model, which outperforms the results reported in [15] by as much as 23.2 percentage points on R@10."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-132",
"text": "The MFCC model also improves on previous results, by as much as 12.3 percentage points, but scores significantly lower than the MBN model across the board."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-133",
"text": "There is a large performance gap between the text-caption to image retrieval results and the spoken-caption to image results, showing there is still a lot of room for improvement."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-135",
"text": "**RESULTS**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-136",
"text": "The results of the word presence detection task are shown in Figure 2 and Table 2 ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-137",
"text": "Figure 2 shows the F1 score for all the classifiers at 20 equally spaced detection thresholds (i.e. a word is classified as 'present' if the word detection output is above this threshold)."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-138",
"text": "Table 2 displays the area under the curve for the receiver operating characteristic."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-139",
"text": "Even though the MBN model outperforms the MFCC model for all layers, we see the same pattern emerging from both the F1 score and the AUC."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-140",
"text": "The performance on the feature level is not much better than random."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-141",
"text": "Predicting 'not present' for every word would be the best random guess as this is a heavy majority class in this task."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-142",
"text": "Inspection of the predictions shows that the classifier is indeed heavily biased towards the majority class for the input features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-143",
"text": "Then we see the performance increasing for the first layer and peaking at the second layer."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-144",
"text": "The performance then drops slightly for the third layer and the attention layer."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-145",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-146",
"text": "**DISCUSSION AND CONCLUSION**"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-147",
"text": "We trained an image-caption retrieval model on spoken input and investigated whether it learns to recognise linguistic units in the input."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-148",
"text": "As improvements over previous work we used a 3-layer GRU and employed importance sampling, cyclic learning rates, ensembling and vectorial self-attention."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-149",
"text": "Our results on both MBN and MFCC features are significantly higher than the previous state-of-the-art."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-150",
"text": "The largest improvement comes from using the learned MBN features but our approach also improves results for MFCCs, which are the same features as were used in [15] ."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-151",
"text": "The learned MBN features provide better performance whereas the MFCCs are more cognitively plausible input features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-152",
"text": "The probing task shows that the model learns to recognise these words in the input."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-153",
"text": "The system is not explicitly optimised to do so, but our results show that the lower layers learn to recognise this form-related information from the input."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-154",
"text": "After layer 2, the performance starts to decrease slightly, which might indicate that these layers learn a more task-specific representation; it is to be expected that the final attention layer specialises in mapping from audio features to the multi-modal embedding space."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-155",
"text": "In conclusion, we presented what are, to the best of our knowledge, the best results on spoken-caption to image retrieval."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-156",
"text": "Our results improve significantly over previous approaches for both untrained and trained audio features."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-157",
"text": "In a probing task, we show that the model learns to recognise words in the input speech signal."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-158",
"text": "We are currently collecting the Semantic Textual Similarity (STS) database in spoken format and the next step will be to investigate whether the model presented here also learns to capture sentence level semantic information and understand language in a deeper sense than recognising word presence."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-159",
"text": "The work presented in [15] has made the first efforts in this regard and we aim to extend this to a larger database with sentences from multiple domains."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-160",
"text": "Furthermore, we want to investigate the linguistic units that our model learns to recognise."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-161",
"text": "In the current study, we only investigated whether the model learns to recognise words, but the potential benefit of our model is that it might learn multi-word statements or might even learn to look at sub-lexical level information."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-162",
"text": "[14, 29] have recently shown that the speech-to-image retrieval approach can be used to detect word boundaries and even discover sub-word units."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-163",
"text": "Our interest is in investigating how these word and sub-word units develop over training and through the network layers."
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "e41a1adcb5c9d91f2130bd249ed598-C001-165",
"text": "**ACKNOWLEDGEMENTS**"
}
],
"y": {
"@SIM@": {
"gold_contexts": [
[
"e41a1adcb5c9d91f2130bd249ed598-C001-29"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-150"
]
],
"cite_sentences": [
"e41a1adcb5c9d91f2130bd249ed598-C001-29",
"e41a1adcb5c9d91f2130bd249ed598-C001-150"
]
},
"@USE@": {
"gold_contexts": [
[
"e41a1adcb5c9d91f2130bd249ed598-C001-29",
"e41a1adcb5c9d91f2130bd249ed598-C001-30"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-70"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-127"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-159"
]
],
"cite_sentences": [
"e41a1adcb5c9d91f2130bd249ed598-C001-29",
"e41a1adcb5c9d91f2130bd249ed598-C001-70",
"e41a1adcb5c9d91f2130bd249ed598-C001-127",
"e41a1adcb5c9d91f2130bd249ed598-C001-159"
]
},
"@BACK@": {
"gold_contexts": [
[
"e41a1adcb5c9d91f2130bd249ed598-C001-31"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-70"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-105"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-106"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-128"
]
],
"cite_sentences": [
"e41a1adcb5c9d91f2130bd249ed598-C001-31",
"e41a1adcb5c9d91f2130bd249ed598-C001-70",
"e41a1adcb5c9d91f2130bd249ed598-C001-105",
"e41a1adcb5c9d91f2130bd249ed598-C001-106",
"e41a1adcb5c9d91f2130bd249ed598-C001-128"
]
},
"@DIF@": {
"gold_contexts": [
[
"e41a1adcb5c9d91f2130bd249ed598-C001-92"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-102"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-131"
],
[
"e41a1adcb5c9d91f2130bd249ed598-C001-150"
]
],
"cite_sentences": [
"e41a1adcb5c9d91f2130bd249ed598-C001-92",
"e41a1adcb5c9d91f2130bd249ed598-C001-102",
"e41a1adcb5c9d91f2130bd249ed598-C001-131",
"e41a1adcb5c9d91f2130bd249ed598-C001-150"
]
},
"@FUT@": {
"gold_contexts": [
[
"e41a1adcb5c9d91f2130bd249ed598-C001-159"
]
],
"cite_sentences": [
"e41a1adcb5c9d91f2130bd249ed598-C001-159"
]
}
}
},
"ABC_14fa8c3b947667244d30dd30dae89a_9": {
"x": [
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-52",
"text": "Denote the mapping as \u03a6 e (x)."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-2",
"text": "We investigate a combination of a traditional linear sparse feature model and a multi-layer neural network model for deterministic transition-based dependency parsing, by integrating the sparse features into the neural model."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-3",
"text": "Correlations are drawn between the hybrid model and previous work on integrating word embedding features into a discrete linear model."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-4",
"text": "By analyzing the results of various parsers on web-domain parsing, we show that the integrated model is a better way to combine traditional and embedding features compared with previous methods."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-7",
"text": "Transition-based parsing algorithms construct output syntax trees using a sequence of shift-reduce actions."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-8",
"text": "They are attractive for their computational efficiency, allowing linear time decoding with deterministic (Nivre, 2008) or beam-search (Zhang and Clark, 2008) algorithms."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-9",
"text": "Using rich non-local features, transition-based parsers achieve state-of-the-art accuracies for dependency parsing (Zhang and Nivre, 2011; Zhang and Nivre, 2012; Bohnet and Nivre, 2012; Choi and McCallum, 2013; Zhang et al., 2014) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-10",
"text": "Deterministic transition-based parsers work by making a sequence of greedy local decisions (Nivre et al., 2004; Honnibal et al., 2013; Goldberg et al., 2014; G\u00f3mez-Rodr\u00edguez and Fern\u00e1ndez-Gonz\u00e1lez, 2015) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-11",
"text": "They are attractive for their very fast speeds."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-12",
"text": "Traditionally, a linear model has been used for the local action classifier."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-13",
"text": "Recently, Chen and Manning (2014) use a neural network (NN) to replace linear models, and report improved accuracies."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-14",
"text": "A contrast between a neural network model and a linear model is shown in Figure 1. A neural network model takes continuous vector representations of words as inputs, which can be pre-trained using large amounts of unlabeled data, thus containing more information."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-15",
"text": "In addition, using an extra hidden layer, a neural network is capable of learning non-linear relations between automatic features, achieving feature combinations automatically."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-16",
"text": "Discrete manual features and continuous features complement each other."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-17",
"text": "A natural question that arises from the contrast is whether traditional discrete features and continuous neural features can be integrated for better accuracies."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-18",
"text": "We study this problem by constructing the neural network shown in Figure 1 (e), which incorporates the discrete input layer of the linear model (Figure 1 (a) ) into the NN model (Figure 1 (b) ) by conjoining it with the hidden layer."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-19",
"text": "This architecture is connected with previous work on incorporating word embeddings into a linear model."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-20",
"text": "In particular, Turian et al. (2010) incorporate word embeddings as real-valued features into a CRF model."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-21",
"text": "The architecture is shown in Figure 1 (c), which can be regarded as Figure 1 (e) without the hidden layer."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-22",
"text": "Guo et al. (2014) find that the accuracies of Turian et al. can be enhanced by discretizing the embedding features before combining them with the traditional features."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-23",
"text": "They use simple binarization and clustering to this end, finding that the latter works better."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-24",
"text": "The architecture is shown in Figure 1(d) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-25",
"text": "In contrast, Figure 1 (e) directly combines discrete and continuous features, replacing the hard-coded transformation function of Guo et al. (2014) with a hidden layer, which can be tuned by supervised training."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-26",
"text": "We correlate and compare all five systems in Figure 1 empirically, using the SANCL 2012 data (Petrov and McDonald, 2012) and the standard Penn Treebank data."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-27",
"text": "Results show that the method of this paper gives higher accuracies than the other methods."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-28",
"text": "In addition, the method of Guo et al. (2014) gives slightly better accuracies compared to the method of Turian et al. (2010) for parsing task, consistent with Guo et al's observation on named entity recognition (NER)."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-29",
"text": "We make our C++ code publicly available under GPL at https://github.com/ SUTDNLP/NNTransitionParser."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-31",
"text": "**PARSER**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-32",
"text": "We take the parser of Chen and Manning (2014) , which uses the arc-standard transition system (Nivre, 2008) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-33",
"text": "Given a POS-tagged input sentence, it builds a projective output y by performing a sequence of state transition actions using greedy search."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-34",
"text": "Chen and Manning (2014) can be viewed as a neural alternative to MaltParser (Nivre, 2008) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-35",
"text": "Although not giving state-of-the-art accuracies, deterministic parsing is attractive for its high parsing speed (1000+ sentences per second)."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-36",
"text": "Our incorporation of discrete features does not harm the overall speed significantly."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-37",
"text": "In addition, deterministic parsers use standard neural classifiers, which allows isolated study of feature influences."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-39",
"text": "**MODELS**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-40",
"text": "Following Chen and Manning (2014) , we train all the models using a cross-entropy loss objective with L2 regularization and mini-batched AdaGrad (Duchi et al., 2011) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-41",
"text": "We unify below the five deterministic parsing models in Figure 1 ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-43",
"text": "**BASELINE LINEAR (L)**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-44",
"text": "We build a baseline linear model using logistic regression (Figure 1(a) )."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-45",
"text": "Given a parsing state x, a vector of discrete features \u03a6 d (x) is extracted according to the arc-standard feature templates of Ma et al. (2014a) , which is based on the arc-eager templates of Zhang and Nivre (2011) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-46",
"text": "The score of an action a is defined by"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-47",
"text": "where \u03c3 represents the sigmoid activation function, \u03b8 d is the set of model parameters denoting the feature weights with respect to actions, and a can be SHIFT, LEFT(l) or RIGHT(l)."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-49",
"text": "**BASELINE NEURAL (NN)**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-50",
"text": "We take the Neural model of Chen and Manning (2014) as another baseline (Figure 1(b) )."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-51",
"text": "Given a parsing state x, the words are first mapped into continuous vectors by using a set of pre-trained word embeddings."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-53",
"text": "In addition, denote the hidden layer as \u03a6 h , and the ith node in the hidden layer as \u03a6 h,i (0 \u2264 i \u2264 |\u03a6 h |)."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-54",
"text": "The hidden layer is defined as"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-55",
"text": "where \u03b8 h is the set of parameters between the input and hidden layers."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-56",
"text": "The score of an action a is defined as"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-57",
"text": "where \u03b8 c,a is the set of parameters between the hidden and output layers."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-58",
"text": "We use the arc-standard features \u03a6 e as Chen and Manning (2014) , which is also based on the arc-eager templates of Zhang and Nivre (2011) , similar to those of the baseline model L."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-60",
"text": "**LINEAR MODEL WITH REAL-VALUED EMBEDDINGS (TURIAN)**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-61",
"text": "We apply the method of Turian et al. (2010) , combining real-valued embeddings with discrete features in the linear baseline (Figure 1(c) )."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-62",
"text": "Given a state x, the score of an action a is defined as"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-63",
"text": "where \u2295 is the vector concatenation operator."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-64",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-65",
"text": "**LINEAR MODEL WITH TRANSFORMED EMBEDDINGS (GUO)**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-66",
"text": "We apply the method of Guo et al. (2014) , combining embeddings into the linear baseline by first transforming into discrete values."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-67",
"text": "Given a state x, the score of an action is defined as"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-68",
"text": "where d is a transformation function from real-valued to binary features."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-69",
"text": "We use clustering of embeddings for d as it gives better performance according to Guo et al. (2014) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-70",
"text": "Following Guo et al. (2014) , we use compounded clusters learnt by the K-means algorithm at different granularities."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-72",
"text": "**DIRECTLY COMBINING LINEAR AND NEURAL FEATURES (THIS)**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-73",
"text": "We directly combine linear and neural features (Figure 1(e) )."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-74",
"text": "Given a state x, the score of an action is defined as"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-75",
"text": "where \u03a6 h is the same as the NN baseline."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-76",
"text": "Note that like d in Guo, \u03a6 h is also a function that transforms embeddings \u03a6 e ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-77",
"text": "The main difference is that it can be tuned in supervised training."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-79",
"text": "**WEB DOMAIN EXPERIMENTS**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-81",
"text": "**SETTING**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-82",
"text": "We perform experiments on the SANCL 2012 web data (Petrov and McDonald, 2012) , using the Wall Street Journal (WSJ) training corpus to train the models and the WSJ development corpus to tune parameters."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-83",
"text": "We clean the web domain texts following the method of Ma et al. (2014b) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-84",
"text": "Automatic POS tags are produced by using a CRF model trained on the WSJ training corpus."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-85",
"text": "The POS tags are assigned automatically on the training corpus by ten-fold jackknifing."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-86",
"text": "Following Chen and Manning (2014) , we use the pre-trained word embeddings released by Collobert et al. (2011) , and set h = 200 for the hidden layer size, \u03bb = 10^-8 for L2 regularization, and \u03b1 = 0.01 for the initial learning rate of AdaGrad."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-87",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-88",
"text": "**DEVELOPMENT RESULTS**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-89",
"text": "Fine-tuning of embeddings."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-90",
"text": "Chen and Manning (2014) fine-tune word embeddings in supervised training, consistent with Socher et al. (2013) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-91",
"text": "Intuitively, fine-tuning embeddings allows in-vocabulary words to join the parameter space, thereby giving better fitting to in-domain data."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-92",
"text": "However, it also forfeits the benefit of large-scale pre-training, because out-of-vocabulary (OOV) words do not have their embeddings fine-tuned."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-93",
"text": "In this sense, the method of Chen and Manning resembles a traditional supervised sparse linear model, which can be weak on OOV."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-94",
"text": "On the other hand, the semi-supervised learning methods such as Turian et al. (2010) Table 2 : Main results on SANCL."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-95",
"text": "All systems are deterministic."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-96",
"text": "parameters."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-97",
"text": "Therefore, such methods can expect better cross-domain accuracies."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-98",
"text": "We empirically compare the models NN, Turian and This by fine-tuning (+T) and not fine-tuning (-T) word embeddings, and the results are shown in Figure 2 ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-99",
"text": "As expected, the baseline NN model gives better accuracies on WSJ with fine-tuning, but worse cross-domain accuracies."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-100",
"text": "Interestingly, our combined model gives consistently better accuracies with fine-tuning."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-101",
"text": "We attribute this to the use of sparse discrete features, which allows the model to benefit from large-scale pre-trained embeddings without sacrificing in-domain performance."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-102",
"text": "The observation on Turian is similar."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-103",
"text": "For the final experiments, we apply fine-tuning to the NN model, but not to Turian or This."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-104",
"text": "Note also that for all experiments, the POS and label embedding features of Chen and Manning (2014) are fine-tuned, consistent with their original method."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-105",
"text": "Dropout rate."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-106",
"text": "We test the effect of dropout (Hinton et al., 2012) during training, using a default ratio of 0.5 according to Chen and Manning (2014) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-107",
"text": "In our experiments, we find that the dense NN model and our combined model achieve better performances by using dropout, but the other models do not benefit from dropout."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-108",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-109",
"text": "**FINAL RESULTS**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-110",
"text": "The final results across web domains are shown in Table 2 ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-111",
"text": "Our logistic regression linear parser and re-implementation of Chen and Manning (2014) give comparable accuracies to the perceptron ZPar 2 and Stanford NN Parser 3 , respectively."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-112",
"text": "It can be seen from the table that both Turian and Guo 4 outperform L by incorporating embedding features."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-113",
"text": "Guo gives overall higher improvements, consistent with the observation of Guo et al. (2014) on NER."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-114",
"text": "Our method gives significantly 5 better results compared with Turian and Guo, thanks to the extra hidden layer."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-115",
"text": "Our OOV performance is higher than NN, because the embeddings of OOV words are not tuned, and hence the model can handle them effectively."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-116",
"text": "Interestingly, NN gives higher accuracies on web domain out-of-embedding-vocabulary (OOE) words, out of which 54% are in-vocabulary."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-117",
"text": "Note that the accuracies of our parsers are lower than the best systems in the SANCL shared task, which use ensemble models."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-118",
"text": "Our parser enjoys the fast speed of deterministic parsers, and in particular the baseline NN parser (Chen and Manning, 2014) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-120",
"text": "**WSJ EXPERIMENTS**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-121",
"text": "For comparison with related work, we also conduct experiments on the Penn Treebank corpus."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-122",
"text": "We use the WSJ sections 2-21 for training, section 22 for development and section 23 for testing."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-123",
"text": "WSJ constituent trees are converted to dependency trees using Penn2Malt 6 ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-124",
"text": "We use auto POS tags consistent with previous work."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-125",
"text": "The ZPar POS-tagger is used to assign POS tags."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-126",
"text": "Ten-fold jackknifing is performed on the training data to assign POS automatically."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-127",
"text": "For this set of experiments, the parser hyper-parameters are taken directly from the best settings in the Web Domain experiments."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-128",
"text": "The results are shown in Table 3 , together with some state-of-the-art deterministic parsers."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-129",
"text": "Comparing the L, NN and This models, the observations are consistent with the web domain."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-130",
"text": "Table 3 : Main results on WSJ. (Rows shown include Honnibal et al. (2013) at 91.30 / 90.00 and Ma et al. (2014a) at 91.32 / -.)"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-131",
"text": "All systems are deterministic."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-132",
"text": "Our combined parser gives accuracies competitive to state-of-the-art deterministic parsers in the literature."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-133",
"text": "In particular, the method of Chen and Manning (2014) is the same as our NN baseline."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-134",
"text": "Note that Zhou et al. (2015) reports a UAS of 91.47% by this parser, which is higher than the results we obtained."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-135",
"text": "The main difference is the batch size used during training: while Zhou et al. (2015) used a batch size of 100,000, we used a batch size of 10,000 in all experiments."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-136",
"text": "Honnibal et al. (2013) apply a dynamic oracle to deterministic transition-based parsing, giving a UAS of 91.30%."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-137",
"text": "Ma et al. (2014a) is similar to ZPar-local, except that they use the arc-standard transitions, while ZPar-local is based on arc-eager transitions."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-138",
"text": "Ma et al. (2014a) uses a special method to process punctuation, leading to about 1% UAS improvement over the vanilla system."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-139",
"text": "Recently, Dyer et al. (2015) proposed a deterministic transition-based parser using LSTM, which gives a UAS of 93.1% on Stanford conversion of the Penn Treebank."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-140",
"text": "Their work shows that more sophisticated neural network structures with long term memories can significantly improve the accuracy over local classifiers."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-141",
"text": "Their work is orthogonal to ours."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-142",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-143",
"text": "**RELATED WORK**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-144",
"text": "As discussed in the introduction, our work is related to previous work on integrating word embeddings into discrete models (Turian et al., 2010; Yu et al., 2013; Guo et al., 2014) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-145",
"text": "Along this line, there has also been work that uses a neural network to automatically vectorize the structures of a sentence, and then taking the resulting vector as features in a linear NLP model (Socher et al., 2012; Tang et al., 2014; Yu et al., 2015) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-146",
"text": "Our results show that the use of a hidden neural layer gives superior results compared with both direct integration and integration via a hard-coded transformation function (e.g., binarization or clustering)."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-147",
"text": "There has been recent work integrating continuous and discrete features for the task of POS tagging (Ma et al., 2014b; Tsuboi, 2014) ."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-148",
"text": "Both models have essentially the same structure as our model."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-149",
"text": "In contrast to their work, we systematically compare various ways to integrate discrete and continuous features, for the dependency parsing task."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-150",
"text": "Our model is also different from Ma et al. (2014b) in the hidden layer."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-151",
"text": "While they use a form of restricted Boltzmann machine to pre-train the embeddings and hidden layer from large-scale ngrams, we fully rely on supervised learning to train complex feature combinations."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-152",
"text": "Wang and Manning (2013) consider integrating embeddings and discrete features into a neural CRF."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-153",
"text": "They show that combined neural and discrete features work better without a hidden layer (i.e. Turian et al. (2010) )."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-154",
"text": "They argue that nonlinear structures do not work well with high-dimensional features."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-155",
"text": "We find that using a hidden layer specifically for embedding features gives better results compared with using no hidden layers."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-156",
"text": "----------------------------------"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-157",
"text": "**CONCLUSION**"
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-158",
"text": "We studied the combination of discrete and continuous features for deterministic transition-based dependency parsing, comparing several methods to incorporate word embeddings and traditional sparse features in the same model."
},
{
"sent_id": "14fa8c3b947667244d30dd30dae89a-C001-159",
"text": "Experiments on both in-domain and cross-domain parsing show that directly adding sparse features into a neural network gives higher accuracies compared with all previous methods to incorporate word embeddings into a traditional sparse linear model."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"14fa8c3b947667244d30dd30dae89a-C001-13"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-90"
]
],
"cite_sentences": [
"14fa8c3b947667244d30dd30dae89a-C001-13",
"14fa8c3b947667244d30dd30dae89a-C001-90"
]
},
"@USE@": {
"gold_contexts": [
[
"14fa8c3b947667244d30dd30dae89a-C001-32"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-40"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-50"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-58"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-86"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-104"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-106"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-111"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-133"
]
],
"cite_sentences": [
"14fa8c3b947667244d30dd30dae89a-C001-32",
"14fa8c3b947667244d30dd30dae89a-C001-40",
"14fa8c3b947667244d30dd30dae89a-C001-50",
"14fa8c3b947667244d30dd30dae89a-C001-58",
"14fa8c3b947667244d30dd30dae89a-C001-86",
"14fa8c3b947667244d30dd30dae89a-C001-104",
"14fa8c3b947667244d30dd30dae89a-C001-106",
"14fa8c3b947667244d30dd30dae89a-C001-111",
"14fa8c3b947667244d30dd30dae89a-C001-133"
]
},
"@SIM@": {
"gold_contexts": [
[
"14fa8c3b947667244d30dd30dae89a-C001-111"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-118"
],
[
"14fa8c3b947667244d30dd30dae89a-C001-132",
"14fa8c3b947667244d30dd30dae89a-C001-133"
]
],
"cite_sentences": [
"14fa8c3b947667244d30dd30dae89a-C001-111",
"14fa8c3b947667244d30dd30dae89a-C001-118",
"14fa8c3b947667244d30dd30dae89a-C001-133"
]
}
}
},
"ABC_7be8bcb17980dee5e94df9faec8183_9": {
"x": [
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-2",
"text": "Segmenting text into semantically coherent fragments improves readability of text and facilitates tasks like text summarization and passage retrieval."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-3",
"text": "In this paper, we present a novel unsupervised algorithm for linear text segmentation (TS) that exploits word embeddings and a measure of semantic relatedness of short texts to construct a semantic relatedness graph of the document."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-4",
"text": "Semantically coherent segments are then derived from maximal cliques of the relatedness graph."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-5",
"text": "The algorithm performs competitively on a standard synthetic dataset and outperforms the best-performing method on a real-world (i.e., non-artificial) dataset of political manifestos."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-8",
"text": "Despite the fact that in mainstream natural language processing (NLP) and information retrieval (IR) texts are modeled as bags of unordered words, texts are sequences of semantically coherent segments, designed (often very thoughtfully) to ease readability and understanding of the ideas conveyed by the authors."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-9",
"text": "Although authors may explicitly define coherent segments (e.g., as paragraphs), many texts, especially on the web, lack any explicit segmentation."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-10",
"text": "Linear text segmentation aims to represent texts as sequences of semantically coherent segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-11",
"text": "Besides improving readability and understandability of texts for readers, automated text segmentation is beneficial for NLP and IR tasks such as text summarization (Angheluta et al., 2002; Dias et al., 2007) and passage retrieval (Huang et al., 2003; Dias et al., 2007) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-12",
"text": "Whereas early approaches to unsupervised text segmentation measured the coherence of segments via raw term overlaps between sentences (Hearst, 1997; Choi, 2000) , more recent methods (Misra et al., 2009; Riedl and Biemann, 2012) addressed the issue of sparsity of term-based representations by replacing term-vectors with vectors of latent topics."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-13",
"text": "A topical representation of text is, however, merely a vague approximation of its meaning."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-14",
"text": "Considering that the goal of TS is to identify semantically coherent segments, we propose a TS algorithm aiming to directly capture the semantic relatedness between segments, instead of approximating it via topical similarity."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-15",
"text": "We employ word embeddings (Mikolov et al., 2013 ) and a measure of semantic relatedness of short texts (\u0160ari\u0107 et al., 2012) to construct a relatedness graph of the text in which nodes denote sentences and edges are added between semantically related sentences."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-16",
"text": "We then derive segments using the maximal cliques of such similarity graphs."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-17",
"text": "The proposed algorithm displays competitive performance on the artifically-generated benchmark TS dataset (Choi, 2000) and, more importantly, outperforms the best-performing topic modeling-based TS method on a real-world dataset of political manifestos."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-18",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-19",
"text": "**RELATED WORK**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-20",
"text": "Automated text segmentation received a lot of attention in NLP and IR communities due to its usefulness for text summarization and text indexing."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-21",
"text": "Text segmentation can be performed in two different ways, namely (1) with the goal of obtaining linear segmentations (i.e. detecting the sequence of different segments in a text) , or (2) in order to obtain hierarchical segmentations (i.e. defining a structure of subtopics between the detected segments)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-22",
"text": "Like the majority of TS methods (Hearst, 1994; Brants et al., 2002; Misra et al., 2009; Riedl and Biemann, 2012) , in this work we focus on linear segmentation of text, but there is also a solid body of work on hierarchical TS, where each toplevel segment is further broken down (Yaari, 1997; Eisenstein, 2009 )."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-23",
"text": "Hearst (1994) introduced TextTiling, one of the first unsupervised algorithms for linear text segmentation."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-24",
"text": "She exploits the fact that words tend to be repeated in coherent segments and measures the similarity between paragraphs by comparing their sparse term-vectors."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-25",
"text": "Choi (2000) introduced a probabilistic algorithm using matrix-based ranking and clustering to determine similarities between segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-26",
"text": "Galley et al. (2003) combined content-based information with acoustic cues in order to detect discourse shifts, whereas Utiyama and Isahara (2001) and Fragkou et al. (2004) minimized different segmentation cost functions with dynamic programming."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-27",
"text": "The first segmentation approach based on topic modeling (Brants et al., 2002) employed the probabilistic latent semantic analysis (pLSA) to derive latent representations of segments and determined the segmentation based on similarities of segments' latent vectors."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-28",
"text": "More recent models (Misra et al., 2009; Riedl and Biemann, 2012) employed the latent Dirichlet allocation (LDA) (Blei et al., 2003) to compute the latent topics and displayed superior performance to previous models on standard synthetic datasets (Choi, 2000; Galley et al., 2003) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-29",
"text": "Misra et al. (2009) used dynamic programming to find globally optimal segmentation over the set of LDA-based segment representations, whereas Riedl and Biemann (2012) introduced TopicTiling, an LDA-driven extension of Hearst's TextTiling algorithm where segments are, represented as dense vectors of dominant topics of terms they contain (instead of as sparse term vectors)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-30",
"text": "Riedl and Biemann (2012) show that TopicTiling outperforms at-that-time state-of-the-art methods for unsupervised linear segmentation (Choi, 2000; Utiyama and Isahara, 2001; Galley et al., 2003; Fragkou et al., 2004; Misra et al., 2009 ) and that it is also faster than other LDA-based methods (Misra et al., 2009 )."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-31",
"text": "In the most closely related work to ours, Malioutov and Barzilay (2006) proposed a graph-based TS approach in which they first construct the fully connected graph of sentences, with edges weighted via the cosine similarity between bag-of-words sentence vectors, and then run the minimum normalized multiway cut algorithm to obtain the segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-32",
"text": "Similarly, Ferret (2007) builds the similarity graph, only between words instead of between sentences, using sparse co-occurrence vectors as semantic representations for words."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-33",
"text": "He then identifies topics by clustering the word similarity graph via the Shared Nearest Neighbor algorithm (Ert\u00f6z et al., 2004) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-34",
"text": "Unlike these works, we use the dense semantic representations of words and sentences (i.e., embeddings), which have been shown to outperform sparse semantic vectors on a range of NLP tasks."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-35",
"text": "Also, instead of looking for minimal cuts in the relatedness graph, we exploit the maximal cliques of the relatedness graph between sentences to obtain the topic segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-36",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-37",
"text": "**TEXT SEGMENTATION ALGORITHM**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-38",
"text": "Our TS algorithm, dubbed GRAPHSEG, builds a semantic relatedness graph in which nodes denote sentences and edges are created for pairs of semantically related sentences."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-39",
"text": "We then determine the coherent segments by finding maximal cliques of the relatedness graph."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-40",
"text": "The novelty of GRAPHSEG is in the fact that it directly exploits the semantics of text instead of approximating the meaning with topicality."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-41",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-42",
"text": "**SEMANTIC RELATEDNESS OF SENTENCES**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-43",
"text": "The measure of semantic relatedness between sentences we use is an extension of a salient greedy lemma alignment feature proposed in a supervised model by \u0160ari\u0107 et al. (2012) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-44",
"text": "They greedily align content words between sentences by the similarity of their distributional vectors and then sum the similarity scores of aligned word pairs."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-45",
"text": "However, such greedily obtained alignment is not necessarily optimal."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-46",
"text": "In contrast, we compute the optimal alignment by (1) creating a weighted complete bipartite graph between the sets of content words of the two sentences (i.e., each word from one sentence is connected with a relatedness edge to all of the words in the other sentence) and (2) running a bipartite graph matching algorithm known as the Hungarian method (Kuhn, 1955), which has polynomial complexity."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-47",
"text": "The similarities of content words between sentences (i.e., the weights of the bipartite graph) are computed as the cosine of the angle between their corresponding embedding vectors (Mikolov et al., 2013) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-48",
"text": "Let A be the set of word pairs in the optimal alignment between the content-word sets of the two sentences S_1 and S_2, i.e., A = {(w_1, w_2) | w_1 \u2208 S_1 \u2227 w_2 \u2208 S_2}."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-49",
"text": "We then compute the semantic relatedness for two given sentences S 1 and S 2 as follows:"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-50",
"text": "where v_i is the embedding vector of the word w_i and ic(w) is the information content (IC) of the word w, computed based on the relative frequency of w in some large corpus C:"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-51",
"text": "We utilize the IC weighting of embedding similarity because we assume that matches between less frequent words (e.g., guitar and ukulele) contribute more to sentence relatedness than pairs of similar but frequent words (e.g., do and make)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-52",
"text": "We used Google Books Ngrams (Michel et al., 2011) as a large corpus C for estimating relative frequencies of words in a language."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-53",
"text": "Because there will be more aligned pairs between longer sentences, the relatedness score will be larger for longer sentences merely because of their length (regardless of their actual similarity)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-54",
"text": "Thus, we normalize the sr(S_1, S_2) score first with the length of S_1 and then with the length of S_2, and we finally average these two normalized scores:"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-56",
"text": "**GRAPH-BASED SEGMENTATION**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-57",
"text": "All sentences in a text become nodes of the relatedness graph G. We then compute the semantic similarity, as described in the previous subsection, between all pairs of sentences in a given document."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-58",
"text": "For each pair of sentences for which the semantic relatedness is above some threshold value \u03c4, we add an edge between the corresponding nodes of G. Next, we employ the Bron-Kerbosch algorithm (Bron and Kerbosch, 1973) to compute the set Q of all maximal cliques of G. We then create the initial set of segments SG by merging adjacent sentences found in at least one maximal clique Q \u2208 Q of graph G. Next, we merge the adjacent segments sg_i and sg_{i+1} for which there is at least one clique Q \u2208 Q containing at least one sentence from sg_i and one sentence from sg_{i+1}."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-59",
"text": "Finally, given the"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-60",
"text": "Cliques Q: {1, 2, 6}, {2, 4, 7}, {3, 4, 5}, {1, 8, 9}."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-61",
"text": "Initial segments: {1, 2}, {3, 4, 5}, {6}, {7}, {8, 9}."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-62",
"text": "Merged segments: {1, 2, 3, 4, 5}, {6}, {7}, {8, 9}; merged small segments: {1, 2, 3, 4, 5}, {6, 7}, {8, 9}."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-63",
"text": "Table 1 : Creating segments from graph cliques (n = 2)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-64",
"text": "In the third step we merge segments {1, 2} and {3, 4, 5} because the second clique contains sentence 2 (from the left segment) and sentence 4 (from the right segment)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-65",
"text": "In the final step we merge single-sentence segments (assuming sgr({1, 2, 3, 4, 5}, {6}) < sgr({6}, {7}) and sgr({7}, {8, 9}) < sgr({6}, {7}))."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-66",
"text": "minimal segment size n, we merge segments sg_i with fewer than n sentences with the semantically more related of the two adjacent segments, sg_{i-1} or sg_{i+1}."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-67",
"text": "The relatedness between two adjacent segments, sgr(sg_i, sg_{i+1}), is computed as the average relatedness between their respective sentences:"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-68",
"text": "We exemplify the creation of segments from maximal cliques in Table 1 ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-69",
"text": "The complete segmentation algorithm is fleshed out in Algorithm 1. 1"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-71",
"text": "**EVALUATION**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-72",
"text": "In this section, we first introduce the two evaluation datasets that we use: the commonly used synthetic dataset and a realistic dataset of political manifestos."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-73",
"text": "We then present the experimental setting, and finally describe and discuss the results achieved by our GRAPHSEG algorithm and how it compares to other TS models."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-75",
"text": "**DATASETS**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-76",
"text": "Unsupervised methods for text segmentation have most often been evaluated on synthetic datasets with segments from different sources being concatenated in artificial documents (Choi, 2000; Galley et al., 2003) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-77",
"text": "Segmenting such artificial texts is easier than segmenting real-world documents."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-78",
"text": "This is why besides on the artificial Choi dataset we also evaluate GRAPHSEG on a real-world dataset of political texts from the Manifesto Project, 2,3 manually"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-79",
"text": "labeled by domain experts with segments of seven different topics (e.g., economy and welfare, quality of life, foreign affairs)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-80",
"text": "The selected manifestos contain between 1000 and 2500 sentences, with segments ranging in length from 1 to 78 sentences, which is in sharp contrast to the Choi dataset where all segments are of similar size."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-81",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-82",
"text": "**EXPERIMENTAL SETTING**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-83",
"text": "To allow for comparison with previous work, we evaluate GRAPHSEG on four subsets of the Choi dataset, differing in the number of sentences the segments contain."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-84",
"text": "For the evaluation on the Choi dataset, the GRAPHSEG algorithm made use of the publicly available word embeddings built from a Google News dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-85",
"text": "4 Both LDA-based models (Misra et al., 2009; Riedl and Biemann, 2012) and GRAPHSEG rely on corpus-derived word representations."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-86",
"text": "Thus, we evaluated on the Manifesto dataset both the domain-adapted and domain-unadapted variants of these methods."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-87",
"text": "The domain-adapted variants of the models used the unlabeled domain corpus -a test set of 466 unlabeled political manifestos -to train the domain-specific word representations."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-88",
"text": "This means that we obtain (1) in-domain topics for the LDA-based TopicTiling model of Riedl and Biemann (2012) and (2) domain-specific embeddings for the GRAPHSEG algorithm."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-89",
"text": "On the Manifesto dataset we also evaluate a baseline that randomly (50% chance) starts a new segment at points m sentences apart, with m being set to half of the average length of gold segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-90",
"text": "We evaluate the performance using two standard TS evaluation metrics: P_k (Beeferman et al., 1999) and WindowDiff (WD) (Pevzner and Hearst, 2002) ."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-91",
"text": "P_k is the probability that two randomly drawn sentences, k sentences apart, are classified incorrectly: either as belonging to the same segment when they are in different gold segments, or as being in different segments when they are in the same gold segment."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-92",
"text": "Following Riedl and Biemann (2012) , we set k to half of the document length divided by the number of gold segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-93",
"text": "WindowDiff is a stricter version of P k as, instead of only checking if the randomly chosen sentences are in the same predicted segment or not, it compares the exact number of segments between the sentences in the predicted segmentation with the number of segments in between the same sentences in the gold standard."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-94",
"text": "Lower scores indicate better performance for both these metrics."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-95",
"text": "The GRAPHSEG algorithm has two parameters: (1) the sentence similarity threshold \u03c4, which is used when creating edges of the sentence relatedness graph, and (2) the minimal segment size n, which we use to merge adjacent segments that are too small."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-96",
"text": "In all experiments we use grid-search in a folded cross-validation setting to jointly optimize both parameters."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-97",
"text": "In view of comparison with other models, the parameter optimization is justified"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-98",
"text": "because other models, e.g., TopicTiling (Riedl and Biemann, 2012), also have parameters (e.g., the number of topics for the topic model) that are optimized using cross-validation."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-100",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-101",
"text": "In Table 2 we report the performance of GRAPHSEG and prominent TS methods on the synthetic Choi dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-102",
"text": "GRAPHSEG performs competitively, outperforming all methods except that of Fragkou et al. (2004) and the domain-adapted versions of the LDA-based models (Misra et al., 2009; Riedl and Biemann, 2012)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-103",
"text": "However, the approach of Fragkou et al. (2004) uses gold-standard information, namely the average gold segment size, as input."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-104",
"text": "On the other hand, the LDA-based models adapt their topic models on parts of the Choi dataset itself."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-105",
"text": "Although they use different documents for training the topic models from those used for evaluating segmentation quality, the evaluation is still tainted because snippets from the original documents appear in multiple artificial documents, some of which belong to the training set and others to the test set, as acknowledged by Riedl and Biemann (2012). This is why their reported performance on this dataset is overestimated."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-106",
"text": "In Table 3 we report the results on the Manifesto dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-107",
"text": "Results of both TopicTiling and GRAPHSEG indicate that the realistic Manifesto dataset is much more difficult to segment than the artificial Choi dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-108",
"text": "The GRAPHSEG algorithm significantly outperforms the TopicTiling method (p < 0.05, Student's t-test)."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-109",
"text": "In-domain training of word representations (topics for TopicTiling and word embeddings for GRAPHSEG) does not significantly improve the performance of either model."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-110",
"text": "This result contrasts with previous findings (Misra et al., 2009; Riedl and Biemann, 2012) in which the performance boost was credited to in-domain trained topics, and it supports our hypothesis that the performance gains of the LDA-based methods with in-domain trained topics originate from information leakage between different portions of the synthetic Choi dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-112",
"text": "**CONCLUSION**"
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-113",
"text": "In this work we presented GRAPHSEG, a novel graph-based algorithm for unsupervised text segmentation."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-114",
"text": "GRAPHSEG employs word embeddings and extends a measure of semantic relatedness to construct a relatedness graph with edges established between semantically related sentences."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-115",
"text": "The segmentation is then determined by the maximal cliques of the relatedness graph and improved by semantic comparison of adjacent segments."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-116",
"text": "GRAPHSEG displays competitive performance compared to best-performing LDA-based methods on a synthetic dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-117",
"text": "However, we identify and discuss evaluation issues pertaining to LDA-based TS on this dataset."
},
{
"sent_id": "7be8bcb17980dee5e94df9faec8183-C001-118",
"text": "We also performed an evaluation on the real-world dataset of political manifestos and showed that in a realistic setting GRAPHSEG significantly outperforms the state-of-the-art LDA-based TS model."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"7be8bcb17980dee5e94df9faec8183-C001-12"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-28"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-29"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-30"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-85"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-105"
]
],
"cite_sentences": [
"7be8bcb17980dee5e94df9faec8183-C001-12",
"7be8bcb17980dee5e94df9faec8183-C001-28",
"7be8bcb17980dee5e94df9faec8183-C001-29",
"7be8bcb17980dee5e94df9faec8183-C001-30",
"7be8bcb17980dee5e94df9faec8183-C001-85",
"7be8bcb17980dee5e94df9faec8183-C001-105"
]
},
"@SIM@": {
"gold_contexts": [
[
"7be8bcb17980dee5e94df9faec8183-C001-22"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-97",
"7be8bcb17980dee5e94df9faec8183-C001-98"
]
],
"cite_sentences": [
"7be8bcb17980dee5e94df9faec8183-C001-22",
"7be8bcb17980dee5e94df9faec8183-C001-98"
]
},
"@USE@": {
"gold_contexts": [
[
"7be8bcb17980dee5e94df9faec8183-C001-88"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-92"
]
],
"cite_sentences": [
"7be8bcb17980dee5e94df9faec8183-C001-88",
"7be8bcb17980dee5e94df9faec8183-C001-92"
]
},
"@DIF@": {
"gold_contexts": [
[
"7be8bcb17980dee5e94df9faec8183-C001-102"
],
[
"7be8bcb17980dee5e94df9faec8183-C001-110"
]
],
"cite_sentences": [
"7be8bcb17980dee5e94df9faec8183-C001-102",
"7be8bcb17980dee5e94df9faec8183-C001-110"
]
}
}
},
"ABC_8e738a8f52e5931a92c9e4577a1ad3_9": {
"x": [
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-120",
"text": "**INTEGRATING SIMPLIFICATION RANKING**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-171",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-172",
"text": "**NUMBER OF USERS**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-2",
"text": "We propose a new dataset for evaluating a Japanese lexical simplification method."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-3",
"text": "Previous datasets have several deficiencies."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-4",
"text": "All of them substitute only a single target word, and some of them extract sentences only from a newswire corpus."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-5",
"text": "In addition, most of these datasets do not allow ties and integrate simplification ranking from all the annotators without considering the quality."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-6",
"text": "In contrast, our dataset has the following advantages: (1) it is the first controlled and balanced dataset for Japanese lexical simplification with high correlation with human judgment and (2) the consistency of the simplification ranking is improved by allowing candidates to have ties and by considering the reliability of annotators."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-9",
"text": "Lexical simplification is the task to find and substitute a complex word or phrase in a sentence with its simpler synonymous expression."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-10",
"text": "We define complex word as a word that has lexical and subjective difficulty in a sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-11",
"text": "It can help in reading comprehension for children and language learners (De Belder and Moens, 2010) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-12",
"text": "This task, in which a pair of complex and simple expressions is prepared in advance, is rather easier than the challenging task of changing the substitute pair depending on the given context (Specia et al., 2012; Kajiwara and Yamamoto, 2015)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-13",
"text": "Construction of a benchmark dataset is important to ensure the reliability and reproducibility of evaluation."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-14",
"text": "However, few resources are available for the automatic evaluation of lexical simplification."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-15",
"text": "Specia et al. (2012) and De Belder and Moens (2010) created benchmark datasets for evaluating English lexical simplification."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-16",
"text": "In addition, Horn et al. (2014) extracted simplification candidates and constructed an evaluation dataset using English Wikipedia and Simple English Wikipedia."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-17",
"text": "In contrast, such a parallel corpus does not exist in Japanese."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-18",
"text": "Kajiwara and Yamamoto (2015) constructed an evaluation dataset for Japanese lexical simplification, one of the few such datasets in languages other than English."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-19",
"text": "However, there are four drawbacks in the dataset of Kajiwara and Yamamoto (2015) : (1) they extracted sentences only from a newswire corpus; (2) they substituted only a single target word; (3) they did not allow ties; and (4) they did not integrate simplification ranking considering the quality."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-20",
"text": "Hence, we propose a new dataset addressing the problems in the dataset of Kajiwara and Yamamoto (2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-21",
"text": "The main contributions of our study are as follows:"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-22",
"text": "\u2022 It is the first controlled and balanced dataset for Japanese lexical simplification."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-23",
"text": "We extract sentences from a balanced corpus and control sentences to have only one complex word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-24",
"text": "Experimental results show that our dataset is more suitable than previous datasets for evaluating systems with respect to correlation with human judgment."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-25",
"text": "\u2022 The consistency of simplification ranking is greatly improved by allowing candidates to have ties and by considering the reliability of annotators."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-26",
"text": "Our dataset is available at GitHub 2 ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-28",
"text": "**RELATED WORK**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-29",
"text": "[Figure 1: A part of the dataset of Kajiwara and Yamamoto (2015).] The evaluation dataset for the English Lexical Simplification task (Specia et al., 2012) was annotated"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-30",
"text": "on top of the evaluation dataset for English lexical substitution (McCarthy and Navigli, 2007)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-31",
"text": "They asked university students to rerank substitutes according to simplification ranking."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-32",
"text": "Sentences in their dataset do not always contain complex words, and it is not appropriate to evaluate simplification systems if a test sentence does not include any complex words."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-33",
"text": "In addition, De Belder and Moens (2012) built an evaluation dataset for English lexical simplification based on that developed by McCarthy and Navigli (2007) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-34",
"text": "They used Amazon's Mechanical Turk to rank substitutes and employed the reliability of annotators to remove outlier annotators and/or downweight unreliable annotators."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-35",
"text": "The reliability was calculated using penalty-based agreement (McCarthy and Navigli, 2007) and Fleiss' Kappa."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-36",
"text": "Unlike the dataset of Specia et al. (2012) , sentences in their dataset contain at least one complex word, but they might contain more than one complex word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-37",
"text": "Again, it is not adequate for the automatic evaluation of lexical simplification because the human ranking of the resulting simplification might be affected by the context containing complex words."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-38",
"text": "Furthermore, De Belder and Moens' (2012) dataset is too small to be used for achieving a reliable evaluation of lexical simplification systems."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-39",
"text": "**PROBLEMS IN PREVIOUS DATASETS FOR JAPANESE LEXICAL SIMPLIFICATION** Kajiwara and Yamamoto (2015) followed Specia et al. (2012) to construct an evaluation dataset for Japanese lexical simplification."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-40",
"text": "Namely, they split the data creation process into two steps: substitute extraction and simplification ranking."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-41",
"text": "During the substitute extraction task, they collected substitutes of each target word in 10 different contexts."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-42",
"text": "These contexts were randomly selected from a newswire corpus."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-43",
"text": "The target word was a content word (noun, verb, adjective, or adverb), and was neither a simple word nor part of any compound word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-44",
"text": "They gathered substitutes from five annotators using crowdsourcing."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-45",
"text": "These procedures were the same as for De Belder and Moens (2012) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-46",
"text": "During the simplification ranking task, annotators were asked to reorder the target word and its substitutes in a single order without allowing ties."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-47",
"text": "They used crowdsourcing to find five annotators different from those who performed the substitute extraction task."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-48",
"text": "Simplification ranking was integrated on the basis of the average of the simplification ranking from each annotator to generate a gold-standard ranking that might include ties."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-49",
"text": "During the substitute extraction task, agreement among the annotators was 0.664, whereas during the simplification ranking task, Spearman's rank correlation coefficient score was 0.332."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-50",
"text": "Spearman's score of this work was lower than that of Specia et al. (2012) by 0.064."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-51",
"text": "Thus, there was substantial disagreement among annotators, and the simplification ranking collected using crowdsourcing tended to be of lower quality."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-52",
"text": "Figure 1 shows a part of the dataset of Kajiwara and Yamamoto (2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-53",
"text": "Our discussion in this paper is based on this example."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-54",
"text": "The domain of the dataset is limited."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-55",
"text": "Because Kajiwara and Yamamoto (2015) extracted sentences from a newswire corpus, their dataset has a poor variety of expression."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-56",
"text": "English lexical simplification datasets (Specia et al., 2012; De Belder and Moens, 2012) do not have this problem because both of them use a balanced corpus of English (Sharoff, 2006) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-57",
"text": "Complex words might exist in context."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-58",
"text": "In Figure 1, even when a target word such as \" (feel exalted)\" is simplified, another complex word, \" (skill)\", is left in the sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-59",
"text": "Lexical simplification is a task of simplifying complex words in a sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-60",
"text": "Previous datasets may include multiple complex words in a sentence but target only one complex word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-61",
"text": "Not only the target word but also other complex words should be considered; however, annotating substitutes and simplification rankings for all complex words in a sentence produces a huge number of patterns and therefore incurs a very high annotation cost."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-62",
"text": "For example, when a sentence contains three complex words with 10 substitutes each, annotators would need to consider 10^3 patterns."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-63",
"text": "Thus, it is desirable that a sentence contain only simple words after the target word is substituted."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-64",
"text": "Therefore, in this work, we extract sentences containing only one complex word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-65",
"text": "Ties are not permitted in simplification ranking."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-66",
"text": "When each annotator assigns a simplification ranking to a substitution list, a tie cannot be assigned in previous datasets (Specia et al., 2012; Kajiwara and Yamamoto, 2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-67",
"text": "This deteriorates ranking consistency if some substitutes have a similar simplicity."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-68",
"text": "De Belder and Moens (2012) allow ties in simplification ranking and report considerably higher agreement among annotators than Specia et al. (2012) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-69",
"text": "The method of ranking integration is na\u00efve."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-70",
"text": "Kajiwara and Yamamoto (2015) and Specia et al. (2012) use an average score to integrate rankings, but it might be biased by outliers."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-71",
"text": "De Belder and Moens (2012) report a slight increase in agreement by greedily removing annotators to maximize the agreement score."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-72",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-73",
"text": "**BALANCED DATASET FOR EVALUATION OF JAPANESE LEXICAL SIMPLIFICATION**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-74",
"text": "We create a balanced dataset for the evaluation of Japanese lexical simplification."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-75",
"text": "Figure 2 illustrates how we constructed the dataset."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-76",
"text": "It follows the data creation procedure of Kajiwara and Yamamoto's (2015) dataset with improvements to resolve the problems described in Section 3."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-77",
"text": "We use a crowdsourcing application, Lancers (http://www.lancers.jp/), [Figure 3: Example of annotation for extracting substitutes.]"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-78",
"text": "Annotators are provided with substitutes that preserve the meaning of the target word, which is shown in bold in the sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-79",
"text": "In addition, annotators can write a substitute including particles."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-80",
"text": "to perform substitute extraction, substitute evaluation, and substitute ranking."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-81",
"text": "In each task, we required annotators to have completed at least 95% of their previous assignments correctly."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-82",
"text": "They were native Japanese speakers."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-84",
"text": "**EXTRACTING SENTENCES**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-85",
"text": "Our work defines complex words as \"High Level\" words in the Lexicon for Japanese Language Education (Sunakawa et al., 2012) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-86",
"text": "The word level is calculated by five teachers of Japanese, based on their experience and intuition."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-87",
"text": "There were 7,940 high-level words out of 17,921 words in the lexicon."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-88",
"text": "In addition, the target words in this work comprised content words (nouns, verbs, adjectives, adverbs, adjectival nouns, sahen nouns, and sahen verbs)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-89",
"text": "Sentences that include a complex word were randomly extracted from the Balanced Corpus of Contemporary Written Japanese (Maekawa et al., 2010) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-90",
"text": "Sentences shorter than seven words or longer than 35 words were excluded."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-91",
"text": "We excluded target words that appeared as a part of compound words."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-92",
"text": "Following previous work, 10 contexts of occurrence were collected for each complex word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-93",
"text": "We assigned 30 complex words for each part of speech."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-94",
"text": "The total number of sentences was 2,100 (30 words \u00d7 10 sentences \u00d7 7 parts of speech)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-95",
"text": "We used a crowdsourcing application to annotate 1,800 sentences, and we asked university students majoring in computer science to annotate 300 sentences to investigate the quality of crowdsourcing."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-96",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-97",
"text": "**EXTRACTING SUBSTITUTES**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-98",
"text": "Simplification candidates were collected using crowdsourcing techniques."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-99",
"text": "For each complex word, five annotators wrote substitutes that did not change the sense of the sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-100",
"text": "[Footnote 4: http://jhlee.sakura.ne.jp/JEV.html]"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-101",
"text": "[Footnote 5: A sahen noun is a kind of noun that can form a verb by adding the generic verb \"suru (do)\" to the noun (e.g., \" repair\").]"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-102",
"text": "[Footnote 6: A sahen verb is a sahen noun that accompanies \"suru\" (e.g., \" (do repair)\").]"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-103",
"text": "Substitutions could include particles in context."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-104",
"text": "Conjugation was allowed to cover variations of both verbs and adjectives."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-105",
"text": "Figure 3 shows an example of annotation."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-106",
"text": "To improve the quality of the lexical substitution, inappropriate substitutes were deleted for later use, as described in the next subsection."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-108",
"text": "**EVALUATING SUBSTITUTES**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-109",
"text": "Five annotators selected an appropriate word to include as a substitution that did not change the sense of the sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-110",
"text": "Substitutes that won a majority were defined as correct."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-111",
"text": "Figure 4 shows an example of annotation."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-112",
"text": "Nine complex words that were evaluated as not having substitutes were excluded at this point."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-113",
"text": "As a result, 2,010 sentences were annotated, as described in the next subsection."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-115",
"text": "**RANKING SUBSTITUTES**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-116",
"text": "Five annotators arranged substitutes and complex words according to the simplification ranking."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-117",
"text": "Annotators were permitted to assign ties, but they could select at most four items to be tied, in order to prevent insincere annotators from assigning a tie to all items."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-118",
"text": "Figure 5 shows an example of annotation."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-121",
"text": "Annotators' rankings were integrated into one ranking using maximum likelihood estimation (Matsui et al., 2014) to penalize deceptive annotators, as was done by De Belder and Moens (2012)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-122",
"text": "This method estimates the reliability of annotators in addition to determining the true order of rankings."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-123",
"text": "We applied the reliability score to exclude extraordinary annotators."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-124",
"text": "Table 1 shows the characteristics of our dataset."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-125",
"text": "It is about the same size as previous work (Specia et al., 2012; Kajiwara and Yamamoto, 2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-126",
"text": "Our dataset has two advantages: (1) improved correlation with human judgment by making a controlled and balanced dataset, and (2) enhanced consistency by allowing ties in ranking and removing outlier annotators."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-127",
"text": "In the following subsections, we evaluate our dataset in detail."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-129",
"text": "**RESULT**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-131",
"text": "**INTRINSIC EVALUATION**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-132",
"text": "To evaluate the quality of the ranking integration, the Spearman rank correlation coefficient was calculated."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-133",
"text": "The baseline integration ranking used an average score (Kajiwara and Yamamoto, 2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-134",
"text": "Our proposed method excludes outlier annotators by using a reliability score calculated using the method developed by Matsui et al. (2014) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-135",
"text": "Pairwise agreement is calculated between each pair of rankings (p_1, p_2 \u2208 P) from the set of all possible pairings P (Equation 1)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-136",
"text": "The agreement among annotators from the substitute evaluation phase was 0.669, and the agreement among the students was 0.673, which is similar to the level found with crowdsourcing."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-137",
"text": "This score is almost the same as that from Kajiwara and Yamamoto (2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-138",
"text": "[Table 3: Detail of sentences and substitutes in our dataset."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-139",
"text": "BCCWJ comprises three main subcorpora: publication (P), library (L), and special-purpose (O);"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-140",
"text": "PB = book, PM = magazine, PN = newswire, LB = book, OW = white paper, OT = textbook, OP = PR paper, OB = bestselling books, OC = Yahoo! Answers, OY = Yahoo! Blogs, OL = law, OM = magazine.] [Table 2: Correlation of ranking integration. Average: baseline 0.541, outlier removal 0.580.]"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-141",
"text": "On the contrary, the Spearman rank correlation coefficient of the substitute ranking phase was 0.522."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-142",
"text": "This score is higher than that from Kajiwara and Yamamoto (2015) by 0.190."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-143",
"text": "This clearly shows the importance of allowing ties during the substitute ranking task."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-144",
"text": "Table 2 shows the results of the ranking integration."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-170",
"text": "The familiarity score is an averaged score 28 annotators with seven grades."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-145",
"text": "Our method achieved better accuracy in ranking integration than previous methods (Specia et al., 2012; Kajiwara and Yamamoto, 2015) and is similar to the results from De Belder and Moens (2012) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-146",
"text": "This shows that the reliability score can be used for improving the quality."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-147",
"text": "Table 3 shows the number of sentences and average substitutes in each genre."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-148",
"text": "In our dataset, the number of acquired substitutes is 8,636 words and the average number of substitutes is 4.30 words per sentence."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-149",
"text": "Figure 6 illustrates a part of our dataset."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-150",
"text": "Substitutes that include particles are found in 75 context (3.7%)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-151",
"text": "It is shown that if particles are not permitted in substitutes, we obtain only two substitutes (4 and 7)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-152",
"text": "By permitting substitutes to include particles, we are able to obtain 7 substitutes."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-153",
"text": "In ranking substitutes, Spearman rank correlation coefficient is 0.729, which is substantially higher than crowdsourcing's score."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-154",
"text": "Thus, it is necessary to consider annotation method."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-156",
"text": "**EXTRINSIC EVALUATION**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-157",
"text": "In this section, we evaluate our dataset using five simple lexical simplification methods."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-158",
"text": "We calcu- late 1-best accuracy in our dataset and the dataset of Kajiwara and Yamamoto (2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-159",
"text": "Annotated data is collected by our and Kajiwara and Yamamoto (2015)'s work in ranking substitutes task, and which size is 21,700 ((2010 + 2330) 5) rankings."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-160",
"text": "Then, we calculate correlation between the accuracies of annotated data and either those of Kajiwara and Yamamoto (2015) or those of our dataset."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-162",
"text": "**LEXICAL SIMPLIFICATION SYSTEMS**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-163",
"text": "We used several metrics for these experiments:"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-164",
"text": "Frequency Because it is said that a high frequent word is simple, most frequent word is selected as a simplification candidate from substitutes using uni-gram frequency of Japanese Web N-gram (Kudo and Kazawa, 2007) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-165",
"text": "This uni-gram frequency is counted from two billion sentences in Japanese Web text."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-166",
"text": "Aramaki et al. (2013) claimed that a word used by many people is simple, so we pick the word used by the most of users."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-167",
"text": "Number of Users were estimated from the Twitter corpus created by Aramaki et al. (2013) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-168",
"text": "The corpus contains 250 million tweets from 100,000 users."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-169",
"text": "Familiarity Assuming that a word which is known by many people is simple, replace a target word with substitutes according to the familiarity score using familiarity data constructed by Amano and Kondo (2000) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-173",
"text": "JEV We hypothesized a word which is low difficulty for non-native speakers is simple, so we select a word using a Japanese learner dictionary made by Sunakawa et al. (2012) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-174",
"text": "The word in dictionary has a difficulty score averaged by 5 Japanese teachers with their subjective annotation according to six grade system."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-175",
"text": "JLPT Same as above, but uses a different source called Japanese Language Proficient Test (JLPT)."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-176",
"text": "We choose the lowest level word using levels of JLPT."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-177",
"text": "These levels are a scale of one to five."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-179",
"text": "**EVALUATION**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-180",
"text": "We ranked substitutes according to the metrics, and calculated the 1-best accuracy for each target word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-181",
"text": "Finally, to compare two datasets, we used the Pearson product-moment correlation coefficient between our dataset and the dataset of Kajiwara and Yamamoto (2015) against the annotated data."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-182",
"text": "Table 4 shows the result of this experiment."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-183",
"text": "The Pearson coefficient shows that our dataset correlates with human annotation better than the dataset of Kajiwara and Yamamoto (2015) , possibly because we controlled each sentence to include only one complex word."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-184",
"text": "Because our dataset is balanced, the accuracy of Web corpus-based metrics (Frequency and Number of Users) closer than the dataset of Kajiwara and Yamamoto (2015) ."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-185",
"text": "----------------------------------"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-186",
"text": "**CONCLUSION**"
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-187",
"text": "We have presented a new controlled and balanced dataset for the evaluation of Japanese lexical simplification."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-188",
"text": "Experimental results show that (1) our dataset is more consistent than the previous datasets and (2) lexical simplification methods using our dataset correlate with human annotation better than the previous datasets."
},
{
"sent_id": "8e738a8f52e5931a92c9e4577a1ad3-C001-189",
"text": "Future work includes increasing the number of sentences, so as to leverage the dataset for machine learning-based simplification methods."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-12",
"8e738a8f52e5931a92c9e4577a1ad3-C001-9"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-29",
"8e738a8f52e5931a92c9e4577a1ad3-C001-30"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-33",
"8e738a8f52e5931a92c9e4577a1ad3-C001-36"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-39"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-55",
"8e738a8f52e5931a92c9e4577a1ad3-C001-56"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-68"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-70"
]
],
"cite_sentences": [
"8e738a8f52e5931a92c9e4577a1ad3-C001-12",
"8e738a8f52e5931a92c9e4577a1ad3-C001-29",
"8e738a8f52e5931a92c9e4577a1ad3-C001-36",
"8e738a8f52e5931a92c9e4577a1ad3-C001-39",
"8e738a8f52e5931a92c9e4577a1ad3-C001-56",
"8e738a8f52e5931a92c9e4577a1ad3-C001-68",
"8e738a8f52e5931a92c9e4577a1ad3-C001-70"
]
},
"@DIF@": {
"gold_contexts": [
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-45",
"8e738a8f52e5931a92c9e4577a1ad3-C001-50"
],
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-145"
]
],
"cite_sentences": [
"8e738a8f52e5931a92c9e4577a1ad3-C001-50",
"8e738a8f52e5931a92c9e4577a1ad3-C001-145"
]
},
"@USE@": {
"gold_contexts": [
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-64",
"8e738a8f52e5931a92c9e4577a1ad3-C001-65",
"8e738a8f52e5931a92c9e4577a1ad3-C001-66"
]
],
"cite_sentences": [
"8e738a8f52e5931a92c9e4577a1ad3-C001-66"
]
},
"@MOT@": {
"gold_contexts": [
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-70"
]
],
"cite_sentences": [
"8e738a8f52e5931a92c9e4577a1ad3-C001-70"
]
},
"@SIM@": {
"gold_contexts": [
[
"8e738a8f52e5931a92c9e4577a1ad3-C001-124",
"8e738a8f52e5931a92c9e4577a1ad3-C001-125"
]
],
"cite_sentences": [
"8e738a8f52e5931a92c9e4577a1ad3-C001-125"
]
}
}
},
"ABC_ab6f114b2ce4e62e6d8a639e8183eb_9": {
"x": [
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-2",
"text": "Pronouns are frequently omitted in pro-drop languages, such as Chinese, generally leading to significant challenges with respect to the production of complete translations."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-3",
"text": "Recently, Wang et al. (2018) proposed a novel reconstruction-based approach to alleviating dropped pronoun (DP) translation problems for neural machine translation models."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-4",
"text": "In this work, we improve the original model from two perspectives."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-5",
"text": "First, we employ a shared reconstructor to better exploit encoder and decoder representations."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-6",
"text": "Second, we jointly learn to translate and predict DPs in an end-to-end manner, to avoid the errors propagated from an external DP prediction model."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-7",
"text": "Experimental results show that our approach significantly improves both translation performance and DP prediction accuracy."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-10",
"text": "Pronouns are important in natural languages as they imply rich discourse information."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-11",
"text": "However, in pro-drop languages such as Chinese and Japanese, pronouns are frequently omitted when their referents can be pragmatically inferred from the context."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-12",
"text": "When translating sentences from a pro-drop language into a non-pro-drop language (e.g. Chinese-to-English), translation models generally fail to translate invisible dropped pronouns (DPs)."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-13",
"text": "This phenomenon leads to various translation problems in terms of completeness, syntax and even semantics of translations."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-14",
"text": "A number of approaches have been investigated for DP translation (Le Nagard and Koehn, 2010; Xiang et al., 2013; Wang et al., 2016 Wang et al., , 2018 ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-15",
"text": "Wang et al. (2018) is a pioneering work to model DP translation for neural machine trans- * Zhaopeng Tu is the corresponding author of the paper."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-16",
"text": "This work was conducted when Longyue Wang was studying and Qun Liu was working at the ADAPT Centre in the School of Computing at Dublin City University."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-17",
"text": "lation (NMT) models."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-18",
"text": "They employ two separate reconstructors to respectively reconstruct encoder and decoder representations back to the DP-annotated source sentence."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-19",
"text": "The annotation of DP is provided by an external prediction model, which is trained on the parallel corpus using automatically learned alignment information (Wang et al., 2016) ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-20",
"text": "Although this model achieved significant improvements, there nonetheless exist two drawbacks: 1) there is no interaction between the two separate reconstructors, which misses the opportunity to exploit useful relations between encoder and decoder representations; and 2) the external DP prediction model only has an accuracy of 66% in F1-score, which propagates numerous errors to the translation model."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-21",
"text": "In this work, we propose to improve the original model from two perspectives."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-22",
"text": "First, we use a shared reconstructor to read hidden states from both encoder and decoder."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-23",
"text": "Second, we integrate a DP predictor into NMT to jointly learn to translate and predict DPs."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-24",
"text": "Incorporating these as two auxiliary loss terms can guide both the encoder and decoder states to learn critical information relevant to DPs."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-25",
"text": "Experimental results on a largescale Chinese-English subtitle corpus show that the two modifications can accumulatively improve translation performance, and the best result is +1.5 BLEU points better than that reported by Wang et al. (2018) ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-26",
"text": "In addition, the jointly learned DP prediction model significantly outperforms its external counterpart by 9% in F1-score."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-27",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-28",
"text": "**BACKGROUND**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-29",
"text": "As shown in Figure 1 , Wang et al. (2018) introduced two independent reconstructors with their own parameters, which reconstruct the DPannotated source sentence from the encoder and decoder hidden states, respectively."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-30",
"text": "The central Table 1 : Evaluation of external models on predicting the positions of DPs (\"DP Position\") and the exact words of DP (\"DP Words\")."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-31",
"text": "idea underpinning their approach is to guide the corresponding hidden states to embed the recalled source-side DP information and subsequently to help the NMT model generate the missing pronouns with these enhanced hidden representations."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-32",
"text": "The DPs can be automatically annotated for training and test data using two different strategies (Wang et al., 2016) ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-33",
"text": "In the training phase, where the target sentence is available, we annotate DPs for the source sentence using alignment information."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-56",
"text": "The intuition behind this is that the interaction between two attention models can lead to a better exploitation of the encoder and decoder representations."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-34",
"text": "These annotated source sentences can be used to build a neural-based DP predictor, which can be used to annotate test sentences since the target sentence is not available during the testing phase."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-35",
"text": "As shown in Table 1 , Wang et al. (2016 Wang et al. ( , 2018 explored to predict the exact DP words 1 , the accuracy of which is only 66% in F1-score."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-36",
"text": "By analyzing the translation outputs, we found that 16.2% of errors are newly introduced and caused by errors from the DP predictor."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-37",
"text": "Fortunately, the accuracy of predicting DP positions (DPPs) is much higher, which provides the chance to alleviate the error propagation problem."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-38",
"text": "Intuitively, we can learn to generate DPs at the predicted positions using a jointly trained DP predictor, which is fed with informative representations in the reconstructor."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-39",
"text": "1 Unless otherwise indicated, in the paper, the terms \"DP\" and \"DP word\" are identical."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-41",
"text": "**APPROACH**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-43",
"text": "**SHARED RECONSTRUCTOR**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-44",
"text": "Recent work shows that NMT models can benefit from sharing a component across different tasks and languages."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-45",
"text": "Taking multi-language translation as an example, Firat et al. (2016) share an attention model across languages while Dong et al. (2015) share an encoder."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-46",
"text": "Our work is most similar to the work of Zoph and Knight (2016) and Anastasopoulos and Chiang (2018) , which share a decoder and two separate attention models to read from two different sources."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-47",
"text": "In contrast, we share information at the level of reconstructed frames."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-48",
"text": "The architectures of our proposed shared reconstruction model are shown in Figure 2 (a)."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-49",
"text": "Formally, the reconstructor reads from both the encoder and decoder hidden states, as well as the DP-annotated source sentence, and outputs a reconstruction score."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-50",
"text": "It uses two separate attention models to reconstruct the annotated source sentencex = {x 1 ,x 2 , . . . ,x T } word by word, and the reconstruction score is computed by"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-51",
"text": "Note that the weights\u03b1 enc and\u03b1 dec are calculated by two separate attention models."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-52",
"text": "We propose two attention strategies which differ as to whether the two attention models have interactions or not."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-53",
"text": "Independent Attention calculates the two weight matrices independently, as in Equation (4) and (5):"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-54",
"text": "where ATT enc (\u00b7) and ATT dec (\u00b7) are two separate attention models with their own parameters."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-55",
"text": "Interactive Attention feeds the context vector produced by one attention model to another attention model."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-57",
"text": "As the interactive attention is directional, we have two options (Equation (6) and (7)) which modify either ATT enc (\u00b7) or ATT dec (\u00b7) while leaving the other one unchanged:"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-58",
"text": "\u2022 enc\u2192dec:"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-59",
"text": "\u2022 dec\u2192enc:"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-61",
"text": "**JOINT PREDICTION OF DROPPED PRONOUNS**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-62",
"text": "Inspired by recent successes of multi-task learning (Dong et al., 2015; Luong et al., 2016) , we propose to jointly learn to translate and predict DPs (as shown in Figure 2(b) )."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-63",
"text": "To ease the learning difficulty, we leverage the information of DPPs predicted by an external model, which can achieve an accuracy of 88% in F1-score."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-64",
"text": "Accordingly, we transform the original DP prediction problem to DP word generation given the pre-predicted DP positions."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-65",
"text": "Since the DPP-annotated source sentence serves as the reconstructed input, we introduce an additional DP-generation loss, which measures how well the DP is generated from the corresponding hidden state in the reconstructor."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-66",
"text": "Let dp = {dp 1 , dp 2 , . . . , dp D } be the list of DPs in the annotated source sentence, and h rec = {h rec 1 , h rec 2 , . . . , h rec D } be the corresponding hidden states in the reconstructor."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-67",
"text": "The generation probability is computed by"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-68",
"text": "where g p (\u00b7) is softmax for the DP predictor."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-70",
"text": "**TRAINING AND TESTING**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-71",
"text": "We train both the encoder-decoder and the shared reconstructors together in a single end-to-end process, and the training objective is"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-72",
"text": "where {\u03b8, \u03b3, \u03c8} are respectively the parameters associated with the encoder-decoder, shared reconstructor and the DP prediction model."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-73",
"text": "The auxiliary reconstruction objective R(\u00b7) guides the related part of the parameter matrix \u03b8 to learn better latent representations, which are used to reconstruct the DPP-annotated source sentence."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-74",
"text": "The auxiliary prediction loss P (\u00b7) guides the related part of both the encoder-decoder and the reconstructor to learn better latent representations, which are used to predict the DPs in the source sentence."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-75",
"text": "Following Table 2 : Evaluation of translation performance for Chinese-English. \"Baseline\" is trained and evaluated on the original data, while \"Baseline (+DPs)\" and \"Baseline (+DPPs)\" are trained on the data annotated with DPs and DPPs, respectively."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-76",
"text": "Training and decoding (beam size is 10) speeds are measured in words/second."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-77",
"text": "\" \u2020\" and \" \u2021\" indicate statistically significant difference (p < 0.01) from \"Baseline (+DDPs)\" and \"Separate-Recs\u21d2(+DPs)\", respectively."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-78",
"text": "as a reranking technique to select the best translation candidate from the generated n-best list at testing time."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-79",
"text": "Different from Wang et al. (2018), we reconstruct DPP-annotated source sentence, which is predicted by an external model."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-81",
"text": "**EXPERIMENT**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-83",
"text": "**SETUP**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-84",
"text": "To compare our work with the results reported by previous work (Wang et al., 2018) , we conducted experiments on their released Chinese\u21d2English TV Subtitle corpus."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-85",
"text": "2 The training, validation, and test sets contain 2.15M, 1.09K, and 1.15K sentence pairs, respectively."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-86",
"text": "We used case-insensitive 4-gram NIST BLEU metrics (Papineni et al., 2002) for evaluation, and sign-test (Collins et al., 2005) to test for statistical significance."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-87",
"text": "We implemented our models on the code repository released by Wang et al. (2018) ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-88",
"text": "3 We used the same configurations (e.g. vocabulary size = 30K, hidden size = 1000) and reproduced their reported results."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-89",
"text": "It should be emphasized that we did not use the pre-train strategy as done in Wang et al. (2018) , since we found training from scratch achieved a better performance in the shared reconstructor setting."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-90",
"text": "2 https://github.com/longyuewangdcu/ tvsub 3 https://github.com/tuzhaopeng/nmt Table 2 shows the translation results."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-91",
"text": "It is clear that the proposed models significantly outperform the baselines in all cases, although there are considerable differences among different variations."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-93",
"text": "**RESULTS**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-94",
"text": "Baselines (Rows 1-4): The three baselines (Rows 1, 2, and 4) differ regarding the training data used."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-95",
"text": "\"Separate-Recs\u21d2(+DPs)\" (Row 3) is the best model reported in Wang et al. (2018) , which we employed as another strong baseline."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-96",
"text": "The baseline trained on the DPP-annotated data (\"Baseline (+DPPs)\", Row 4) outperforms the other two counterparts, indicating that the error propagation problem does affect the performance of translating DPs."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-97",
"text": "It suggests the necessity of jointly learning to translate and predict DPs."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-98",
"text": "Our Models (Rows 5-8): Using our shared reconstructor (Row 5) not only outperforms the corresponding baseline (Row 4), but also surpasses its separate reconstructor counterpart (Row 3)."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-99",
"text": "Introducing a joint prediction objective (Row 6) can achieve a further improvement of +0.61 BLEU points."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-100",
"text": "These results verify that shared reconstructor and jointly predicting DPs can accumulatively improve translation performance."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-101",
"text": "Among the variations of shared reconstructors (Rows 6-8), we found that an interaction attention from encoder to decoder (Row 7) achieves the best performance, which is +3.45 BLEU points better than our baseline (Row 4) and +1.45 BLEU points better than the best result reported by Wang et al. (2018) (Row 3) ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-102",
"text": "We attribute the superior performance of \"Shared-Rec enc\u2192dec \" to the fact that the attention context over encoder representations embeds useful DP information, which can help to better attend to the representations of the corresponding pronouns in the decoder side."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-103",
"text": "Similar to Wang et al. (2018) , the proposed approach improves BLEU scores at the cost of decreased training and decoding speed, which is due to the large number of newly introduced parameters resulting from the incorporation of reconstructors into the NMT model."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-104",
"text": "Table 3 : Evaluation of DP prediction accuracy. \"External\" model is separately trained on DP-annotated data with external neural methods (Wang et al., 2016) , while \"Joint\" model is jointly trained with the NMT model (Section 3.2)."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-106",
"text": "**DP PREDICTION ACCURACY**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-107",
"text": "As shown in Table 3 , the jointly learned model significantly outperforms the external one by 9% in F1-score."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-108",
"text": "We attribute this to the useful contextual information embedded in the reconstructor representations, which are used to generate the exact DP words."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-109",
"text": "Table 4 : Translation results when reconstruction is used in training only while not used in testing."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-110",
"text": "Table 4 lists translation results when the reconstruction model is used in training only."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-111",
"text": "We can see that the proposed model outperforms both the strong baseline and the best model reported in Wang et al. (2018) ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-112",
"text": "This is encouraging since no extra resources and computation are introduced to online decoding, which makes the approach highly practical, for example for translation in industry applications."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-113",
"text": "Translation performance gap (\" \") between manually (\"Man.\") and automatically (\"Auto.\") labelling DPs/DPPs for input sentences in testing."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-115",
"text": "**MODEL**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-117",
"text": "**CONTRIBUTION ANALYSIS**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-118",
"text": "Effect of DPP Labelling Accuracy For each sentence in testing, the DPs and DPPs are labelled automatically by two separate external prediction models, the accuracy of which are respectively 66% and 88% measured in F1 score."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-119",
"text": "We investigate the best performance the models can achieve with manual labelling, which can be regarded as an \"Oracle\", as shown in Table 5 ."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-120",
"text": "As seen, there still exists a significant gap in performance, and this could be improved by improving the accuracy of our DPP generator."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-121",
"text": "In addition, our models show a relatively smaller distance in performance from the oracle performance (\"Man\"), indicating that the error propagation problem is alleviated to some extent."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-122",
"text": "----------------------------------"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-123",
"text": "**CONCLUSION**"
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-124",
"text": "In this paper, we proposed effective approaches of translating DPs with NMT models: shared reconstructor and jointly learning to translate and predict DPs."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-125",
"text": "Through experiments we verified that 1) shared reconstruction is helpful to share knowledge between the encoder and decoder; and 2) joint learning of the DP prediction model indeed alleviates the error propagation problem by improving prediction accuracy."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-126",
"text": "The two approaches accumulatively improve translation performance."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-127",
"text": "The method is not restricted to the DP translation task and could potentially be applied to other sequence generation problems where additional source-side information could be incorporated."
},
{
"sent_id": "ab6f114b2ce4e62e6d8a639e8183eb-C001-128",
"text": "In future work we plan to: 1) build a fully end-to-end NMT model for DP translation, which does not depend on any external component (i.e. DPP predictor); 2) exploit cross-sentence context (Wang et al., 2017) to further improve DP translation; 3) investigate a new research strand that adapts our model in an inverse translation direction by learning to drop pronouns instead of recovering DPs."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-3"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-14"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-15"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-29"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-34",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-35"
]
],
"cite_sentences": [
"ab6f114b2ce4e62e6d8a639e8183eb-C001-3",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-14",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-15",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-29",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-35"
]
},
"@USE@": {
"gold_contexts": [
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-3",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-4"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-84"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-87"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-95"
]
],
"cite_sentences": [
"ab6f114b2ce4e62e6d8a639e8183eb-C001-3",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-84",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-87",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-95"
]
},
"@EXT@": {
"gold_contexts": [
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-3",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-4",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-5"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-21",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-22",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-23",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-24",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-25"
]
],
"cite_sentences": [
"ab6f114b2ce4e62e6d8a639e8183eb-C001-3",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-25"
]
},
"@DIF@": {
"gold_contexts": [
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-21",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-22",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-23",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-24",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-25"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-79"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-89"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-101"
],
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-111"
]
],
"cite_sentences": [
"ab6f114b2ce4e62e6d8a639e8183eb-C001-25",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-79",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-89",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-101",
"ab6f114b2ce4e62e6d8a639e8183eb-C001-111"
]
},
"@SIM@": {
"gold_contexts": [
[
"ab6f114b2ce4e62e6d8a639e8183eb-C001-103"
]
],
"cite_sentences": [
"ab6f114b2ce4e62e6d8a639e8183eb-C001-103"
]
}
}
},
"ABC_8abb7b77fd6996a905395de9693d42_9": {
"x": [
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-2",
"text": "In recent years, online social networks have allowed world-wide users to meet and discuss."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-3",
"text": "As guarantors of these communities, the administrators of these platforms must prevent users from adopting inappropriate behaviors."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-4",
"text": "This verification task, mainly done by humans, is more and more difficult due to the ever growing amount of messages to check."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-5",
"text": "Methods have been proposed to automatize this moderation process, mainly by providing approaches based on the textual content of the exchanged messages."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-6",
"text": "Recent work has also shown that characteristics derived from the structure of conversations, in the form of conversational graphs, can help detecting these abusive messages."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-7",
"text": "In this paper, we propose to take advantage of both sources of information by proposing fusion methods integrating content-and graph-based features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-8",
"text": "Our experiments on raw chat logs show that the content of the messages, but also of their dynamics within a conversation contain partially complementary information, allowing performance improvements on an abusive message classification task with a final F -measure of 93.26%."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-11",
"text": "Internet widely impacted the way we communicate."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-12",
"text": "Online communities, in particular, have grown to become important places for interpersonal communications."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-13",
"text": "They get more and more attention from companies to advertise their products or from governments interested in monitoring public discourse."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-14",
"text": "Online communities come in various shapes and forms, but they are all exposed to abusive behavior."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-15",
"text": "The definition of what exactly is considered as abuse depends on the community, but generally includes personal attacks, as well as discrimination based on race, religion or sexual orientation."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-16",
"text": "Abusive behavior is a risk, as it is likely to make important community members leave, therefore endangering the community, and even trigger legal issues in some countries."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-17",
"text": "Moderation consists in detecting users who act abusively, and in taking action against them."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-18",
"text": "Currently this moderation work is mainly a manual process, and since it implies high human and financial costs, companies have a keen interest in its automation."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-19",
"text": "One way of doing so is to consider this task as a classification problem consisting in automatically determining if a user message is abusive or not."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-20",
"text": "A number of works have tackled this problem, or related ones, in the literature."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-21",
"text": "Most of them focus only on the content of the targeted message to detect abuse or similar properties."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-22",
"text": "For instance, Spertus (1997) applies this principle to detect hostility, Dinakar et al. (2011) for cyberbullying, and Chen et al. (2012) for offensive language."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-23",
"text": "These approaches rely on a mix of standard NLP features and manually crafted application-specific resources (e.g. linguistic rules)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-24",
"text": "We also proposed a content-based method (Papegnies et al., 2017a ) using a wide array of language features (Bag-of-Words, tf -idf scores, sentiment scores)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-25",
"text": "Other approaches are more machine learning intensive, but require larger amounts of data."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-26",
"text": "Recently, Wulczyn et al. (2017) created three datasets containing individual messages collected from Wikipedia discussion pages, annotated for toxicity, personal attacks and aggression, respectively."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-27",
"text": "They have been leveraged in recent works to train Recursive Neural Network operating on word embeddings and character n-gram features (Pavlopoulos et al., 2017; Mishra et al., 2018) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-28",
"text": "However, the quality of these direct content-based approaches is very often related to the training data used to learn abuse detection models."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-29",
"text": "In the case of online social networks, the great variety of users, including very different language registers, spelling mistakes, as well as intentional users obfuscation, makes it almost impossible to have models robust enough to be applied in all cases."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-30",
"text": "(Hosseini et al., 2017) have then shown that it is very easy to bypass automatic toxic comment detection systems by making the abusive content difficult to detect (intentional spelling mistakes, uncommon negatives...)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-31",
"text": "Because the reactions of other users to an abuse case are completely beyond the abuser's control, some authors consider the content of messages occurring around the targeted message, instead of focusing only on the targeted message itself."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-52",
"text": "Representation of our processing pipeline."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-32",
"text": "For instance, (Yin et al., 2009 ) use features derived from the sentences neighboring a given message to detect harassment on the Web. (Balci and Salah, 2015) take advantage of user features such as the gender, the number of in-game friends or the number of daily logins to detect abuse in the community of an online game."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-33",
"text": "In our previous work (Papegnies et al., 2019) , we proposed a radically different method that completely ignores the textual content of the messages, and relies only on a graph-based modeling of the conversation."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-34",
"text": "This is the only graph-based approach ignoring the linguistic content proposed in the context of abusive messages detection."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-35",
"text": "Our conversational network extraction process is inspired from other works leveraging such graphs for other purposes: chat logs (Mutton, 2004) or online forums (Forestier et al., 2011) interaction modeling, user group detection (Camtepe et al., 2004) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-36",
"text": "Additional references on abusive message detection and conversational network modeling can be found in (Papegnies et al., 2019) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-37",
"text": "In this paper, based on the assumption that the interactions between users and the content of the exchanged messages convey different information, we propose a new method to perform abuse detection while leveraging both sources."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-38",
"text": "For this purpose, we take advantage of the content- (Papegnies et al., 2017b) and graph-based (Papegnies et al., 2019 ) methods that we previously developed."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-39",
"text": "We propose three different ways to combine them, and compare their performance on a corpus of chat logs originating from the community of a French multiplayer online game."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-40",
"text": "We then perform a feature study, finding the most informative ones and discussing their role."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-41",
"text": "Our contribution is twofold: the exploration of fusion methods, and more importantly the identification of discriminative features for this problem."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-42",
"text": "The rest of this article is organized as follows."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-43",
"text": "In Section 2, we describe the methods and strategies used in this work."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-44",
"text": "In Section 3 we present our dataset, the experimental setup we use for this classification task, and the performances we obtained."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-45",
"text": "Finally, we summarize our contributions in Section 4 and present some perspectives for this work."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-47",
"text": "**METHODS**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-48",
"text": "In this section, we summarize the content-based method from (Papegnies et al., 2017b ) (Section 2.1) and the graph-based method from (Papegnies et al., 2019 ) (Section 2.2)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-49",
"text": "We then present the fusion method proposed in this paper, aiming at taking advantage of both sources of information (Section 2.3)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-50",
"text": "Figure 1 shows the whole process, and is discussed through this section."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-51",
"text": "Figure 1 ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-53",
"text": "Existing methods refers to our previous work described in (Papegnies et al., 2017b ) (content-based method) and (Papegnies et al., 2019) (graph-based method), whereas the contribution presented in this article appears on the right side (fusion strategies)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-54",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-55",
"text": "**CONTENT-BASED METHOD**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-56",
"text": "This method corresponds to the bottom-left part of Figure 1 (in green)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-57",
"text": "It consists in extracting certain features from the content of each considered message, and to train a Support Vector Machine (SVM) classifier to distinguish abusive (Abuse class) and non-abusive (Non-abuse class) messages (Papegnies et al., 2017b) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-58",
"text": "These features are quite standard in Natural Language Processing (NLP), so we only describe them briefly here."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-59",
"text": "We use a number of morphological features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-60",
"text": "We use the message length, average word length, and maximal word length, all expressed in number of characters."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-61",
"text": "We count the number of unique characters in the message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-62",
"text": "We distinguish between six classes of characters (letters, digits, punctuation, spaces, and others) and compute two features for each one: number of occurrences, and proportion of characters in the message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-63",
"text": "We proceed similarly with capital letters."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-64",
"text": "Abusive messages often contain a lot of copy/paste."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-65",
"text": "To deal with such redundancy, we apply the Lempel-Ziv-Welch (LZW) compression algorithm (Batista and Meira, 2004) to the message and take the ratio of its raw to compress lengths, expressed in characters."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-66",
"text": "Abusive messages also often contain extra-long words, which can be identified by collapsing the message: extra occurrences of letters repeated more than two times consecutively are removed."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-67",
"text": "For instance, \"looooooool\" would be collapsed to \"lool\"."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-68",
"text": "We compute the difference between the raw and collapsed message lengths."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-69",
"text": "We also use language features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-70",
"text": "We count the number of words, unique words and bad words in the message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-71",
"text": "For the latter, we use a predefined list of insults and symbols considered as abusive, and we also count them in the collapsed message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-72",
"text": "We compute two overall tf -idf scores corresponding to the sums of the standard tf -idf scores of each individual word in the message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-73",
"text": "One is processed relatively to the Abuse class, and the other to the Non-abuse class."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-74",
"text": "We proceed similarly with the collapsed message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-75",
"text": "Finally, we lower-case the text and strip punctuation, in order to represent the message as a basic Bag-of-Words (BoW)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-76",
"text": "We then train a Naive Bayes classifier to detect abuse using this sparse binary vector (as represented in the very bottom part of Figure 1 )."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-77",
"text": "The output of this simple classifier is then used as an input feature for the SVM classifier."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-79",
"text": "**GRAPH-BASED METHOD**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-80",
"text": "This method corresponds to the top-left part of Figure 1 (in red)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-81",
"text": "It completely ignores the content of the messages, and only focuses on the dynamics of the conversation, based on the interactions between its participants (Papegnies et al., 2019) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-82",
"text": "It is three-stepped: 1) extracting a conversational graph based on the considered message as well as the messages preceding and/or following it; 2) computing the topological measures of this graph to characterize its structure; and 3) using these values as features to train an SVM to distinguish between abusive and non-abusive messages."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-83",
"text": "The vertices of the graph model the participants of the conversation, whereas its weighted edges represent how intensely they communicate."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-84",
"text": "The graph extraction is based on a number of concepts illustrated in Figure 2 , in which each rectangle represents a message."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-85",
"text": "The extraction process is restricted to a so-called context period, i.e. a sub-sequence of messages including the message of interest, itself called targeted message and represented in red in Figure 2 ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-86",
"text": "Each participant posting at least one message during this period is modeled by a vertex in the produced conversational graph."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-87",
"text": "A mobile window is slid over the whole period, one message at a time."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-88",
"text": "At each step, the network is updated either by creating new links, or by updating the weights of existing ones."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-89",
"text": "This sliding window has a fixed length expressed in number of messages, which is derived from ergonomic constraints relative to the online conversation platform studied in Section 3."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-90",
"text": "It allows focusing on a smaller part of the context period."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-91",
"text": "At a given time, the last message of the window (in blue in Figure 2 ) is called current message and its author current author."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-92",
"text": "The weight update method assumes that the current message is aimed at the authors of the other messages present in the window, and therefore connects the current author to them (or strengthens their weights if the edge already exists)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-93",
"text": "It also takes chronology into account by favoring the most recent authors in the window."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-94",
"text": "Three different variants of the conversational network are extracted for one given targeted message: the Before network is based on the messages posted before the targeted message, the After network on those posted after, and the Full network on the whole context period."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-95",
"text": "Figure 3 shows an example of such networks obtained for a message of the corpus described in Section 3.1."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-96",
"text": "Once the conversational networks have been extracted, they must be described through numeric values in order to feed the SVM classifier."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-97",
"text": "This is done through a selection of standard topological measures allowing to describe a graph in a number of distinct ways, focusing on different scales and scopes."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-98",
"text": "The scale denotes the nature of the characterized entity."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-99",
"text": "In this work, the individual vertex and the whole graph are considered."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-100",
"text": "When considering a single vertex, the measure focuses on the targeted author (i.e. the author of the targeted message)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-101",
"text": "The scope can be either micro-, meso-or macroscopic: it corresponds to the amount of information considered by the measure."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-102",
"text": "For instance, the graph density is microscopic, the modularity is mesoscopic, and the diameter is macroscopic."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-103",
"text": "All these measures are computed for each graph, and allow describing the conversation surrounding the message of interest."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-104",
"text": "The SVM is then trained using these values as features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-105",
"text": "In this work, we use exactly the same measures as in (Papegnies et al., 2019) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-107",
"text": "**FUSION**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-108",
"text": "We now propose a new method seeking to take advantage of both previously described ones."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-109",
"text": "It is based on the assumption that the content-and graph-based features convey different information."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-110",
"text": "Therefore, they could be complementary, and their combination could improve the classification performance."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-111",
"text": "We experiment with three different fusion strategies, which are represented in the right-hand part of Figure 1 ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-112",
"text": "The first strategy follows the principle of Early Fusion."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-113",
"text": "It consists in constituting a global feature set containing all content-and graph-based features from Sections 2.1 and 2.2, then training a SVM directly using these features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-114",
"text": "The rationale here is that the classifier has access to the whole raw data, and must determine which part is relevant to the problem at hand."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-115",
"text": "The second strategy is Late Fusion, and we proceed in two steps."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-116",
"text": "First, we apply separately both methods described in Sections 2.1 and 2.2, in order to obtain two scores corresponding to the output probability of each message to be abusive given by the content-and graph-based methods, respectively."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-117",
"text": "Second, we fetch these two scores to a third SVM, trained to determine if a message is abusive or not."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-118",
"text": "This approach relies on the assumption that these scores contain all the information the final classifier needs, and not the noise present in the raw features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-119",
"text": "Finally, the third fusion strategy can be considered as Hybrid Fusion, as it seeks to combine both previous proposed ones."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-120",
"text": "We create a feature set containing the content-and graph-based features, like with Early Fusion, but also both scores used in Late Fusion."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-121",
"text": "This whole set is used to train a new SVM."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-122",
"text": "The idea is to check whether the scores do not convey certain useful information present in the raw features, in which case combining scores and features should lead to better results."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-124",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-125",
"text": "In this section, we first describe our dataset and the experimental protocol followed in our experiments (Section 3.1)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-126",
"text": "We then present and discuss our results, in terms of classification performance (Sections 3.2) and feature selection (Section 3.3)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-128",
"text": "**EXPERIMENTAL PROTOCOL**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-129",
"text": "The dataset is the same as in our previous publications (Papegnies et al., 2017b (Papegnies et al., , 2019 ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-130",
"text": "It is a proprietary database containing 4,029,343 messages in French, exchanged on the in-game chat of SpaceOrigin 1 , a Massively Multiplayer Online Role-Playing Game (MMORPG)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-131",
"text": "Among them, 779 have been flagged as being abusive by at least one user in the game, and confirmed as such by a human moderator."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-132",
"text": "They constitute what we call the Abuse class."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-133",
"text": "Some inconsistencies in the database prevent us from retrieving the context of certain messages, which we remove from the set."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-134",
"text": "After this cleaning, the Abuse class contains 655 messages."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-135",
"text": "In order to keep a balanced dataset, we further extract the same number of messages at random from the ones that have not been flagged as abusive."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-136",
"text": "This constitutes our Non-abuse class."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-137",
"text": "Each message, whatever its class, is associated to its surrounding context (i.e. messages posted in the same thread)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-138",
"text": "The graph extraction method used to produce the graph-based features requires to set certain parameters."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-139",
"text": "We use the values matching the best performance, obtained during the greedy search of the parameter space performed in (Papegnies et al., 2019) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-140",
"text": "In particular, regarding the two most important parameters (see Section 2.2), we fix the context period size to 1,350 messages and the sliding window length to 10 messages."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-141",
"text": "Implementation-wise, we use the iGraph library (Csardi and Nepusz, 2006) to extract the conversational networks and process the corresponding features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-142",
"text": "We use the Sklearn toolkit (Pedregosa et al., 2011) to get the text-based features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-143",
"text": "We use the SVM classifier implemented in Sklearn under the name SVC (C-Support Vector Classification)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-144",
"text": "Because of the relatively small dataset, we set-up our experiments using a 10-fold cross-validation."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-145",
"text": "Each fold is balanced between the Abuse and Non-abuse classes, 70% of the dataset being used for training and 30% for testing."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-146",
"text": "Table 1 presents the Precision, Recall and F -measure scores obtained on the Abuse class, for both baselines (Content-based (Papegnies et al., 2017b) and Graph-based (Papegnies et al., 2019) ) and all three proposed fusion strategies (Early Fusion, Late Fusion and Hybrid Fusion)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-147",
"text": "It also shows the number of features used to perform the classification, the time required to compute the features and perform the cross validation (Total Runtime) and to compute one message in average (Average Runtime)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-148",
"text": "Note that Late Fusion has only 2 direct inputs (content-and graph-based SVMs), but these in turn have their own inputs, which explains the values displayed in the table."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-150",
"text": "**CLASSIFICATION PERFORMANCE**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-151",
"text": "Our first observation is that we get higher F -measure values compared to both baselines when performing the fusion, independently from the fusion strategy."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-152",
"text": "This confirms what we expected, i.e. that the information encoded in the interactions between the users differs from the information conveyed by the content of the messages they exchange."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-153",
"text": "Moreover, this shows that both sources are at least partly complementary, since the performance increases when merging them."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-154",
"text": "On a side note, the correlation between the score of the graph-and content-based classifiers is 0.56, which is consistent with these observations."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-155",
"text": "Next, when comparing the fusion strategies, it appears that Late Fusion performs better than the others, with an F -measure of 93.26."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-156",
"text": "This is a little bit surprising: we were expecting to get superior results from the Early Fusion, which has direct access to a much larger number of raw features (488)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-157",
"text": "By comparison, the Late Fusion only gets 2 features, which are themselves the outputs of two other classifiers."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-158",
"text": "This means that the Content-Based and Graph-Based classifiers do a good work in summarizing their inputs, without loosing much of the information necessary to efficiently perform the classification task."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-159",
"text": "Moreover, we assume that the Early Fusion classifier struggles to estimate an appropriate model when dealing with such a large number of features, whereas the Late Fusion one benefits from the pre-processing performed by its two predecessors, which act as if reducing the dimensionality of the data."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-160",
"text": "This seems to be confirmed by the results of the Hybrid Fusion, which produces better results than the Early Fusion, but is still below the Late Fusion."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-161",
"text": "This point could be explored by switching to classification algorithm less sensitive to the number of features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-162",
"text": "Alternatively, when considering the three SVMs used for the Late Fusion, one could see a simpler form of a very basic Multilayer Perceptron, in which each neuron has been trained separately (without system-wide backpropagation)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-163",
"text": "This could indicate that using a regular Multilayer Perceptron directly on the raw features could lead to improved results, especially if enough training data is available."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-164",
"text": "Regarding runtime, the graph-based approach takes more than 8 hours to run for the whole corpus, mainly because of the feature computation step."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-165",
"text": "This is due to the number of features, and to the compute-intensive nature of some of them."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-166",
"text": "The content-based approach is much faster, with a total runtime of less than 1 minute, for the exact opposite reasons."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-167",
"text": "Fusion methods require to compute both content-and graph-based features, so they have the longest runtime."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-169",
"text": "**FEATURE STUDY**"
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-170",
"text": "We now want to identify the most discriminative features for all three fusion strategies."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-171",
"text": "We apply an iterative method based on the Sklearn toolkit, which allows us to fit a linear kernel SVM to the dataset and provide a ranking of the input features reflecting their importance in the classification process."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-172",
"text": "Using this ranking, we identify the least discriminant feature, remove it from the dataset, and train a new model with the remaining features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-173",
"text": "The impact of this deletion is measured by the performance difference, in terms of F -measure."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-174",
"text": "We reiterate this process until only one feature remains."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-175",
"text": "We call Top Features (TF) the minimal subset of features allowing to reach 97% of the original performance (when considering the complete feature set)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-176",
"text": "We apply this process to both baselines and all three fusion strategies."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-177",
"text": "We then perform a classification using only their respective TF."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-178",
"text": "The results are presented in Table 1 . Note that the Late Fusion TF performance is obtained using the scores produced by the SVMs trained on Content-based TF and Graphbased TF."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-179",
"text": "These are also used as features when computing the TF for Hybrid Fusion TF (together with the raw content-and graph-based features)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-180",
"text": "In terms of classification performance, by construction, the methods are ranked exactly like when considering all available features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-181",
"text": "The Top Features obtained for each method are listed in Table 2 ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-182",
"text": "The last 4 columns precise which variants of the graph-based features are concerned."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-183",
"text": "Indeed, as explained in Section 2.2, most of these topological measures can handle/ignore edge weights and/or edge directions, can be vertex-or graph-focused, and can be computed for each of the three types of networks (Before, After and Full)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-184",
"text": "There are three Content-Based TF."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-185",
"text": "The first is the Naive Bayes prediction, which is not surprising as it comes from a fully fledged classifier processing BoWs."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-186",
"text": "The second is the tf -idf score computed over the Abuse class, which shows that considering term frequencies indeed improve the classification performance."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-187",
"text": "The third is the Capital Ratio (proportion of capital letters in the comment), which is likely to be caused by abusive message tending to be shouted, and therefore written in capitals."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-188",
"text": "The Graph-Based TF are discussed in depth in our previous article (Papegnies et al., 2019) ."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-189",
"text": "To summarize, the most important features help detecting changes in the direct neighborhood of the targeted author (Coreness, Strength), in the average node centrality at the level of the whole graph in terms of distance (Closeness), and in the general reciprocity of exchanges between users (Reciprocity)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-190",
"text": "We obtain 4 features for Early Fusion TF."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-191",
"text": "One is the Naive Bayes feature (content-based), and the other three are topological measures (graph-based features)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-192",
"text": "Two of the latter correspond to the Corenessof the targeted author, computed for the Before and After graphs."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-193",
"text": "The third topological measure is his/her Eccentricity."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-194",
"text": "This reflects important changes in the interactions around the targeted author."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-195",
"text": "It is likely caused by angry users piling up on the abusive user after he has posted some inflammatory remark."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-196",
"text": "For Hybrid Fusion TF, we also get 4 features, but those include in first place both SVM outputs from the content-and graph-based classifiers."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-197",
"text": "Those are completed by 2 graph-based features, including Strength (also found in the Graph-based and Late Fusion TF) and Coreness (also found in the Graph-based, Early Fusion and Late Fusion TF)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-198",
"text": "Besides a better understanding of the dataset and classification process, one interesting use of the TF is that they can allow decreasing the computational cost of the classification."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-199",
"text": "In our case, this is true for all methods: we can retain 97% of the performance while using only a handful of features instead of hundreds."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-200",
"text": "For instance, with the Late Fusion TF, we need only 3% of the total Late Fusion runtime."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-203",
"text": "In this article, we tackle the problem of automatic abuse detection in online communities."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-204",
"text": "We take advantage of the methods that we previously developed to leverage message content (Papegnies et al., 2017a) and interactions between users (Papegnies et al., 2019) , and create a new method using both types of information simultaneously."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-205",
"text": "We show that the features extracted from our content-and graph-based approaches are complementary, and that combining them allows to sensibly improve the results up to 93.26 (F -measure)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-206",
"text": "One limitation of our method is the computational time required to extract certain features."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-207",
"text": "However, we show that using only a small subset of relevant features allows to dramatically reduce the processing time (down to 3%) while keeping more than 97% of the original performance."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-208",
"text": "Another limitation of our work is the small size of our dataset."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-209",
"text": "We must find some other corpora to test our methods at a much higher scale."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-210",
"text": "However, all the available datasets are composed of isolated messages, when we need threads to make the most of our approach."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-211",
"text": "A solution could be to start from datasets such as the Wikipedia-based corpus proposed by Wulczyn et al. (2017) , and complete them by reconstructing the original conversations containing the annotated messages."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-212",
"text": "This could also be the opportunity to test our methods on an other language than French."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-213",
"text": "Our content-based method may be impacted by this change, but this should not be the case for the graph-based method, as it is independent from the content (and therefore the language)."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-214",
"text": "Besides language, a different online community is likely to behave differently from the one we studied before."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-215",
"text": "In particular, its members could react differently differently to abuse."
},
{
"sent_id": "8abb7b77fd6996a905395de9693d42-C001-216",
"text": "The Wikipedia dataset would therefore allow assessing how such cultural differences affect our classifiers, and identifying which observations made for Space Origin still apply to Wikipedia."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"8abb7b77fd6996a905395de9693d42-C001-33"
],
[
"8abb7b77fd6996a905395de9693d42-C001-36"
],
[
"8abb7b77fd6996a905395de9693d42-C001-48"
],
[
"8abb7b77fd6996a905395de9693d42-C001-188"
]
],
"cite_sentences": [
"8abb7b77fd6996a905395de9693d42-C001-33",
"8abb7b77fd6996a905395de9693d42-C001-36",
"8abb7b77fd6996a905395de9693d42-C001-48",
"8abb7b77fd6996a905395de9693d42-C001-188"
]
},
"@EXT@": {
"gold_contexts": [
[
"8abb7b77fd6996a905395de9693d42-C001-38",
"8abb7b77fd6996a905395de9693d42-C001-39"
],
[
"8abb7b77fd6996a905395de9693d42-C001-204"
]
],
"cite_sentences": [
"8abb7b77fd6996a905395de9693d42-C001-38",
"8abb7b77fd6996a905395de9693d42-C001-204"
]
},
"@USE@": {
"gold_contexts": [
[
"8abb7b77fd6996a905395de9693d42-C001-38"
],
[
"8abb7b77fd6996a905395de9693d42-C001-52",
"8abb7b77fd6996a905395de9693d42-C001-53"
],
[
"8abb7b77fd6996a905395de9693d42-C001-105"
],
[
"8abb7b77fd6996a905395de9693d42-C001-129"
],
[
"8abb7b77fd6996a905395de9693d42-C001-139"
],
[
"8abb7b77fd6996a905395de9693d42-C001-146"
]
],
"cite_sentences": [
"8abb7b77fd6996a905395de9693d42-C001-38",
"8abb7b77fd6996a905395de9693d42-C001-53",
"8abb7b77fd6996a905395de9693d42-C001-105",
"8abb7b77fd6996a905395de9693d42-C001-129",
"8abb7b77fd6996a905395de9693d42-C001-139",
"8abb7b77fd6996a905395de9693d42-C001-146"
]
},
"@MOT@": {
"gold_contexts": [
[
"8abb7b77fd6996a905395de9693d42-C001-79",
"8abb7b77fd6996a905395de9693d42-C001-80",
"8abb7b77fd6996a905395de9693d42-C001-81"
]
],
"cite_sentences": [
"8abb7b77fd6996a905395de9693d42-C001-81"
]
}
}
},
"ABC_05bf376f0a18cf313ead7189b029b6_9": {
"x": [
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-2",
"text": "In this paper, we describe the DeepNNNER entry to The 2nd Workshop on Noisy User-generated Text (WNUT) Shared Task #2: Named Entity Recognition in Twitter."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-3",
"text": "Our shared task submission adopts the bidirectional LSTM-CNN model of Chiu and Nichols (2016), as it has been shown to perform well on both newswire and Web texts."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-4",
"text": "It uses word embeddings trained on large-scale Web text collections together with text normalization to cope with the diversity in Web texts, and lexicons for target named entity classes constructed from publicly-available sources."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-5",
"text": "Extended evaluation comparing the effectiveness of various word embeddings, text normalization, and lexicon settings shows that our system achieves a maximum F1-score of 47.24, performance surpassing that of the shared task's second-ranked system."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-8",
"text": "Named entity recognition (NER) is an important part of natural language processing."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-9",
"text": "It is a challenging task that requires robust recognition to detect common entities over a large variety of expressions and vocabularies."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-10",
"text": "These problems are intensified when targeting Web texts because of challenges such as differences in spelling and punctuation conventions, neologisms, and Web markup (Baldwin et al., 2015) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-11",
"text": "Traditional approaches to NER on newswire texts has been dominated by machine learning methods that rely heavily on manual feature engineering and external knowledge sources (Ratinov and Roth, 2009; Lin and Wu, 2009; Passos et al., 2014) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-12",
"text": "Recently, neural network models -especially those that use recursive models -have shown that state of the art performance can be achieved with little feature engineering (Collobert et al., 2011; Santos et al., 2015; Chiu and Nichols, 2016) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-13",
"text": "However, despite their popularity for NER on newswire texts, neural networks have not been widely adopted for NER on Web texts, with the exception of the feed-forward neural network (FFNN) model of Godin et al. (2015) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-14",
"text": "In this paper, we present the DeepNNNER entry to the WNUT 2016 Shared Task #2: Named Entity Recognition in Twitter."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-15",
"text": "Our shared task submission is based on the model of Chiu and Nichols (2016) , a hybrid model of bidirectional long short-term memory (BLSTM) networks and convolutional neural networks (CNN) that automatically learns both character-and word-level features, and which holds the current state-of-the-art on both newswire texts (CoNLL 2003) and diverse corpora including Web texts (OntoNotes 5.0)."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-16",
"text": "In contrast to CRFs, FFNNs, and other windowed models, the BLSTM gives our model effectively infinite context on both sides of a word during sequential labeling."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-17",
"text": "The character-level CNN allows our model to learn relevant features from the orthography of words, which is important in task where unseen words are commonplace."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-18",
"text": "Finally, it also encodes partial lexicon matches in neural networks, allowing it to make effective use of lexical knowledge."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-19",
"text": "Our primary contribution is adapting the model of Chiu and Nichols (2016) to Twitter data by developing a text normalization method to effectively apply word embeddings to large vocabulary Web texts and automatically constructing lexicons for the shared task's target NE classes from publicly-available sources."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-20",
"text": "The rest of our paper is organized as follows."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-21",
"text": "In Section 2, we describe the adaptations made to Chiu and Nichols (2016) 's model."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-22",
"text": "In Section 3, we describe the evaluation methodology."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-23",
"text": "In Section 4, we discuss the results and present an error analysis."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-24",
"text": "In Section 5, we summarize related research."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-25",
"text": "Finally, in Section 6, we give concluding remarks."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-26",
"text": "Figure 1: Our proposed system architecture for NER."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-27",
"text": "Feature embeddings are constructed following Section 2.1."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-28",
"text": "The output from both the forward and the backward LSTM are fed through a linear and a log-softmax layer before being added together (shown as \"Output layers\") to produce the tag scores."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-30",
"text": "**MODEL**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-31",
"text": "In this section, we describe the architecture of our shared task submission."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-32",
"text": "An overview is given Figure 1 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-33",
"text": "Our system is based on the BLSTM-CNN model of Chiu and Nichols (2016) , and, unless otherwise noted, follows their training and tagging methodology, which the reader is referred to for more details."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-35",
"text": "**FEATURES**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-36",
"text": "Feature embeddings for words are constructed by concatenating together the features listed here."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-38",
"text": "**WORD EMBEDDINGS**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-39",
"text": "Word embeddings are critical for high-performance neural networks in NLP tasks (Turian et al., 2010) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-40",
"text": "In this paper, we compare six publicly available pre-trained word embeddings."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-41",
"text": "The embeddings are described in detail in Table 3 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-42",
"text": "The neural embeddings of Collobert et al. (2011) were chosen because Chiu and Nichols (2016) reported them to be the highest performing on both CoNLL-2003 and OntoNotes 5.0 datasets."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-43",
"text": "To evaluate embeddings trained on data closer to the WNUT dataset, we also selected the GloVe embeddings of Pennington et al. (2014) , trained on both Web text and tweets, and word2vec embeddings trained on Google News data (Mikolov et al., 2013 ) and on tweets (Godin et al., 2015) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-44",
"text": "Preliminary evaluation on the Dev1 data showed that GloVe 27B outperformed Collobert's embeddings (see Table 5 ) and word2vec 3B, so they were used in our submission."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-45",
"text": "Following Collobert et al. (2011) , we use lookup tables to extract embeddings and every word is lower cased before lookup."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-47",
"text": "**CNN-EXTRACTED CHARACTER FEATURES**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-48",
"text": "Following Chiu and Nichols (2016) , we use a CNN to extract features from 25 dim."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-49",
"text": "character embeddings randomly-initialized from a uniform distribution between -0.5 and 0.5."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-50",
"text": "To accommodate text normalization, we added embeddings for the normalization symbols described in Section 2.2, namely , , , , , , , and . All experiments were conducted with the same character embeddings."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-52",
"text": "**LEXICON FEATURES**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-53",
"text": "Prior knowledge in the form of lexicons (also known as \"gazetteers\") has been shown to be essential to NER (Ratinov and Roth, 2009; Passos et al., 2014) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-54",
"text": "This section describes how the lexicons employed by our system were constructed."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-55",
"text": "We designed the lexicon categories to be as close as possible to the shared task NE classes by extracting corresponding descendants from the DBpedia ontology (Auer et al., 2007) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-56",
"text": "The lexicon used by our system contains 2.2 million entries over 9 different categories, as shown in Table 1 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-57",
"text": "While most of the lexicons were extracted using only one descendant from the ontology, Misc, Music, and Product were constructed using multiple classes."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-58",
"text": "First, in order to match entries such as festivals, holidays, songs, and more from the other class, we constructed the Misc lexicon from Event and Work types in the DBpedia ontology excluding Movie and TelevisionShow to avoid overlap with other classes."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-59",
"text": "Second, in order to deal with inconsistencies between person and musicartist classes as discussed in Section 3, the Music lexicon is a combination of the subtypes Band and MusicalArtist 1 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-60",
"text": "Finally, in order to maximize coverage, the Product lexicon is a combination of the subtype Device from the DBpedia ontology and the lexicon product distributed with WNUT dataset."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-61",
"text": "Every other category is as described in Table 1 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-62",
"text": "To generate lexicon features, we apply the partial matching algorithm of Chiu and Nichols (2016) to the input text, as shown in Figure 2 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-63",
"text": "Each lexicon and match type (BIOES) is associated with a randomly-initialized 5 dim."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-64",
"text": "embedding."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-65",
"text": "The embeddings for all lexicons are concatenated together to produce the lexicon feature for each word in the input."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-66",
"text": "To facilitate matching, all entries were stripped of parentheses and tokenized with the Penn Treebank tokenization script."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-67",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-68",
"text": "**CAPITALIZATION FEATURE**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-69",
"text": "Following Chiu and Nichols (2016), we used different symbols for word-level capitalization feature each assigned a randomly initialized embedding: allCaps, upperInitial, lowercase, mixedCaps and noinfo."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-70",
"text": "Similar symbols were used for character-level (upper case, lower case, punctuation, other)."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-71",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-72",
"text": "**TEXT NORMALIZATION**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-73",
"text": "In order to maximize word embedding lookup coverage, we modify the publicly available GloVe preprocessing script 2 to normalize irregular spelling and replace special symbols with special embeddings: , , , , , , , and . Repeated punctuation is also removed."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-74",
"text": "When processing hashtags, the hashtag body is split on capital letters, distributing the NE tag across the resulting tokens."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-75",
"text": "This helps increase word embedding coverage."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-76",
"text": "Refer to Figure 1 for an example."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-77",
"text": "Additionally, we attempted to correct the most obvious spelling irregularities where letters in a word are repeated more than twice, consulting a dictionary to decide wether to keep one or two occurrences of that repeated letter."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-78",
"text": "When consulting the dictionary, we prioritized shorter matches when the repeated letter appeared at the end of the word and longer matches otherwise."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-79",
"text": "For evaluation of the final system, we mapped the NE tags onto the original test data tokens, as shown in Figure 1 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-80",
"text": "Because of the tokenization, some of the original entries could end up with more than one tag."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-81",
"text": "In this case, we prioritize entity over non-entity tags, and keep the most frequent tag."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-82",
"text": "Prioritizing entity over non-entity tags was meant to improve recall, albeit at the expense of precision."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-83",
"text": "Initial experiments on Dev1 comparing word2vec 3B, Collobert, and GloVe 27B embeddings showed that text normalization improved performance for word2vec 3B and GloVe 27B but not Collobert 3 (Table 5) ; that word type coverage increased drastically for all embeddings; and that while word token coverage greatly increased for GloVe 27B, it slightly decreased for other embeddings (see Table 4 )."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-84",
"text": "We thus selected GloVe 27B embeddings for our submission due to their superior performance and coverage."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-86",
"text": "**TRAINING AND INFERENCE**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-87",
"text": "We follow the training and inference methodology of Chiu and Nichols (2016) , training our neural network to maximize the sentence-level log-likelihood from Collobert et al. (2011) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-88",
"text": "Training is done by mini-batch SGD with a fixed learning rate, and we apply dropout (Pham et al., 2014) to the output nodes."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-89",
"text": "All feature representations are \"unfrozen\" and allowed to be updated by the training algorithm."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-90",
"text": "We used the IOB tag scheme to annotate named entities."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-91",
"text": "We also explored the BIOES tag scheme 4 , as it was reported to outperform IOB (Ratinov and Roth, 2009 ), however, IOB outperformed BIOES in preliminary experiments."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-92",
"text": "We suspect that data sparsity prevented the model from learning meaningful representations for the extra tags."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-93",
"text": "Our shared task submission's model trained in approximately 90 minutes and tags the test set in approximately 20 seconds, with memory usage peaking at 350MB 5 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-94",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-95",
"text": "**HYPER-PARAMETER OPTIMIZATION**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-96",
"text": "To maximize performance, we perform hyper-parameter optimization using Optunity's implementation of particle swarm (Claesen et al., 2014) , as there is some evidence that it is more efficient than random search (Clerc and Kennedy, 2002) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-97",
"text": "The hyper-parameters of our model and final selected values are given in Table 2 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-98",
"text": "We evaluated 800 hyper-parameter settings in total."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-99",
"text": "The search used 5-fold validation to maximize the influence of the entire dataset, as it was small, and we kept the best performing setting."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-101",
"text": "**EVALUATION**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-102",
"text": "The WNUT 2016 dataset consists of user-generated tweets tagged with 10 types of named entities: company, facility, geo-loc, movie, musicartist, other, person, product, sportsteam, and tvshow."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-103",
"text": "Table 4 shows the train, dev and test set data splits."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-104",
"text": "Compared to the well-researched CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) or the OntoNotes 5.0 dataset (Pradhan et al., 2013) , the WNUT dataset contains a lot of spelling irregularities and special symbols."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-105",
"text": "For example, Christmas written as xmas, Guys written as Gaiiissss, emoticons such as \":-)\", \":(\", \"<3\" and so on are commonplace."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-106",
"text": "Such examples illustrate the diversity of the dataset's vocabulary, motivating us to perform text normalization as described in Section 2.2."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-107",
"text": "Some inconsistencies were found between Dev2 and the other data."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-108",
"text": "The most obvious one is where singers previously tagged as person in Train were tagged as musicartist in Dev2."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-109",
"text": "This is easily verifiable by comparing tags for the entity Justin Bieber in those datasets."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-110",
"text": "These tag inconsistencies make it difficult to learn a robust model for those classes, so we manually retagged all person entities, keeping the most precise tag (i.e. tagging all singers as musicartist)."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-111",
"text": "We did so by searching for every person entity with Google and used the surrounding context to determine the most precise tag, replacing a total of 82 person entities out of 664."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-112",
"text": "Other local inconsistencies were not corrected as not enough evidence was found."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-113",
"text": "In Section 4.3.2 we explore inconsistencies in common tagging errors."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-114",
"text": "For each experiment, we report the average for precision and recall, and the average and standard deviation for f1-score for 10 successful trials."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-115",
"text": "Minor inconsistencies in reported f1-scores and precision and recall result from those scores being averaged independently."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-116",
"text": "Statistical significance is calculated using the Wilcoxon rank sum test, due to its robustness against small sample sizes with unknown distributions."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-118",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-119",
"text": "In this section, we (1) compare the performance of different word embeddings, (2) analyze the influence of our lexicon over the performance of our final model, and (3) perform error analysis of various aspects of both our system and the dataset."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-120",
"text": "Table 7 shows the final results for different settings."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-121",
"text": "Following Cherry et al. (2015) , we compare our system settings to other shared task entries (Strauss et al., 2016) and present their retroactive ranks."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-122",
"text": "While our submitted system uses GloVe embeddings trained on Twitter (GloVe 27B), we found that GloVe embeddings trained on Common Crawl (GloVe 42B) with text normalization and lexicons was our best performing setting, achieving a retroactive rank of second place."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-123",
"text": "with performance for GloVe and word2vec embeddings 6 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-124",
"text": "Note that word type coverage appears to be more important than token coverage."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-125",
"text": "It could be the case that NEs are more likely to contain lowfrequency words, necessitating a large token vocabulary."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-126",
"text": "GloVe 42B's increased performance over GloVe 27B could be explained thus, though it is also possible that its larger, more diverse dataset is responsible."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-127",
"text": "It is interesting to point out that Collobert was able to outperform embeddings trained on much larger datasets with much larger vocabularies."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-128",
"text": "While all embeddings improved with text normalization, only GloVe 27B and 42B got statistically significant 7 increases."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-129",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-130",
"text": "**WORD EMBEDDINGS**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-131",
"text": "Finally, in order to save time on training, we reduced the vocabulary of the word embedding lookup table to contain only words from the training data."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-132",
"text": "This allowed us to reduce the network's training time by half and reduce its memory usage by over 90%."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-133",
"text": "However, due to a bug, words outside of the train and dev set vocabulary were treated as unknown, considerably degrading our system's performance."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-134",
"text": "When the vocabulary bug is fixed, our submission setting achieves performance with a retroactive rank of third place."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-135",
"text": "See Table 6 for a comparison to other shared task entries taken from Strauss et al. (2016) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-137",
"text": "**LEXICON FEATURES**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-138",
"text": "Usage of lexicons greatly improved performance, providing a statistically significant increase in f1-score for Collobert 8 , GloVe 27B 9 , GloVe 42B 10 , and word2vec-400MT 11 embeddings (see Tables 7 and 8 )."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-139",
"text": "Figure 3 (left) shows a heat map of lexical coverage."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-140",
"text": "As many of the cells along the diagonal are bright, it shows that we were able to produce lexicons for many categories with high coverage and low ambiguity."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-141",
"text": "However, there are some notable exceptions, such as the Location lexicon showing high coverage on both facility and geo-loc, and both Music and Person lexicons showing high coverage on musicartist."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-142",
"text": "This lexical overlap likely contributes to misclassification errors; the confusion matrix in Figure 3 (right) shows that misclassifications between facility and geo-loc and musicartist and person are quite frequent."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-143",
"text": "Some lexical overlap makes sense considering the fact that sports teams will often include city names such as Montreal Canadiens or Philadelphia Eagles."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-144",
"text": "Table 8 compares our fixed shared task submission's entity-level f1-scores with and without lexicons."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-145",
"text": "These results show that while many lexicons were effective -particularly company, geo-loc, and 6 It is surprising that word2vec-400MT underperforms Glove 42B, despite its superior word type coverage, but this could be due to differences in training algorithm, preprocessing (word2vec-400MT used Ritter et al. (2011) 's Twitter NLP Tools), or casing (word2vec-400MT preserved case, while Glove 42B did not)."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-146",
"text": "We also evaluated 300 dim."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-147",
"text": "GloVe embeddings trained on 840B words of Common Crawl data with a vocabulary size of 2.2M, however, they underperformed the GloVe 42B embeddings."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-148",
"text": "7 Wilcoxon rank sum test, p < 0.05."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-149",
"text": "Figure 4 : Examples of (1) contextual ambiguity (2) tagging inconsistencies and (3) hashtags inconsistencies."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-150",
"text": "The upper tag is gold annotation."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-151",
"text": "The lower tag is our system's prediction."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-152",
"text": "other -the lexicons MusicArtist, Person, SportsTeam, and TVShow were detrimental to NER performance."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-153",
"text": "As noted above, the MusicArtist and Person lexicons had substantial overlap, most likely contributing to poor performance."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-155",
"text": "**ERROR ANALYSIS**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-156",
"text": "In this section, we describe different sources of errors from a subsample of mistagged test set entities."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-157",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-158",
"text": "**UNSEEN ENTITIES**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-159",
"text": "One of the biggest source of errors when trying to tag noisy Web-text is the amount of unseen entities the system will face."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-160",
"text": "In the WNUT dataset, roughly 40% of the entities present in the test set are not in the train or dev datasets."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-161",
"text": "This underscores the importance of high-coverage word embeddings, lexicon construction, and lexical matching, since the tagger has not encountered almost half of the entities."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-163",
"text": "**CONTEXTUAL AMBIGUITY**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-164",
"text": "With fine-grained entities such as the ones defined for this task, our system tends to make errors due to confusion between entity classes."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-165",
"text": "Figure 3 shows the confusion matrix when the system is evaluated over the test dataset."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-166",
"text": "One common error occurs between geo-loc and other classes, more specifically company, facility and sportsteam."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-167",
"text": "We extracted 50 examples for each type of confusion and found out that place names were mostly being tagged as geo-loc even though context indicates otherwise."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-168",
"text": "Figure 4 shows a few examples."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-169",
"text": "Another important class ambiguity is between musicartist and person."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-170",
"text": "In a subsample of 64 examples, 49 were tagged as person."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-171",
"text": "Furthermore, the entity matched both entity's lexicons in 59% of the cases."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-172",
"text": "This is also supported by the confusion matrix where music artists get tagged as person more than twice as often as they get tagged correctly."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-173",
"text": "This contextual ambiguity seems to have led to a few tagging inconsistencies that could also explain lower overall performance."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-174",
"text": "Either from train to test set or within the same set, entities sometimes ended up with multiple tags or no tags at all."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-175",
"text": "Such examples are: singers like Justin Bieber being tagged as person in the training set and musicartist in the test set; devices such as BlackBerry being tagged either as company or product."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-176",
"text": "Some of these inconsistencies are understandable because most of the time more than one tag could fit 12 ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-177",
"text": "Refining lexicons to maximize coverage while minimizing ambiguity remains an essential area of future work."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-179",
"text": "**HASHTAGS**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-180",
"text": "In tweets, hashtags are omnipresent."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-181",
"text": "They are a way to highlight relevant keywords or phrases making it easier to categorize the tweets they are in."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-182",
"text": "It then becomes important to be able to retrieve important information from those relevant keywords."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-183",
"text": "From the subsample we observed that most entities containing a hashtag were not tagged at all."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-184",
"text": "This can be explained by the fact that only 4% of hashtags are part of entities in the training set making our network biased against tagging hashtags."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-185",
"text": "This likely lead to more errors on the test set where more than 15% of hashtags are part of entities."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-186",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-187",
"text": "**RELATED RESEARCH**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-188",
"text": "Named entity recognition is a task with a long history, dating back to MUC-7 (Chinchor and Robinson, 1997) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-189",
"text": "In this section, we describe the NER research that influenced our system and give an overview of the work on NER for Twitter."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-190",
"text": "For a more detailed survey, see (Chiu and Nichols, 2016) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-191",
"text": "Most recent approaches to NER have been characterized by the use of CRF, SVM, and perceptron models, where performance is heavily dependent on feature engineering."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-192",
"text": "Ratinov and Roth (2009) used non-local features, a gazetteer extracted from Wikipedia, and Brown-cluster-like word representations."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-193",
"text": "Lin and Wu (2009) used phrase features obtained by performing k-means clustering over a private database of search engine query logs in place of a lexicon."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-194",
"text": "Passos et al. (2014) proposed a model that infused word embeddings with lexical knowledge."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-195",
"text": "In order to combat the problem of sparse features, Suzuki et al. (2011) performed feature reduction with large-scale unlabelled data."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-196",
"text": "Recently, the state-of-the-art for NER neural networks have overtaken other approaches to NER."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-197",
"text": "Most approaches build on the pioneering work of Collobert et al. (2011) , which showed that word embeddings could be employed in a deep FFNN to achieve near state-of-the-art results on POS tagging, chunking, NER, and SRL."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-198",
"text": "Santos et al. (2015) augmented the architecture of Collobert et al. (2011) with character-level CNNs, reporting improved performance on Spanish and Portuguese NER."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-199",
"text": "Huang et al. (2015) employed BLSTMs in place of FFNNs for the POS-tagging, chunking, and NER tasks, but they employed heavy feature engineering instead of using a CNN to automatically extract character-level features."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-200",
"text": "Lample et al. (2016) proposed LSTM-CRF and Stack-LSTM architectures for NER."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-201",
"text": "The earliest work on NER for Twitter, used a CRF model with global features from tweet clusters to conduct NER with the MUC-7 4 class task definition (Liu et al., 2011) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-202",
"text": "Ritter et al. (2011) developed a suite of NLP tools explicitly for Twitter and expanded the task to the 10 class definition used in the WNUT shared tasks."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-203",
"text": "A key difference between NER for Twitter and conventional NER is that the former also considers peripheral tasks such as named entity tokenization (Li et al., 2012) , normalization (Liu et al., 2012) , and linking (Guo et al., 2013; Yamada et al., 2015) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-204",
"text": "The WNUT 2015 Shared Task included text normalization and named entity tokenization and detection tasks (Baldwin et al., 2015) , with most systems using machine learning methods like CRF together with a variety of features including lexicons, orthographic features, and distributional information."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-205",
"text": "In contrast with conventional NER, there was only one neural network entry (Godin et al., 2015) , and most systems tended to prefer Brown clusters to word embeddings."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-206",
"text": "The state of the art at WNUT 2015 used a cascaded model of entity tokenization, followed by linking to knowledge bases, and, finally, classification with random forests (Yamada et al., 2015) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-207",
"text": "Our system adopts the architecture of Chiu and Nichols (2016) , which combined BLSTMs to maximize context over the tagged word sequence and word-level CNNs to automatically generate characterlevel features with a partial-matching lexicon to achieve the state-of-the-art for NER on both CoNLL 2003 and OntoNotes datasets."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-208",
"text": "Our system can be viewed as an investigation into how well state-of-theart neural approaches adapt to the challenges of NER on noisy Web data."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-210",
"text": "**CONCLUSION**"
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-211",
"text": "In this paper, we described the DeepNNNER entry to the WNUT 2016 Shared Task #2: Named Entity Recognition in Twitter, which adopted the BLSTM-CNN model of Chiu and Nichols (2016) ."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-212",
"text": "Extensive evaluation showed that high word type coverage for word embeddings is crucial to NER performance, likely due to rare words in entities, and that both text normalization and partial matching on lexicons constructed from DBpedia (Auer et al., 2007) contribute significantly to performance."
},
{
"sent_id": "05bf376f0a18cf313ead7189b029b6-C001-213",
"text": "Our best-performing system uses text normalization, lexicon partial matching, and the GloVe word embeddings of Pennington et al. (2014) trained on 42B words of Common Crawl data, and it achieves a maximum F1-score of 47.24, performance surpassing that of the shared task's second-ranked system."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"05bf376f0a18cf313ead7189b029b6-C001-3"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-15"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-21"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-33"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-42"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-48"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-62"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-69"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-87"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-207"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-211"
]
],
"cite_sentences": [
"05bf376f0a18cf313ead7189b029b6-C001-3",
"05bf376f0a18cf313ead7189b029b6-C001-15",
"05bf376f0a18cf313ead7189b029b6-C001-21",
"05bf376f0a18cf313ead7189b029b6-C001-33",
"05bf376f0a18cf313ead7189b029b6-C001-42",
"05bf376f0a18cf313ead7189b029b6-C001-48",
"05bf376f0a18cf313ead7189b029b6-C001-62",
"05bf376f0a18cf313ead7189b029b6-C001-69",
"05bf376f0a18cf313ead7189b029b6-C001-87",
"05bf376f0a18cf313ead7189b029b6-C001-207",
"05bf376f0a18cf313ead7189b029b6-C001-211"
]
},
"@BACK@": {
"gold_contexts": [
[
"05bf376f0a18cf313ead7189b029b6-C001-12"
],
[
"05bf376f0a18cf313ead7189b029b6-C001-190"
]
],
"cite_sentences": [
"05bf376f0a18cf313ead7189b029b6-C001-12",
"05bf376f0a18cf313ead7189b029b6-C001-190"
]
},
"@EXT@": {
"gold_contexts": [
[
"05bf376f0a18cf313ead7189b029b6-C001-19"
]
],
"cite_sentences": [
"05bf376f0a18cf313ead7189b029b6-C001-19"
]
}
}
},
"ABC_79ac70221fa28e577876425cad627f_10": {
"x": [
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-2",
"text": "Language is designed to convey useful information about the world, thus serving as a scaffold for efficient human learning."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-3",
"text": "How can we let language guide representation learning in machine learning models?"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-4",
"text": "We explore this question in the setting of few-shot visual classification, proposing models which learn to perform visual classification while jointly predicting natural language task descriptions at train time."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-5",
"text": "At test time, with no language available, we find that these language-influenced visual representations are more generalizable, compared to meta-learning baselines and approaches that explicitly use language as a bottleneck for classification."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-26",
"text": "We are interested in N -way, K-shot learning, where a task t consists of N support classes {S"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-8",
"text": "Humans are powerful and data-efficient learners partially due to the ability to learn from language [6, 30] : for instance, we can learn about robins not by seeing thousands of examples, but by being told that a robin is a bird with a red belly and brown feathers."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-9",
"text": "This language not only helps us learn about robins, but shapes the way we view the world, constraining the hypotheses we form for other concepts [12] : given a new bird like seagulls, even without language we know to attend to salient features including belly and feather color."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-10",
"text": "In this paper, we propose to use language as a guide for representation learning, building few-shot classification models that learn visual representations while jointly predicting task-specific language during training."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-11",
"text": "Crucially, our models can operate without language at test time: a more practical setting, since it is often unrealistic to assume that linguistic supervision is available for unseen classes encountered in the wild."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-12",
"text": "Compared to meta-learning baselines and recent approaches which use language supervision as a more fundamental bottleneck in a model [1] , we find this simple auxiliary training objective results in learned representations that generalize better to new concepts."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-13",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-14",
"text": "**RELATED WORK**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-15",
"text": "Language has been shown to assist visual classification in various settings, including traditional visual classification with no transfer [16] and with language available at test time in the form of class labels or descriptions for zero- [10, 11, 27] or few-shot [24, 33] learning."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-16",
"text": "Unlike this work, we study a setting where we have no language at test time and test tasks are unseen, so language from training can no longer be used as additional class information [cf. 16] or weak supervision for labeling additional in-domain data [cf. 15] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-17",
"text": "Our work can thus be seen as an instance of the learning using privileged information (LUPI) [31] framework, where richer supervision augments a model during training only."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-18",
"text": "In this framework, learning with attributes and other domain-specific rationales has been tackled extensively [8, 9, 29] , but language remains relatively unexplored."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-19",
"text": "[13] use METEOR scores between captions as a similarity metric for specializing embeddings for image retrieval, but do not directly Figure 1 : Building on prototype networks [26] , we propose few-shot classification models whose learned representations are constrained to predict natural language descriptions of the task during training, in contrast to models [1] which explicitly use language as a bottleneck for classification."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-20",
"text": "ground language explanations."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-21",
"text": "[28] explore a supervision setting similar to ours, except in highly structured text and symbolic domains where descriptions can be easily converted to executable forms via semantic parsing."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-22",
"text": "Another line of work studies models which generate natural language explanations of decisions for interpretability for both textual (e.g. natural language inference; [3] ) and visual [17, 18] tasks, but here we examine whether this act of predicting language can actually improve downstream task performance; similar ideas have been explored in text [22] and reinforcement learning [2, 14] domains."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-23",
"text": "Our work is most similar to [1] , which we describe and compare to later."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-24",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-25",
"text": "**LANGUAGE-SHAPED LEARNING**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-27",
"text": "The goal is to predict each y The approach we propose is applicable to any meta-learning framework that learns an embedding of its input."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-28",
"text": "Here we use prototype networks [26] , which have a simple but powerful inductive bias for few-shot learning."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-29",
"text": "Prototype networks learn an embedding function f \u03b8 for exemplars; the embeddings of all examples of a class n are then averaged to form a class \"prototype\" (omitting task (t) for clarity):"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-30",
"text": "Given a query point (x m , y m ), we predict class n with probability proportional to some similarity function s between c n and f \u03b8 (x m ):"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-31",
"text": "The classification loss for a single task is"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-32",
"text": "Adding language."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-33",
"text": "Now assume that during training we have for each class S n a set of J n associated natural language descriptions W n = {w 1 , . . . , w Jn }, where w j is a sequence of words w j = (w j,1 , . . . , w j,|wj | )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-34",
"text": "Each w j should explain features of S n and need not be associated with individual examples."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-35",
"text": "1 In Figure 1 , we have one description w 1 = (A, red, . . . , square)."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-36",
"text": "Our approach is simple: we constrain f \u03b8 to learn prototypes that can also decode the class language descriptions."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-37",
"text": "Letc n be the prototype formed by averaging the support and query examples of class n. Then define a language model g \u03c6 (e.g. a recurrent neural network), which conditioned onc n provides a probability distribution over descriptions g \u03c6 (\u0175 j |c n ) with a corresponding natural language loss:"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-38",
"text": "i.e. the total negative log-likelihood of the class descriptions across all classes in the task."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-39",
"text": "Now we jointly minimize both losses:"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-40",
"text": "where \u03bb NL is a tunable parameter controlling the weight of the natural language loss."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-41",
"text": "At test, we simply discard g \u03c6 and use f \u03b8 to classify."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-42",
"text": "With this component, we call our approach language-shaped learning (LSL) (Figure 1 )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-43",
"text": "Relation to L3."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-44",
"text": "LSL is similar to another recent model for this setting: Learning with Latent Language (L3) [1] , which proposes to use language not only as a supervision source, but as a bottleneck for classification ( Figure 1 )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-45",
"text": "L3 has the same basic architecture of LSL, but the concepts c n are the language descriptions themselves, embedded with an additional recurrent neural network (RNN) encoder h \u03b7 : c n = h \u03b7 (w n )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-46",
"text": "During training, the ground-truth description is used for classification, while g \u03c6 is trained to produce the description; at test, L3 samples descriptions\u0175 n from g \u03c6 , keeping the description most similar to the support according to the similarity function s."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-47",
"text": "While L3 has been shown to outperform meta-learning baselines, there are two potential sources of this benefit: is it the linguistic bottleneck itself, or the regularization imposed by training f \u03b8 to predict language?"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-48",
"text": "Our evaluation aims to disentangle these effects: LSL isolates the regularization component, and thus is simpler than L3 since it (1) does not require the additional embedding module h \u03b7 and (2) does not need the test-time language sampling procedure."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-51",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-52",
"text": "Here we describe our two tasks and models."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-53",
"text": "For full training details and code, see Appendix A."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-54",
"text": "ShapeWorld."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-55",
"text": "First, we use the ShapeWorld [20] dataset devised by [1] , which consists of 9000 training, 1000 validation, and 4000 test tasks ( Figure 2 )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-56",
"text": "3 Each task contains a single support set of K = 4 images representing a visual concept with an associated (artificial) English language description, generated with a minimal recursion semantics representation of the concept [7] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-57",
"text": "Each concept is a spatial relation between two objects, optionally qualified by color and/or shape; 2-3 distractor shapes are also present in each image."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-58",
"text": "The task is to predict whether a single query image x belongs to the concept."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-59",
"text": "Model details are identical to [1] for easy comparison."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-60",
"text": "f \u03b8 is the final convolutional layer of a fixed ImageNet-pretrained VGG-16 [25] fed through two fully-connected layers:"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-61",
"text": "Since this is a binary classification task with only 1 (positive) support class S and prototype c, we define the similarity function s(a, b) = \u03c3(a \u00b7 b) and the prediction P (\u0177 = 1 | x) = s (f \u03b8 (x), c)."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-62",
"text": "g \u03c6 is a gated recurrent unit (GRU) RNN [5] with hidden size h = 512, trained with teacher forcing."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-63",
"text": "Using a grid search on the validation set, we set \u03bb NL = 20."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-64",
"text": "Birds. To see if LSL can scale to more realistic scenarios, we use the Caltech-UCSD Birds dataset [32] , which contains 200 bird species, each with 40-60 images, split into 100 train, 50 validation, and 50 test classes."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-65",
"text": "[Figure 2: example tasks with language. Birds descriptions: \"The bird has a white underbelly, black feathers in the wings, a large wingspan, and a white beak.\"; \"This bird has distinctive-looking brown and white stripes all over its body, and its brown tail sticks up.\" ShapeWorld description: \"a cyan pentagon is to the right of a magenta shape\", shown with support images and true/false query images.]"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-72",
"text": "We use the language descriptions collected by [23] , where AMT crowdworkers were asked to describe images of birds in detail, without reference to the species (Figure 2 )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-73",
"text": "While 10 English descriptions per image are available in [23] , we assume a more realistic scenario where we have much less language available only at the class level: removing one-to-one associations between images and their descriptions, we aggregate a total of D descriptions for each class, and for each k-shot training episode we sample k descriptions from each class n to use as descriptions W n ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-74",
"text": "In practice, we found good results with as little as D = 20 descriptions per class (2000 total) which we report here; for results varying this number, see Appendix B."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-75",
"text": "We evaluate on the 5-way, 1-shot setting, and as f \u03b8 use the 4-layer convolutional backbone used in much of the few-shot literature [4] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-76",
"text": "Here we use a learned bilinear similarity function, s(a, b) = a Wb, where W is learned jointly with the model."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-77",
"text": "g \u03c6 is a GRU with hidden size h = 200, and with another grid search we set \u03bb NL = 3."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-78",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-79",
"text": "**RESULTS**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-80",
"text": "Results are located in Table 1 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-81",
"text": "For ShapeWorld, LSL outperforms its meta-learning baseline (Meta) by 6.7 points, and does as well as L3."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-82",
"text": "For Birds, we observe a smaller but still significant 3.3 point increase over Meta, while L3's performance drops below baseline."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-83",
"text": "We thus conclude that any benefit of L3 is mostly due to the regularizing effect that language has on its image representations, rather than the linguistic bottleneck."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-84",
"text": "Isolating the regularization, we find that LSL is the superior yet conceptually simpler model, and L3's discrete bottleneck can actually hurt in some settings."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-85",
"text": "To identify which aspects of language are most helpful for the model, we examine LSL performance under ablated language supervision: (1) keeping only a list of common color words, (2) filtering out color words, (3) shuffling the words in each caption, and (4) shuffling the captions across tasks (Figure 3 )."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-86",
"text": "We find that while the benefits of color/no-color language varies for ShapeWorld and Birds, neither component is completely sufficient for the full benefit of language supervision, demonstrating that LSL is able to leverage both colors and other attributes (e.g. size, shape) exposed through language."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-87",
"text": "Word order is important for Birds but surprisingly unimportant for ShapeWorld."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-88",
"text": "Finally, when the captions are shuffled and thus the linguistic signal is random, LSL for Birds suffers no performance loss compared to Meta, while LSL for ShapeWorld drops significantly, likely because the language descriptions are more central to the task."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-90",
"text": "**DISCUSSION**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-91",
"text": "We presented a method for regularizing a few-shot visual recognition model by forcing the model to predict natural language descriptions during training."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-92",
"text": "Across two tasks, the language-influenced representations learned with such models improved generalization over those without linguistic supervision."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-93",
"text": "By comparing to L3, we found that if a model has been trained to learn representations which expose the features and abstractions in language, a linguistic bottleneck on top of this already language-shaped representation is unnecessary, at least for the kinds of visual tasks explored here."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-94",
"text": "The line between language and sufficiently rich attributes and rationales is blurry, and as recent work has shown [29] , the performance gains in this work can likely be observed by regularizing with attributes."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-95",
"text": "However, unlike attributes and annotator rationales, language is (1) a more natural medium for annotators, (2) does not require preconceived restrictions on the kinds of features relevant to the task, and (3) is abundant in unsupervised forms."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-96",
"text": "This last point suggests we can shape representations with language from external resources (e.g. the Web), a promising future direction of work."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-97",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-98",
"text": "**A TASK/MODEL TRAINING DETAILS**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-99",
"text": "Our code is publicly available at https://github.com/jayelm/lsl."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-100",
"text": "A.1 ShapeWorld f \u03b8 . Like [1] , f \u03b8 starts with features extracted from the last convolutional layer of a fixed ImageNetpretrained VGG-19 network [25] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-101",
"text": "These 4608-d embeddings are then fed into two fully connected layers \u2208 R 4608\u00d7512 , R 512\u00d7512 with one ReLU nonlinearity in between."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-102",
"text": "LSL."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-103",
"text": "For LSL, the 512-d embedding from f \u03b8 directly initializes the 512-d hidden state of the GRU g \u03c6 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-104",
"text": "We use 300-d word embeddings initialized randomly (initializing with GloVe made no significant difference)."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-105",
"text": "L3."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-106",
"text": "f \u03b8 and g \u03c6 are the same as in LSL and Meta."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-107",
"text": "h \u03b7 is a unidirectional 1-layer GRU with hidden size 512 sharing the same word embeddings as g \u03c6 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-108",
"text": "The output of the last hidden state is taken as the embedding of the description w (t) ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-109",
"text": "Like [1] , a total of 10 descriptions per task are sampled at test time."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-110",
"text": "Training."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-111",
"text": "We train for 50 epochs, each epoch consisting of 100 batches with 100 tasks in each batch, with the Adam optimizer [19] and a learning rate of 0.001."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-112",
"text": "We selected the model with highest epoch validation accuracy during training."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-113",
"text": "This differs slightly from [1] , who use different numbers of epochs per model and did not specify how they were chosen; otherwise, the training and evaluation process is the same."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-114",
"text": "Data."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-115",
"text": "We recreated the ShapeWorld dataset using the same code as [1] , except generating 4x as many test tasks (4000 vs 1000) for more stable confidence intervals."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-116",
"text": "Note that results for both L3 and Baseline (Meta) are 3-4 points lower than the scores of the corresponding implementations in [1] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-117",
"text": "This is likely due to (1) differences in model initialization due to our PyTorch reimplementation, (2) recreation of the dataset, and (3) our use of early stopping."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-118",
"text": "A.2 Birds f \u03b8 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-119",
"text": "The 4-layer convolutional backbone f \u03b8 is the same as the one used in much of the few-shot literature [4, 26] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-120",
"text": "The model has 4 convolutional blocks, each consisting of a 64-filter 3x3 convolution, batch normalization, ReLU nonlinearity, and 2x2 max-pooling layer."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-121",
"text": "With an input image size of 84 \u00d7 84 this results in 1600-d image embeddings."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-122",
"text": "Finally, the similarity metric matrix W has dimension 1600 \u00d7 1600."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-123",
"text": "LSL."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-124",
"text": "The resulting 1600-d image embeddings are fed into a single linear layer \u2208 R 1600\u00d7200 which initializes the 200-d hidden state of the GRU."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-125",
"text": "We initialize embeddings with GloVe [21] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-126",
"text": "We did not observe significant gains from increasing the size of the decoder g \u03c6 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-128",
"text": "**L3**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-129",
"text": ". f \u03b8 and g \u03c6 are the same."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-130",
"text": "h \u03b7 is a unidirectional GRU with hidden size 200 sharing the same embeddings as g \u03c6 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-131",
"text": "The last hidden state is taken as the concept c n ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-132",
"text": "10 descriptions per class are sampled at test time."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-133",
"text": "We did not observe significant gains from increasing the size of the decoder g \u03c6 or encoder h \u03b7 , nor increasing the number of descriptions sampled per class at test."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-134",
"text": "Training."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-135",
"text": "For ease of comparison to the few-shot literature we use the same training and evaluation process as [4] ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-136",
"text": "Models were trained for 60000 episodes, each episode consisting of one randomly sampled task with 16 query images per class."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-137",
"text": "Models were trained end-to-end with the Adam optimizer [19] and a learning rate of 0.001."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-138",
"text": "We select the model with the highest validation accuracy after training."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-139",
"text": "Data. Like [4] , we use standard data preprocessing and training augmentation: ImageNet mean pixel normalization, random cropping, horizontal flipping, and color jittering."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-141",
"text": "**B AMOUNT OF LANGUAGE SUPERVISION**"
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-142",
"text": "See Figure 4 ."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-143",
"text": "With enough language supervision (one caption for each image), L3's performance approaches baseline."
},
{
"sent_id": "79ac70221fa28e577876425cad627f-C001-144",
"text": "Meanwhile, LSL shows limited gains as the amount of language supervision increases past 10 captions per class."
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"79ac70221fa28e577876425cad627f-C001-12"
],
[
"79ac70221fa28e577876425cad627f-C001-19"
],
[
"79ac70221fa28e577876425cad627f-C001-113"
],
[
"79ac70221fa28e577876425cad627f-C001-115"
],
[
"79ac70221fa28e577876425cad627f-C001-116"
]
],
"cite_sentences": [
"79ac70221fa28e577876425cad627f-C001-12",
"79ac70221fa28e577876425cad627f-C001-19",
"79ac70221fa28e577876425cad627f-C001-113",
"79ac70221fa28e577876425cad627f-C001-115",
"79ac70221fa28e577876425cad627f-C001-116"
]
},
"@SIM@": {
"gold_contexts": [
[
"79ac70221fa28e577876425cad627f-C001-23"
],
[
"79ac70221fa28e577876425cad627f-C001-42",
"79ac70221fa28e577876425cad627f-C001-43",
"79ac70221fa28e577876425cad627f-C001-44"
],
[
"79ac70221fa28e577876425cad627f-C001-100"
],
[
"79ac70221fa28e577876425cad627f-C001-109"
],
[
"79ac70221fa28e577876425cad627f-C001-115"
]
],
"cite_sentences": [
"79ac70221fa28e577876425cad627f-C001-23",
"79ac70221fa28e577876425cad627f-C001-42",
"79ac70221fa28e577876425cad627f-C001-44",
"79ac70221fa28e577876425cad627f-C001-100",
"79ac70221fa28e577876425cad627f-C001-109",
"79ac70221fa28e577876425cad627f-C001-115"
]
},
"@USE@": {
"gold_contexts": [
[
"79ac70221fa28e577876425cad627f-C001-55"
],
[
"79ac70221fa28e577876425cad627f-C001-59"
],
[
"79ac70221fa28e577876425cad627f-C001-115"
]
],
"cite_sentences": [
"79ac70221fa28e577876425cad627f-C001-55",
"79ac70221fa28e577876425cad627f-C001-59",
"79ac70221fa28e577876425cad627f-C001-115"
]
}
}
},
"ABC_ecb6e93a5254b86ef49a5ffd0a52a0_10": {
"x": [
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-2",
"text": "We propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-3",
"text": "Spell checkers, typically developed for native speakers, fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by differences in phonology."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-4",
"text": "Our model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes models for both orthography and pronunciation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-5",
"text": "The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-112",
"text": "The frequency of phone alignments for all utterances in the ERJ are calculated."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-8",
"text": "Spell checkers identify misspellings, select appropriate words as suggested corrections, and rank the suggested corrections so that the likely intended word is high in the list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-9",
"text": "Since traditional spell checkers have been developed with competent native speakers as the target users, they do not appropriately address many types of errors made by nonnative writers and they often fail to suggest the appropriate corrections."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-10",
"text": "Non-native writers of English struggle with many of the same idiosyncrasies of English spelling that cause difficulty for native speakers, but differences between English phonology and the phonology of their native language lead to types of spelling errors not anticipated by traditional spell checkers (Okada, 2004; Mitton and Okada, 2007) ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-11",
"text": "Okada (2004) and Mitton and Okada (2007) investigate spelling errors made by Japanese writers of English as a foreign language (JWEFL)."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-12",
"text": "Okada (2004) identifies two main sources of errors for JWEFL: differences between English and Japanese phonology and differences between the English alphabet and the Japanese romazi writing system, which uses a subset of English letters."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-13",
"text": "Phonological differences result in number of distinctions in English that are not present in Japanese and romazi causes difficulties for JWEFL because the Latin letters correspond to very different sounds in Japanese."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-14",
"text": "We propose a method for creating a model of pronunciation variation from a phonetically untranscribed corpus of read speech recorded by nonnative speakers."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-15",
"text": "The pronunciation variation model is used to generate multiple pronunciations for each canonical pronunciation in a pronouncing dictionary and these variations are used in the spelling correction approach developed by Toutanova and Moore (2002) , which uses statistical models of spelling errors that consider both orthography and pronunciation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-16",
"text": "Several conventions are used throughout this paper: a word is a sequence of characters from the given alphabet found in the word list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-17",
"text": "A word list is a list of words."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-18",
"text": "A misspelling, marked with * , is a sequence of characters not found in the word list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-19",
"text": "A candidate correction is a word from the word list proposed as a potential correction."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-20",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-21",
"text": "**BACKGROUND**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-22",
"text": "Research in spell checking (see Kukich, 1992 , for a survey of spell checking research) has focused on three main problems: non-word error detection, isolated-word error correction, and contextdependent word correction."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-23",
"text": "We focus on the first two tasks."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-24",
"text": "A non-word is a sequence of letters that is not a possible word in the language in any context, e.g., English * thier."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-25",
"text": "Once a sequence of letters has been determined to be a non-word, isolatedword error correction is the process of determining the appropriate word to substitute for the non-word."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-26",
"text": "Given a sequence of letters, there are thus two main subtasks: 1) determine whether this is a nonword, 2) if so, select and rank candidate words as potential corrections to present to the writer."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-27",
"text": "The first subtask can be accomplished by searching for the sequence of letters in a word list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-28",
"text": "The second subtask can be stated as follows (Brill and Moore, 2000) : Given an alphabet \u03a3, a word list D of strings \u2208 \u03a3 * , and a string r / \u2208 D and \u2208 \u03a3 * , find w \u2208 D such that w is the most likely correction."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-29",
"text": "Minimum edit distance is used to select the most likely candidate corrections."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-30",
"text": "The general idea is that a minimum number of edit operations such as insertion and substitution are needed to convert the misspelling into a word."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-31",
"text": "Words requiring the smallest numbers of edit operations are selected as the candidate corrections."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-33",
"text": "**EDIT OPERATIONS AND EDIT WEIGHTS**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-34",
"text": "In recent spelling correction approaches, edit operations have been extended beyond single character edits and the methods for calculating edit operation weights have become more sophisticated."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-35",
"text": "The spelling error model proposed by Brill and Moore (2000) allows generic string edit operations up to a certain length."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-36",
"text": "Each edit operation also has an associated probability that improves the ranking of candidate corrections by modeling how likely particular edits are."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-37",
"text": "Brill and Moore (2000) estimate the probability of each edit from a corpus of spelling errors."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-38",
"text": "Toutanova and Moore (2002) extend Brill and Moore (2000) to consider edits over both letter sequences and sequences of phones in the pronunciations of the word and misspelling."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-39",
"text": "They show that including pronunciation information improves performance as compared to Brill and Moore (2000) ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-41",
"text": "**NOISY CHANNEL SPELLING CORRECTION**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-42",
"text": "The spelling correction models from Brill and Moore (2000) and Toutanova and Moore (2002) use the noisy channel model approach to determine the types and weights of edit operations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-43",
"text": "The idea behind this approach is that a writer starts out with the intended word w in mind, but as it is being written the word passes through a noisy channel resulting in the observed non-word r. In order to determine how likely a candidate correction is, the spelling correction model determines the probability that the word w was the intended word given the misspelling r: P (w|r)."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-44",
"text": "To find the best correction, the word w is found for which P (w|r) is maximized: argmax w P (w|r)."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-45",
"text": "Applying Bayes' Rule and discarding the normalizing constant P (r) gives the correction model:"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-46",
"text": "argmax w P (w|r) = argmax w P (w)P (r|w) P (w), how probable the word w is overall, and P (r|w), how probable it is for a writer intending to write w to output r, can be estimated from corpora containing misspellings."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-47",
"text": "In the following experiments, P (w) is assumed be equal for all words to focus this work on estimating the error model P (r|w) for JWEFL."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-48",
"text": "1 Brill and Moore (2000) allow all edit operations \u03b1 \u2192 \u03b2 where \u03a3 is the alphabet and \u03b1, \u03b2 \u2208 \u03a3 * , with a constraint on the length of \u03b1 and \u03b2."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-49",
"text": "In order to consider all ways that a word w may generate r with the possibility that any, possibly empty, substring \u03b1 of w becomes any, possibly empty, substring \u03b2 of r, it is necessary to consider all ways that w and r may be partitioned into substrings."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-50",
"text": "This error model over letters, called P L , is approximated by Brill and Moore (2000) as shown in Figure 1 by considering only the pair of partitions of w and r with the maximum product of the probabilities of individual substitutions."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-51",
"text": "P art(w) is all possible partitions of w, |R| is number of segments in a particular partition, and R i is the i th segment of the partition."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-52",
"text": "The parameters for P L (r|w) are estimated from a corpus of pairs of misspellings and target words."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-53",
"text": "The method, which is described in detail in Brill and Moore (2000) , involves aligning the letters in pairs of words and misspellings, expanding each alignment with up to N neighboring alignments, and calculating the probability of each \u03b1 \u2192 \u03b2 alignment."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-54",
"text": "Since we will be using a training corpus that consists solely of pairs of misspellings and words (see section 3), we would have lower probabilities for \u03b1 \u2192 \u03b1 than would be found in a corpus with misspellings observed in context with correct words."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-55",
"text": "Figure 1: Approximations of P_L from Brill and Moore (2000) and P_PHL from Toutanova and Moore (2002)."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-56",
"text": "To compensate, we approximate P (\u03b1 \u2192 \u03b1) by assigning it a minimum probability m:"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-58",
"text": "**EXTENDING TO PRONUNCIATION**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-59",
"text": "Toutanova and Moore (2002) describe an extension to Brill and Moore (2000) where the same noisy channel error model is used to model phone sequences instead of letter sequences."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-60",
"text": "Instead of the word w and the non-word r, the error model considers the pronunciation of the non-word r, pron r , and the pronunciation of the word w, pron w ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-61",
"text": "The error model over phone sequences, called P P H , is just like P L shown in Figure 1 except that r and w are replaced with their pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-62",
"text": "The model is trained like P L using alignments between phones."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-63",
"text": "Since a spelling correction model needs to rank candidate words rather than candidate pronunciations, Toutanova and Moore (2002) derive an error model that determines the probability that a word w was spelled as the non-word r based on their pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-64",
"text": "Their approximation of this model, called P P HL , is also shown in Figure 1 ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-65",
"text": "P P H (pron w |pron r ) is the phone error model described above and P (pron r |r) is provided by the letter-to-phone model described below."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-67",
"text": "**LETTER-TO-PHONE MODEL**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-68",
"text": "A letter-to-phone (LTP) model is needed to predict the pronunciation of misspellings for P P HL , since they are not found in a pronouncing dictionary."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-69",
"text": "Like Toutanova and Moore (2002) , we use the n-gram LTP model from Fisher (1999) to predict these pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-70",
"text": "The n-gram LTP model predicts the pronunciation of each letter in a word considering up to four letters of context to the left and right."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-71",
"text": "The most specific context found for each letter and its context in the training data is used to predict the pronunciation of a word."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-72",
"text": "We extended the prediction step to consider the most probable phone for the top M most specific contexts."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-73",
"text": "We implemented the LTP algorithm and trained and evaluated it using pronunciations from CMU-DICT."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-74",
"text": "A training corpus was created by pairing the words from the size 70 CMUDICT-filtered SCOWL word list (see section 3) with their pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-75",
"text": "This list of approximately 62,000 words was split into a training set with 80% of entries and a test set with the remaining 20%."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-76",
"text": "We found that the best performance is seen when M = 3, giving 95.5% phone accuracy and 74.9% word accuracy."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-78",
"text": "**CALCULATING FINAL SCORES**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-79",
"text": "For a misspelling r and a candidate correction w, the letter model P L gives the probability that w was written as r due to the noisy channel taking into account only the orthography."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-80",
"text": "P P H does the same for the pronunciations of r and w, giving the probability that pron w was output was pron r ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-81",
"text": "The pronunciation model P P HL relates the pronunciations modeled by P P H to the orthography in order to give the probability that r was written as w based on pronunciation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-82",
"text": "P L and P P HL are then combined as follows to calculate a score for each candidate correction."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-84",
"text": "**RESOURCES AND DATA PREPARATION**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-85",
"text": "Our spelling correction approach, which includes error models for both orthography and pronunciation (see section 2.2) and which considers pronunciation variation for JWEFL requires a number of resources: 1) spoken corpora of American English (TIMIT, TIMIT 1991) and Japanese English (ERJ, see below) are used to model pronunciation variation, 2) a pronunciation dictionary (CMUDICT, CMUDICT 1998) provides American English pronunciations for the target words, 3) a corpus of spelling errors made by JWEFL (Atsuo-Henry Corpus, see below) is used to train spelling error models and test the spell checker's performance, and 4) Spell Checker Oriented Word Lists (SCOWL, see below) are adapted for our use."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-86",
"text": "The English Read by Japanese Corpus (Minematsu et al., 2002) consists of 70,000 prompts containing phonemic and prosodic cues recorded by 200 native Japanese speakers with varying English competence."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-87",
"text": "See Minematsu et al. (2002) for details on the construction of the corpus."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-88",
"text": "The Atsuo-Henry Corpus (Okada, 2004) includes a corpus of spelling errors made by JWEFL that consists of a collection of spelling errors from multiple corpora."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-89",
"text": "2 For use with our spell checker, the corpus has been cleaned up and modified to fit our task, resulting in 4,769 unique misspellings of 1,046 target words."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-90",
"text": "The data is divided into training (80%), development (10%), and test (10%) sets."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-91",
"text": "For our word lists, we use adapted versions of the Spell Checker Oriented Word Lists."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-92",
"text": "3 The size 50 word lists are used in order to create a general purpose word list that covers all the target words from the Atsuo-Henry Corpus."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-93",
"text": "Since the target pronunciation of each item is needed for the pronunciation model, the word list was filtered to remove words whose pronunciation is not in CMUDICT."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-94",
"text": "After filtering, the word list contains 54,001 words."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-96",
"text": "**METHOD**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-97",
"text": "This section presents our method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-98",
"text": "The pronunciationbased spelling correction approach developed in Toutanova and Moore (2002) requires a list of possible pronunciations in order to compare the pronunciation of the misspelling to the pronunciation of correct words."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-99",
"text": "To account for target pronunciations specific to Japanese speakers, we observe the pronunciation variation in the ERJ and generate additional pronunciations for each word in the word list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-100",
"text": "Since the ERJ is not transcribed, we begin by adapting a recognizer trained on native English speech."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-101",
"text": "First, the ERJ is recognized using a monophone recognizer trained on TIMIT."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-102",
"text": "Next, the most frequent variations between the canonical and recognized pronunciations are used to adapt the recognizer."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-103",
"text": "The adapted recognizer is then used to recognize the ERJ in forced alignment with the canonical pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-104",
"text": "Finally, the variations from the previous step are used to create models of pronunciation variation for each phone, which are used to generate multiple pronunciations for each word."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-105",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-106",
"text": "**INITIAL RECOGNIZER**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-107",
"text": "A monophone speech recognizer was trained on all TIMIT data using the Hidden Markov Model Toolkit (HTK)."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-108",
"text": "4 This recognizer is used to generate a phone string for each utterance in the ERJ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-109",
"text": "Each recognized phone string is then aligned with the canonical pronunciation provided to the speakers."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-110",
"text": "Correct alignments and substitutions are considered with no context and insertions are conditioned on the previous phone."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-111",
"text": "Due to restrictions in HTK, deletions are currently ignored."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-113",
"text": "Because of the low phone accuracy of monophone recognizers, especially on non-native speech, alignments are observed between nearly all pairs of phones."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-114",
"text": "In order to focus on the most frequent alignments common to multiple speakers and utterances, any alignment observed less than 20% as often as the most frequent alignment for that canonical phone is discarded, which results in an average of three variants of each phone."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-116",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-117",
"text": "**ADAPTING THE RECOGNIZER**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-118",
"text": "Now that we have probability distributions over observed phones, the HMMs trained on TIMIT are modified as follows to allow the observed variation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-119",
"text": "To allow, for instance, variation between p and th, the states for th from the original recognizer are inserted into the model for p as a separate path."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-120",
"text": "The resulting phone model is shown in Figure 2 ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-121",
"text": "The transition probabilities into the first states"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-122",
"text": "Figure 2: Adapted phone model for p accounting for variation between p and th of the phones come from the probability distribution observed in the initial recognition step."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-123",
"text": "The transition probabilities between the three states for each variant phone remain unchanged."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-124",
"text": "All HMMs are adapted in this manner using the probability distributions from the initial recognition step."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-125",
"text": "The adapted HMMs are used to recognize the ERJ Corpus for a second time, this time in forced alignment with the canonical pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-126",
"text": "The state transitions indicate which variant of each phone was recognized and the correspondences between the canonical phones and recognized phones are used to generate a new probability distribution over observed phones for each canonical phone."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-127",
"text": "These are used to find the most probable pronunciation variations for a native-speaker pronouncing dictionary."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-129",
"text": "**GENERATING PRONUNCIATIONS**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-130",
"text": "The observed phone variation is used to generate multiple pronunciations for each pronunciation in the word list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-131",
"text": "The OpenFst Library 6 is used to find the most probable pronunciations in each case."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-132",
"text": "First, FSTs are created for each phone using the probability distributions from the previous section."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-133",
"text": "Next, an FST is created for the entire word by concatenating the FSTs for the pronunciation from CMU-DICT."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-134",
"text": "The pronunciations corresponding to the best n paths through the FST and the original canonical pronunciation become possible pronunciations in the extended pronouncing dictionary."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-135",
"text": "The size 50 word list contains 54,001 words and when expanded to contain the top five variations of each pronunciation, there are 255,827 unique pronunciations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-137",
"text": "**RESULTS**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-138",
"text": "In order to evaluate the effect of pronunciation variation in Toutanova and Moore (2002) 's spelling correction approach, we compare the performance of the pronunciation model and the combined model with and without pronunciation variation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-139",
"text": "We implemented the letter and pronunciation spelling correction models as described in section 2.2."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-140",
"text": "The letter error model P L and the phone error model P P H are trained on the training set."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-141",
"text": "The development set is used to tune the parameters introduced in previous sections."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-142",
"text": "7 In order to rank the words as candidate corrections for a misspelling r, P L (r|w) and P P HL (r|w) are calculated for each word in the word list using the algorithm described in Brill and Moore (2000) ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-143",
"text": "Finally, P L and P P HL are combined using S CM B to rank each word."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-144",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-145",
"text": "**BASELINE**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-146",
"text": "The open source spell checker GNU Aspell 8 is used to determine the baseline performance of a traditional spell checker using the same word list."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-147",
"text": "An Aspell dictionary was created with the word list described in section 3."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-148",
"text": "Aspell's performance is shown in Table 1 ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-149",
"text": "The 1-Best performance is the percentage of test items for which the target word was the first candidate correction, 2-Best is the percentage for which the target was in the top two, etc."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-151",
"text": "**EVALUATION OF PRONUNCIATION VARIATION**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-152",
"text": "The effect of introducing pronunciation variation using the method described in section 4 can be evaluated by examining the performance on the test set for P P HL with and without the additional variations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-153",
"text": "The results in Table 1 show that the addition of pronunciation variations does indeed improve the performance of P P HL across the board."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-154",
"text": "The 1-Best, 3-Best, and 4-Best cases for P P HL with variation show significant improvement (p<0.05) over P P HL without variation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-156",
"text": "**EVALUATION OF THE COMBINED MODEL**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-157",
"text": "We evaluated the effect of including pronunciation variation in the combined model by comparing the performance of the combined model with and without pronunciation variation, see results in Table 1 ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-158",
"text": "Despite the improvements seen in P P HL with pronunciation variation, there are no significant differences between the results for the combined model with and without variation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-159",
"text": "The combined model with variation is also not significantly different from the letter model P L except for the drop in the 4-Best case."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-160",
"text": "To illustrate the performance of each model, the ranked lists in Table 2 give an example of the candidate corrections for the misspelling of any as * eney."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-161",
"text": "Aspell preserves the initial letter of the misspelling and vowels in many of its candidates."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-162",
"text": "P L 's top candidates also overlap a great deal in orthography, but there is more initial letter and vowel variation."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-163",
"text": "As we would predict, P P HL ranks any as the top correction, but some of the lower-ranked candidates for P P HL differ greatly in length."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-164",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-165",
"text": "**SUMMARY OF RESULTS**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-166",
"text": "The noisy channel spelling correction approach developed by Brill and Moore (2000) and Toutanova and Moore (2002) appears well-suited for writers of English as a foreign language."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-167",
"text": "The letter and combined models outperform the traditional spell checker Aspell by a wide margin."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-168",
"text": "Although including pronunciation variation does not improve the combined model, it leads to significant improvements in the pronunciation-based model P P HL ."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-170",
"text": "**CONCLUSION**"
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-171",
"text": "We have presented a method for modeling pronunciation variation from a phonetically untranscribed corpus of read non-native speech by adapting a monophone recognizer initially trained on native speech."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-172",
"text": "This model allows a native pronouncing dictionary to be extended to include non-native pronunciation variations."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-173",
"text": "We incorporated a pronouncing dictionary extended for Japanese writers of English into the spelling correction model developed by Toutanova and Moore (2002) , which combines orthography-based and pronunciation-based models."
},
{
"sent_id": "ecb6e93a5254b86ef49a5ffd0a52a0-C001-174",
"text": "Although the extended pronunciation dictionary does not lead to improvement in the combined model, it does leads to significant improvement in the pronunciation-based model."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-23",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-24",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-25",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-26",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-27",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-28"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-142"
]
],
"cite_sentences": [
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-28",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-142"
]
},
"@BACK@": {
"gold_contexts": [
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-35"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-37"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-38"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-39"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-42"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-48"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-50"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-53"
],
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-59"
]
],
"cite_sentences": [
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-35",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-37",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-38",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-39",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-42",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-48",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-50",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-53",
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-59"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-166"
]
],
"cite_sentences": [
"ecb6e93a5254b86ef49a5ffd0a52a0-C001-166"
]
}
}
},
"ABC_c4e2a9322471fb5988a5bd737fa51e_10": {
"x": [
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-2",
"text": "Existing target-specific sentiment recognition methods consider only a single target per tweet, and have been shown to miss nearly half of the actual targets mentioned."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-3",
"text": "We present a corpus of UK election tweets, with an average of 3.09 entities per tweet and more than one type of sentiment in half of the tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-4",
"text": "This requires a method for multi-target specific sentiment recognition, which we develop by using the context around a target as well as syntactic dependencies involving the target."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-5",
"text": "We present results of our method on both a benchmark corpus of single targets and the multi-target election corpus, showing state-of-the art performance in both corpora and outperforming previous approaches to multi-target sentiment task as well as deep learning models for singletarget sentiment."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-8",
"text": "Recent years have seen increasing interest in mining Twitter to assess public opinion on political affairs and controversial issues (Tumasjan et al., May 2010; Wang et al., 2012) as well as products and brands (Pak and Paroubek, 2010) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-9",
"text": "Opinion mining from Twitter is usually achieved by determining the overall sentiment expressed in an entire tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-10",
"text": "However, inferring the sentiment towards specific targets (e.g. people or organisations) is severely limited by such an approach since a tweet may contain different types of sentiment expressed towards each of the targets mentioned."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-11",
"text": "An early study by Jiang et al. (2011) showed that 40% of classification errors are caused by using tweetlevel approaches that are independent of the target."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-12",
"text": "Consider the tweet: \"I will b voting 4 Greens ... 1st reason: 2 remove 2 party alt. of labour or conservative every 5 years."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-13",
"text": "2nd: fracking\""
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-14",
"text": "The overall sentiment is positive but there is a negative sentiment towards \"labour\", \"conservative\" and \"fracking\" and a positive sentiment towards \"Greens\"."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-15",
"text": "Examples like this are common in tweets discussing topics like politics."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-16",
"text": "As has been demonstrated by the failure of election polls in both referenda and general elections (Burnap et al., 2016) , it is important to understand not only the overall mood of the electorate, but also to distinguish and identify sentiment towards different key issues and entities, many of which are discussed on social media on the run up to elections."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-17",
"text": "Recent developments on target-specific Twitter sentiment classification have explored different ways of modelling the association between target entities and their contexts."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-18",
"text": "Jiang et al. (2011) propose a rule-based approach that utilises dependency parsing and contextual tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-19",
"text": "Dong et al. (2014) , Tang et al. (2016a) and Zhang et al. (2016) have studied the use of different recurrent neural network models for such a task but the gain in performance from the complex neural architectures is rather unclear 1 In this work we introduce the multi-targetspecific sentiment recognition task, building a corpus of tweets from the 2015 UK general election campaign suited to the task."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-20",
"text": "In this dataset, target entities have been semi-automatically selected, and sentiment expressed towards multiple target entities as well as high-level topics in a tweet have been manually annotated."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-21",
"text": "Unlike all existing studies on target-specific Twitter sentiment analysis, we move away from the assumption that each tweet mentions a single target; we introduce a more realistic and challenging task of identifying sentiment towards multiple targets within a tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-22",
"text": "To tackle this task, we propose TDParse, a method that divides a tweet into different segments building on the approach introduced by Vo and Zhang (2015) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-23",
"text": "TDParse exploits a syntactic dependency parser designed explicitly for tweets (Kong et al., 2014) , and combines syntactic information for each target with its left-right context."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-24",
"text": "We evaluate and compare our proposed system both on our new multi-target UK election dataset, as well as on the benchmarking dataset for single-target dependent sentiment (Dong et al., 2014) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-25",
"text": "We show a clear state-of-the-art performance of TDParse over existing approaches for tweets with multiple targets, which encourages further research on the multi-target-specific sentiment recognition task."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-26",
"text": "2 2 Related Work: Target-dependent Sentiment Classification on Twitter"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-27",
"text": "The 2015 Semeval challenge introduced a task on target-specific Twitter sentiment (Rosenthal et al., 2015) which most systems (Boag et al., 2015; Plotnikova et al., 2015) treated in the same way as tweet level sentiment."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-28",
"text": "The best performing system in the 2016 Semeval Twitter challenge substask B (Nakov et al., 2016) , named Tweester, also performs on tweet level sentiment classification."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-29",
"text": "This is unsurprising since tweets in both tasks only contain a single predefined target entity and as a result often a tweet-level approach is sufficient."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-30",
"text": "An exception to tweet level approaches for this task, showing promise, is Townsend et al. (2015) , who trained a SVM classifier for tweet segmentation, then used a phrase-based sentiment classifier for assigning sentiment around the target."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-31",
"text": "The Semeval aspect-based sentiment analysis task (Pontiki et al., 2015; Pateria and Choubey, 2016) aims to identify sentiment towards entityattribute pairs in customer reviews."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-32",
"text": "This differs from our goal in the following way: both the entities and attributes are limited to a predefined inventory of limited size; they are aspect categories reflected in the reviews rather than specific targets, while each review only has one target entity, e.g. a laptop or a restaurant."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-33",
"text": "Also sentiment classification in formal text such as product reviews 2 The data and code can be found at https://goo.gl/ S2T1GO is very different from that in tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-34",
"text": "Recently Vargas et al. (2016) analysed the differences between the overall and target-dependent sentiment of tweets for three events containing 30 targets, showing many significant differences between the corresponding overall and target-dependent sentiment labels, thus confirming that these are distinct tasks."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-35",
"text": "Early work tackling target-dependent sentiment in tweets (Jiang et al., 2011) designed targetdependent features manually, relying on the syntactic parse tree and a set of grammar-based rules, and incorporating the sentiment labels of related tweets to improve the classification performance."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-36",
"text": "Recent work (Dong et al., 2014) used recursive neural networks and adaptively chose composition functions to combine child feature vectors according to their dependency type, to reflect sentiment signal propagation to the target."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-37",
"text": "Their data-driven composition selection approach relies on the dependency types as features and a small set of rules for constructing target-dependent trees."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-38",
"text": "Their manually annotated dataset contains only one target per tweet and has since been used for benchmarking by several subsequent studies (Vo and Zhang, 2015; Tang et al., 2016a; Zhang et al., 2016) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-39",
"text": "Vo and Zhang (2015) exploit the left and right context around a target in a tweet and combine low-dimensional embedding features from both contexts and the full tweet using a number of different pooling functions."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-40",
"text": "Despite not fully capturing semantic and syntactic information for the given target entity, they show much better performance than Dong et al. (2014), indicating that useful signals in relation to the target can be drawn from such a context representation."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-41",
"text": "Both Tang et al. (2016a) and Zhang et al. (2016) integrate the left-right target-dependent context into their respective recurrent neural network (RNN) models."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-42",
"text": "While Tang et al. (2016a) propose two long short-term memory (LSTM) models showing competitive performance to Vo and Zhang (2015), Zhang et al. (2016) design a gated neural network layer between the left and right context in a deep neural network structure, but require a combination of three corpora for training and evaluation."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-43",
"text": "Results show that conventional neural network models like LSTM are incapable of explicitly capturing important context information of a target (Tang et al., 2016b) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-44",
"text": "Tang et al. (2016a) also experiment with adding attention layers for LSTM but fail to achieve competitive results possibly due to the small training corpus."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-45",
"text": "Going beyond the existing work we study the more challenging task of classifying sentiment towards multiple target entities within a tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-46",
"text": "Using syntactic information drawn from tweet-specific parsing, in conjunction with the left-right contexts, we achieve state-of-the-art performance in both single- and multi-target classification tasks."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-47",
"text": "We also show that the tweet-level approach that many sentiment systems adopted in both Semeval challenges fails to capture all target sentiments in a multi-target scenario (Section 5.1)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-49",
"text": "**CREATING A CORPUS FOR TARGET SPECIFIC SENTIMENT IN TWITTER**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-50",
"text": "We describe the design, collection and annotation of a corpus of tweets about the 2015 UK election."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-52",
"text": "**DATA HARVESTING AND ENTITY RECOGNITION**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-53",
"text": "We collected a corpus of tweets about the UK elections, as we wanted to select a political event that would trigger discussions on multiple entities and topics."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-54",
"text": "Collection was performed through Twitter's streaming API, tracking 14 hashtags [3]."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-55",
"text": "Data harvesting was performed between 7th February and 30th March 2015."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-56",
"text": "This led to the collection of 712k tweets, from which a subset was sampled for manual annotation of target-specific sentiment."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-57",
"text": "We also created a list of 438 topic keywords relevant to 9 popular election issues [4] for data sampling."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-58",
"text": "The initial list of 438 seed words provided by a team of journalists was augmented by searching for similar words within a vector space on the basis of cosine similarity."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-59",
"text": "Keywords are used both to identify thematically relevant tweets and to serve as targets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-60",
"text": "We also consider named entities as targets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-61",
"text": "Sampling of tweets was performed by removing retweets and making sure each tweet contained at least one topic keyword from one of the 9 election issues, leading to 52,190 highly relevant tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-62",
"text": "For the latter we ranked tweets based on a \"similarity\" relation, where \"similarity\" is measured as a function of content overlap (Mihalcea, 2004) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-63",
"text": "Formally, given a tweet S_i represented by the set of N words that appear in it, W_1^i, ..., W_N^i, and our list of curated topic keywords T, the ranking function is defined as: [Footnote 3: #ukelection2015, #ge2015, #ukge2015, #ukgeneralelection2015, #bbcqt, #bbcsp, #bbcdp, #marrshow, #generalelection2015, #ge15, #generalelection, #electionuk, #ukelection and #electionuk2015. Footnote 4: EU and immigration, economy, NHS, education, crime, housing, defense, public spending, environment and energy.]"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-65",
"text": "where |S_i| is the total number of words in the tweet; unlike Mihalcea (2004), we prefer longer tweets."
},
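The ranking equation itself is not preserved in this extraction. A plausible sketch of a keyword-overlap ranking that, as the text says, prefers longer tweets is shown below; `rank_tweet` is a hypothetical name and the exact scoring function is an assumption, not the paper's formula.

```python
import math

def rank_tweet(tweet_words, topic_keywords):
    """Hypothetical keyword-overlap ranking (the paper's exact equation is
    not reproduced here). Scores a tweet by the fraction of its words that
    match curated topic keywords, scaled by log length so that, unlike
    Mihalcea (2004), longer tweets are preferred."""
    matches = sum(1 for w in tweet_words if w.lower() in topic_keywords)
    return (matches / len(tweet_words)) * math.log(len(tweet_words) + 1)
```

Under this sketch, two tweets with the same keyword-match fraction are ranked so that the longer one scores higher, matching the stated preference for longer tweets.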
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-66",
"text": "We used exact matching with flexibility on the special characters at either end."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-67",
"text": "TF-IDF normalisation and cosine similarity were then applied to the dataset to remove very similar tweets (empirically we set the cosine similarity threshold to 0.6)."
},
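The TF-IDF normalisation and cosine-similarity deduplication step can be sketched as follows; `tfidf_vectors`, `cosine` and `deduplicate` are hypothetical helper names, and the greedy keep-first policy is an assumption (the paper only states the 0.6 threshold).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple TF-IDF vectors (term frequency x inverse document frequency)
    for a list of tokenised documents, as sparse dicts."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def deduplicate(docs, threshold=0.6):
    """Keep a tweet only if its cosine similarity to every previously kept
    tweet is at or below the threshold (0.6 in the paper)."""
    vecs = tfidf_vectors(docs)
    kept = []
    for i, v in enumerate(vecs):
        if all(cosine(v, vecs[j]) <= threshold for j in kept):
            kept.append(i)
    return [docs[i] for i in kept]
```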
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-68",
"text": "We also collected all external URLs mentioned in our dataset and their web content throughout the data harvesting period, filtering out tweets that only contain an external link or snippets of a web page."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-69",
"text": "Finally we sampled 4,500 top-ranked tweets keeping the representation of tweets mentioning each election issue proportionate to the original dataset."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-70",
"text": "For annotation we considered sentiment towards two types of targets: entities and topic keywords."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-71",
"text": "Entities were processed in two ways: firstly, named entities (people, locations, and organisations) were automatically annotated by combining the output of Stanford Named Entity Recognition (NER) (Finkel et al., 2005) , NLTK NER (Bird, 2006 ) and a Twitter-specific NER (Ritter et al., 2011) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-72",
"text": "All three were combined for a more complete coverage of entities mentioned in tweets and subsequently corrected by removing wrongly marked entities through manual annotation."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-73",
"text": "Secondly, to make sure we covered all key entities in the tweets, we also matched tweets against a manually curated list of 7 political-party names and added users mentioned therein as entities."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-74",
"text": "The second type of targets matched the topic keywords from our curated list."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-76",
"text": "**MANUAL ANNOTATION OF TARGET SPECIFIC SENTIMENT**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-77",
"text": "We developed a tool for manual annotation of sentiment towards the targets (i.e. entities and topic keywords) mentioned in each tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-78",
"text": "The annotation was performed by nine PhD-level journalism students, each of them annotating approximately a ninth of the dataset, i.e. 500 tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-79",
"text": "Additionally, they annotated a common subset of 500 tweets consisting of 2,197 target entities, which was used to measure inter-annotator agreement (IAA)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-80",
"text": "Annotators were shown detailed guidelines [5] before taking up the task, after which they were redirected to the annotation tool itself (see Figure 1)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-81",
"text": "Tweets were shown to annotators one by one, and they had to complete the annotation of all targets in a tweet to proceed."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-82",
"text": "The tool shows a tweet with the targets highlighted in bold."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-83",
"text": "Possible annotation actions consisted of: (1) marking the sentiment for a target as positive, negative, or neutral, (2) marking a target as mistakenly highlighted (i.e. 'doesnotapply') and hence removing it, and (3) highlighting new targets that our preprocessing step had missed, and associating a sentiment value with them."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-84",
"text": "In this way we obtained a corrected list of targets for each tweet, each with an associated sentiment value."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-85",
"text": "We measure inter-annotator agreement in two different ways."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-86",
"text": "On the one hand, annotators achieved \u03ba = 0.345 (z = 92.2, p < 0.0001), i.e. fair agreement [6], when choosing targets to be added or removed."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-87",
"text": "On the other hand, they achieved a similar score of \u03ba = 0.341 (z = 77.7, p < 0.0001), also fair agreement, when annotating the sentiment of the resulting targets."
},
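For the two-annotator case, the kappa statistic reported here can be sketched as Cohen's kappa: observed agreement corrected for the agreement expected by chance. This is a minimal illustration, assuming pairwise comparison; with nine annotators the paper may instead use a multi-rater variant such as Fleiss' kappa, and `cohens_kappa` is a hypothetical name.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is
    chance agreement from each annotator's label distribution."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[l] * cb[l] for l in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Values around 0.34, as reported above, fall in the conventional "fair agreement" band (0.21-0.40).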
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-88",
"text": "It is worth noting that the sentiment annotation for each target involves choosing not only among positive/negative/neutral but also a fourth category, 'doesnotapply'."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-89",
"text": "The resulting dataset contains 4,077 tweets, with an average of 3.09 entity mentions (targets) per tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-90",
"text": "As many as 3,713 tweets have more than a single entity mention (target) per tweet, which makes the task different from Semeval 2015 Task 10 subtask C (Rosenthal et al., 2015) and the target-dependent benchmarking dataset of Dong et al. (2014), where each tweet has only one target annotated and thus one sentiment label assigned."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-91",
"text": "The number of targets in the 4,077 tweets to be annotated originally amounted to 12,874."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-92",
"text": "However, the annotators unhighlighted 975 of them, and added 688 new ones, so that the final number of targets in the dataset is 12,587."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-93",
"text": "These are distributed as follows: 1,865 are positive, 4,707 are neutral, and 6,015 are negative."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-94",
"text": "This distribution reflects the tendency of a theme like politics, where users tend to hold more negative opinions."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-95",
"text": "This is different from the Semeval dataset, which has a majority of neutral sentiment."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-96",
"text": "Looking at the annotations provided for different targets within each tweet, we observe that 2,051 tweets (50.3%) have all their targets consistently annotated with a single sentiment value, 1,753 tweets (43.0%) have two different sentiments, and 273 tweets (6.7%) have three different sentiment values."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-97",
"text": "These statistics suggest that providing a single sentiment for the entire tweet would not be appropriate in nearly half of the cases, confirming earlier observations (Jiang et al., 2011)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-98",
"text": "We also labelled each tweet with the topics it contains from the 9 election issues, and asked the annotators to mark the author's sentiment towards each topic."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-99",
"text": "Unlike entities, topics may not be directly present in tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-100",
"text": "We compare topic sentiment with target/entity sentiment for 3,963 tweets from our dataset, adopting the approach of Vargas et al. (2016)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-101",
"text": "Table 1 reports the individual c(s_target), c(s_topic) and joint c(s_target, s_topic) distributions of the target/entity sentiment s_target and topic sentiment s_topic."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-102",
"text": "While c(s_target) and c(s_topic) report how often each sentiment category occurs in the dataset, the joint distribution c(s_target, s_topic) (the inner portion of the table) shows the discrepancies between target and topic sentiments."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-103",
"text": "We observe marked differences between the two sentiment labels."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-104",
"text": "For example, topic sentiment is more often neutral (1,438.7 vs. 1,104.1) and less often negative (1,930.7 vs. 2,285.5) than target sentiment."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-105",
"text": "There are also tweets expressing neutrality towards the topics mentioned but polarised sentiment towards targets (i.e. we observe c(s_topic = neu \u2229 s_target = neg) = 258.6 and c(s_topic = neu \u2229 s_target = pos) = 101.4), and vice versa."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-106",
"text": "This emphasises the importance of distinguishing target entity sentiment not only on the basis of overall tweet sentiment but also in terms of sentiment towards a topic."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-107",
"text": "Firstly we adopt the context-based approach by Vo and Zhang (2015) , which divides each tweet into three parts (left context, target and right context), and where the sentiment towards a target entity results from the interaction between its left and right contexts."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-108",
"text": "This sentiment signal is drawn by mapping all the words in each context into low-dimensional vectors (i.e. word embeddings) using pre-trained embedding resources, and applying neural pooling functions to extract useful features."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-109",
"text": "Such a context set-up does not fully capture the syntactic information of the tweet and the given target entity, and by adding features from the full tweet (as done by Vo and Zhang (2015)), interactions between the left and right contexts are only implicitly modelled."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-110",
"text": "Here we use a syntactic dependency parser designed explicitly for tweets (Kong et al., 2014) to find the parts of the tweet syntactically connected to each target."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-111",
"text": "We then extract word embedding features from these syntactically dependent tokens [D_1, ..., D_n] along the dependency path in the parse tree to the target [7], as well as from the left-target-right contexts (i.e. L-T-R)."
},
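The extraction of syntactically connected tokens can be illustrated on a head-index representation of a dependency parse (the output format of parsers such as the one by Kong et al. (2014)). This is a simplified reading of the paper's parse-based context: `dependency_context` is a hypothetical name, and collecting the target's ancestors plus its descendants is an assumption about the exact extraction rule.

```python
def dependency_context(heads, target_idx):
    """Indices of tokens syntactically connected to the target: its
    ancestors up the head chain plus all of its descendants.
    heads[i] is the index of token i's head, or -1 for the root."""
    # ancestors: follow the head chain from the target to the root
    ancestors, i = [], heads[target_idx]
    while i != -1:
        ancestors.append(i)
        i = heads[i]

    # descendants: tokens whose own head chain passes through the target
    def dominated(j):
        while j != -1:
            if j == target_idx:
                return True
            j = heads[j]
        return False

    descendants = [j for j in range(len(heads))
                   if j != target_idx and dominated(j)]
    return sorted(set(ancestors + descendants))
```

For the toy parse of "I love Nicki music" with heads [1, -1, 1, 2] and target "Nicki" (index 2), the connected context is the head "love" (index 1) and the dependent "music" (index 3).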
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-112",
"text": "Feature vectors generated from different contexts are concatenated into a final feature vector as shown in (2), where P(X) denotes a list of k different pooling functions applied to an embedding matrix X. [Footnote 7: Empirically, the proximity/location of such syntactic relations has not made much difference when used in feature weighting and is thus ignored.]"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-113",
"text": "Not only does this proposed framework make the learning process efficient, without labour-intensive manual feature engineering or heavy architecture engineering for neural models; it also shows that complex syntactic and semantic information can be effectively drawn by simply concatenating different types of context together, without the use of deep learning (other than pre-trained word embeddings)."
},
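The pooled-context feature construction described above can be sketched as follows, using the pooling functions named later in the experimental settings (max, min, mean, standard deviation and product). `pool` and `target_features` are hypothetical names, and the concatenation order of the contexts is an assumption, since equation (2) itself is not reproduced here.

```python
import numpy as np

def pool(X):
    """Apply the pooling functions (max, min, mean, std, product)
    column-wise to an embedding matrix X (tokens x dims) and concatenate
    the results into one vector."""
    X = np.asarray(X, dtype=float)
    return np.concatenate([X.max(0), X.min(0), X.mean(0), X.std(0), X.prod(0)])

def target_features(left, target, right, dep):
    """Concatenate pooled features from the dependency context and the
    left-target-right contexts into a single feature vector (the order
    here is an assumed instantiation of equation (2))."""
    return np.concatenate([pool(c) for c in (dep, left, target, right)])
```

With d-dimensional embeddings and k = 5 pooling functions, each context contributes a 5d-dimensional block, so four contexts yield a 20d-dimensional final vector.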
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-114",
"text": "Data set: We evaluate and compare our proposed system to the state-of-the-art baselines on a benchmarking corpus (Dong et al., 2014 ) that has been used by several previous studies (Vo and Zhang, 2015; Tang et al., 2016a; Zhang et al., 2016) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-115",
"text": "This corpus contains 6,248 training tweets and 692 testing tweets, with a sentiment class balance of 25% negative, 50% neutral and 25% positive."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-116",
"text": "Although the original corpus annotates only one target per tweet, without specifying the location of the target, we expand this notion to consider cases where the target entity may appear more than once at different locations in the tweet, e.g.:"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-117",
"text": "\"Nicki Minaj has brought back the female rapper. -really? Nicki Minaj is the biggest parody in popular music since the Lonely Island.\""
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-118",
"text": "Semantically it is more appropriate and meaningful to consider both target appearances when determining the sentiment polarity of \"Nicki Minaj\" expressed in this tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-119",
"text": "While it is not clear whether Dong et al. (2014) and Tang et al. (2016a) have considered this realistic same-target-multi-appearance scenario, Vo and Zhang (2015) and Zhang et al. (2016) do not take it into account when extracting target-dependent contexts."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-120",
"text": "Contrary to these studies we extend our system to fully incorporate the situation where a target appears multiple times at different locations in the tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-121",
"text": "We add another pooling layer in (2), where we apply a medium pooling function to combine the feature vectors extracted from each target appearance into the final feature vector for the sentiment classification of such targets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-122",
"text": "Now the feature extraction function P (X) in (2) becomes:"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-123",
"text": "where m is the number of appearances of the target and P medium represents the dimension-wise medium pooling function."
},
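This extra pooling layer can be sketched as below. The interpretation of the dimension-wise "medium" pooling as an element-wise median across appearances is an assumption, as is the name `multi_appearance_features`.

```python
import numpy as np

def multi_appearance_features(appearance_vecs):
    """Combine the feature vectors extracted at each of the m appearances
    of a target via a dimension-wise median (one reading of the paper's
    'medium' pooling), yielding one final vector for classification."""
    return np.median(np.stack(appearance_vecs), axis=0)
```

For example, two appearances with mildly conflicting feature values are merged into a single representative vector, so the classifier sees one decision per target rather than per mention.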
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-124",
"text": "Models: To investigate different ways of modelling target-specific context and to evaluate the benefit of incorporating the same-target-multi-appearance scenario, we build the following models:"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-125",
"text": "\u2022 Semeval-best: is a tweet-level model using various types of features, namely ngrams, lexica and word embeddings with extensive data pre-processing and feature engineering."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-126",
"text": "We use this model as a target-independent baseline as it approximates and beats the best performing system (Boag et al., 2015) in Semeval 2015 task 10."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-127",
"text": "It also outperforms the highest ranking system, Tweester, on the Semeval 2016 corpus (by +4.0% in macro-averaged recall) and therefore constitutes a state-of-the-art tweet-level baseline."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-128",
"text": "\u2022 Naive-seg models: Naive-seg- slices each tweet into a sequence of sub-sentences using punctuation (i.e. ',', '.', '?', '!')."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-129",
"text": "Embedding features are extracted from each subsentence and pooling functions are applied to combine word vectors."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-130",
"text": "Naive-seg extends it by adding features extracted from the left-target-right contexts, while Naive-seg+ extends Naive-seg by adding lexicon-filtered sentiment features."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-131",
"text": "\u2022 TDParse models: as described in Section 4.1."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-132",
"text": "TDParse- uses a dependency parser to extract a syntactic parse tree to the target and maps all child nodes to low-dimensional vectors."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-133",
"text": "Final feature vectors for each target are generated using neural pooling functions."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-134",
"text": "While TDParse extends it by adding features extracted from the left-target-right contexts, TDParse+ uses three sentiment lexica for filtering words."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-135",
"text": "TDParse+ (m) differs from TDParse+ by taking into account the 'same-target-multi-appearance' scenario."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-136",
"text": "Both TDParse+ and TDParse+ (m) outperform state-of-the-art target-specific models."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-137",
"text": "\u2022 TDPWindow-N: the same as TDParse+ with a window to constrain the left-right context."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-138",
"text": "For example if N = 3 then we only consider 3 tokens on each side of the target when extracting features from the left-right context."
},
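The windowed context used by TDPWindow-N can be sketched directly; `windowed_contexts` is a hypothetical name for the step of trimming the left and right contexts to the N tokens nearest the target before feature extraction.

```python
def windowed_contexts(tokens, target_start, target_len, n=3):
    """TDPWindow-N style context: keep only the n tokens immediately to
    the left and to the right of the target span."""
    left = tokens[max(0, target_start - n):target_start]
    right = tokens[target_start + target_len:target_start + target_len + n]
    return left, right
```

With N = 3 this reproduces the example above: only 3 tokens on each side of the target are considered when extracting features from the left-right context.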
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-140",
"text": "**EXPERIMENTAL SETTINGS**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-141",
"text": "To compare our proposed models with Vo and Zhang (2015), we have used the same pre-trained embedding resources and pooling functions (i.e. max, min, mean, standard deviation and product)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-142",
"text": "For classification we have used LIBLINEAR (Fan et al., 2008), which implements a linear SVM."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-143",
"text": "In tuning the cost factor C we perform five-fold cross validation on the training data over the same set of parameter values for both Vo and Zhang (2015) 's implementation and our system."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-144",
"text": "This makes sure our proposed models are comparable with those of Vo and Zhang (2015) ."
},
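The cost-factor tuning described above can be sketched with scikit-learn's linear SVM, which wraps LIBLINEAR, in place of calling LIBLINEAR directly; the C grid and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Five-fold cross-validation on the training data over a shared grid of
# cost-factor values C (the grid itself is a hypothetical example).
grid = GridSearchCV(LinearSVC(), {"C": [0.001, 0.01, 0.1, 1, 10]}, cv=5)

# Toy training data standing in for the pooled embedding features:
# the label is linearly separable from the first feature dimension.
X = np.random.RandomState(0).randn(60, 8)
y = (X[:, 0] > 0).astype(int)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

Running both systems over the same C grid and the same folds, as the text describes, keeps the comparison with Vo and Zhang (2015) fair.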
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-145",
"text": "Evaluation metrics: We follow previous work on target-dependent Twitter sentiment classification and report performance in accuracy, 3-class macro-averaged (i.e. negative, neutral and positive) F1 score, and 2-class macro-averaged (i.e. negative and positive) F1 score [8], as used by the Semeval competitions (Rosenthal et al., 2015) for measuring Twitter sentiment classification performance."
},
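The three reported metrics can be computed as below; `evaluate` is a hypothetical helper, using scikit-learn's `accuracy_score` and `f1_score` with the `labels` argument to restrict the 2-class macro F1 to the negative and positive classes.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy, 3-class macro-averaged F1 (neg/neu/pos) and 2-class
    macro-averaged F1 (neg/pos only), as used in the evaluation."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1_3class": f1_score(y_true, y_pred, average="macro",
                              labels=["neg", "neu", "pos"]),
        "f1_2class": f1_score(y_true, y_pred, average="macro",
                              labels=["neg", "pos"]),
    }
```

The 2-class variant ignores the neutral class entirely, which is why it is the harder score to improve on a heavily neutral corpus.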
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-147",
"text": "**EXPERIMENTAL RESULTS AND COMPARISON WITH OTHER BASELINES**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-148",
"text": "We report our experimental results in Table 2 on the single-target benchmarking corpus (Dong et al., 2014), with three model categories: 1) tweet-level target-independent models, 2) target-dependent models without considering the 'same-target-multi-appearance' scenario, and 3) target-dependent models incorporating the 'same-target-multi-appearance' scenario."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-149",
"text": "We include the models presented in the previous section as well as models for target specific sentiment from the literature where possible."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-150",
"text": "Among the target-independent baseline models, Target-ind (Vo and Zhang, 2015) and Semeval-best have shown strong performance compared with SSWE and SVM-ind (Jiang et al., 2011), as they use more features, especially rich automatic features using the embeddings of Mikolov et al. (2013)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-151",
"text": "Interestingly, they also perform better than some of the target-dependent baseline systems, namely SVM-dep (Jiang et al., 2011), Recursive NN and AdaRNN (Dong et al., 2014), showing the difficulty of fully extracting and incorporating target information in tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-152",
"text": "Basic LSTM models (Tang et al., 2016a) completely ignore such target information and as a result do not perform as well."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-153",
"text": "Among the target-dependent systems neural network baselines have shown varying results."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-154",
"text": "The adaptive recursive neural network, namely AdaRNN (Dong et al., 2014) , adaptively selects composition functions based on the input data and thus performs better than a standard recursive neural network model (Recursive NN (Dong et al., 2014) )."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-155",
"text": "TD-LSTM and TC-LSTM from Tang et al. (2016a) model left-target-right contexts using two LSTM neural networks and by doing so incorporate target-dependent information."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-156",
"text": "TD-LSTM uses two LSTM neural networks for modeling the left and right contexts respectively."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-157",
"text": "TC-LSTM differs from (and outperforms) TD-LSTM in that it concatenates target word vectors with embedding vectors of each context word."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-158",
"text": "We also test the Gated recurrent neural network models proposed by Zhang et al. (2016) on the same dataset."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-159",
"text": "The gated models include: GRNN, that includes gates in its recurrent hidden layers, G3 that connects left-right context using a gated NN structure, and a combination of the two -GRNN+G3."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-160",
"text": "Results show that these gated neural network models do not achieve state-of-the-art performance."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-161",
"text": "When we compare our target-dependent model TDParse+, which incorporates target-dependent features from syntactic parses, against the target-dependent models proposed by Vo and Zhang (2015), namely Target-dep, which combines full-tweet (pooled) word embedding features with features extracted from left-target-right contexts, and Target-dep+, which adds target-dependent sentiment features on top of Target-dep, we see that our method beats both of these, without using full-tweet features [9]."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-162",
"text": "TDParse+ also outperforms the state-of-the-art TC-LSTM."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-163",
"text": "When considering the 'same-target-multi-appearance' scenario, our best model, TDParse+, improves its performance further (shown as TDParse+ (m) in Table 2)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-164",
"text": "Even though TDParse does not use lexica, it shows competitive results to Target-dep+, which uses lexicon-filtered sentiment features. [Footnote 9: Note that the results reported in Vo and Zhang (2015) (71.1 in accuracy and 69.9 in F1) could not be reproduced by running their code with very fine parameter tuning, as suggested by the authors.] [Table 2: Performance comparison on the benchmarking data (Dong et al., 2014)]"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-165",
"text": "TDParse-, which uses features exclusively from syntactic parses, performs significantly worse than Target-ind, which uses only full-tweet features; however, when combined with features from the left-target-right contexts, it achieves better results than the equivalent Target-dep and Target-dep+."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-166",
"text": "This indicates that syntactic target information derived from parses complements the left-target-right context representation well."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-167",
"text": "Clausal segmentation of tweets or sentences can provide a simple approximation to parse-tree based models (Li et al., 2015) ."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-168",
"text": "In Table 2 we can see that our naive tweet segmentation models Naive-seg and Naive-seg+ also achieve competitive performance, suggesting to some extent that such a simple parse-tree approximation preserves the semantic structure of the text, and that useful target-specific information can be drawn from each segment or clause rather than from the entire tweet. [...] and applying our models described in Section 4.1."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-169",
"text": "We compare the results with our other baseline models developed in Section 4.1, including the tweet-level model Semeval-best and the clausal-segmentation models that provide a simple parse-tree approximation, as well as the state-of-the-art target-dependent models of Vo and Zhang (2015) and Zhang et al. (2016)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-170",
"text": "The experimental setup is the same as described in Section 4.2 [10]."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-171",
"text": "Data set: Our election data has a training/testing ratio of 3.70, containing 3,210 training tweets with 9,912 target entities and 867 testing tweets with 2,675 target entities."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-172",
"text": "Models: In order to limit our use of external resources we do not include Naive-seg+ and TDParse+ for evaluation as they both use lexica for feature generation."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-173",
"text": "Since most of our tweets here contain N > 1 targets and the target-independent classifiers produce a single output per tweet, we evaluate that output N times against the ground-truth labels, to make the different models comparable."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-174",
"text": "Results: Overall the models perform much worse than on the single-target benchmarking corpus, especially in 2-class F1 score, indicating the challenge of multi-target-specific sentiment recognition."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-175",
"text": "As seen in Table 3 though the feature-rich tweet-level model Semeval-best gives a reasonably strong baseline performance (same as in Table 2 ), both it and Target-ind perform worse than the target-dependent baseline models Target-dep/Target-dep+ (Vo and Zhang, 2015) , indicating the need to capture and utilise target-dependent signals in the sentiment classification model."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-176",
"text": "The Gated neural network models -G3/GRNN/GRNN+G3 (Zhang et al., 2016 ) also perform worse than Target-dep+ while the combined model -GRNN+G3 fails to boost performance, presumably due to the small corpus size."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-177",
"text": "Our final model TDParse achieves the best performance especially in 3-class F 1 and 2-class F 1 scores in comparison with other target-dependent and target-independent models."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-178",
"text": "This indicates that our proposed models can provide better and more balanced performance between precision and recall."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-179",
"text": "It also shows the target-dependent syntactic information acquired from parse-trees is beneficial to determine the target's sentiment particularly when used in conjunction with the left- Table 4 : Performance analysis in S1, S2 and S3 target-right contexts originally proposed by Vo and Zhang (2015) and in a scenario of multiple targets per tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-180",
"text": "Our clausal-segmentation baseline -Naive-seg models approximate such parse-trees by identifying segments of the tweet relevant to the target, and as a result Naive-seg achieves competitive performance compared to other baselines."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-182",
"text": "**STATE-OF-THE-ART TWEET LEVEL SENTIMENT VS TARGET-SPECIFIC SENTIMENT IN A MULTI-TARGET SETTING**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-183",
"text": "To fully compare our multi-target-specific models against other target-dependent and targetindependent baseline methods, we conduct an additional experiment by dividing our election data test set into three disjoint subsets, on the basis of number of distinct target sentiment values per tweet: (S1) contains tweets having only one target sentiment, where the sentiment towards each target is the same; (S2) and (S3) contain two and three different types of targeted sentiment respec-tively (i.e. in S3, positive, neutral and negative sentiment are all expressed in each tweet)."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-184",
"text": "As described in Section 3.2, there are 2,051, 1,753 and 273 tweets in S1, S2 and S3 respectively."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-185",
"text": "Table 4 shows results achieved by the tweetlevel target-independent model -Semeval-best, the state-of-the-art target-dependent baseline model -Target-dep+, and our proposed final model -TDParse, in each of the three subsets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-186",
"text": "We observe Semeval-best performs the best in S1 compared to the two other models but its performance gets worse when different types of target sentiment are mentioned in the tweet."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-187",
"text": "It has the worst performance in S2 and S3, which again emphasises the need for multi-target-specific sentiment classification."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-188",
"text": "Finally, our proposed final model TDParse achieves better performance than Target-dep+ consistently over all subsets indicating its effectiveness even in the most difficult scenario S3."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-189",
"text": "----------------------------------"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-190",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-191",
"text": "In this work we introduce the challenging task of multi-target-specific sentiment classification for tweets."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-192",
"text": "To help the study we have generated a multi-target Twitter corpus on UK elections which will be made publicly available."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-193",
"text": "We develop a state-of-the-art approach which utilises the syntactic information from parse-tree in conjunction with the left-right context of the target."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-194",
"text": "Our method outperforms previous approaches on a benchmarking single-target corpus as well as our new multi-target election data."
},
{
"sent_id": "c4e2a9322471fb5988a5bd737fa51e-C001-195",
"text": "Future work could investigate sentiment connections among all targets appearing in the same tweet as a multi-target learning task, as well as a hybrid approach that applies either Semeval-best or TDParse depending on the number of targets detected in the tweet."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"c4e2a9322471fb5988a5bd737fa51e-C001-24"
],
[
"c4e2a9322471fb5988a5bd737fa51e-C001-114"
],
[
"c4e2a9322471fb5988a5bd737fa51e-C001-148"
]
],
"cite_sentences": [
"c4e2a9322471fb5988a5bd737fa51e-C001-24",
"c4e2a9322471fb5988a5bd737fa51e-C001-114",
"c4e2a9322471fb5988a5bd737fa51e-C001-148"
]
},
"@BACK@": {
"gold_contexts": [
[
"c4e2a9322471fb5988a5bd737fa51e-C001-36"
],
[
"c4e2a9322471fb5988a5bd737fa51e-C001-39",
"c4e2a9322471fb5988a5bd737fa51e-C001-40"
],
[
"c4e2a9322471fb5988a5bd737fa51e-C001-150",
"c4e2a9322471fb5988a5bd737fa51e-C001-151"
],
[
"c4e2a9322471fb5988a5bd737fa51e-C001-154"
]
],
"cite_sentences": [
"c4e2a9322471fb5988a5bd737fa51e-C001-36",
"c4e2a9322471fb5988a5bd737fa51e-C001-40",
"c4e2a9322471fb5988a5bd737fa51e-C001-151",
"c4e2a9322471fb5988a5bd737fa51e-C001-154"
]
},
"@DIF@": {
"gold_contexts": [
[
"c4e2a9322471fb5988a5bd737fa51e-C001-89",
"c4e2a9322471fb5988a5bd737fa51e-C001-90"
]
],
"cite_sentences": [
"c4e2a9322471fb5988a5bd737fa51e-C001-90"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"c4e2a9322471fb5988a5bd737fa51e-C001-119"
]
],
"cite_sentences": [
"c4e2a9322471fb5988a5bd737fa51e-C001-119"
]
}
}
},
"ABC_e831e058f208542af16c1ea236d2c9_10": {
"x": [
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-2",
"text": "This paper describes our effort on the task of edited region identification for parsing disfluent sentences in the Switchboard corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-3",
"text": "We focus our attention on exploring feature spaces and selecting good features and start with analyzing the distributions of the edited regions and their components in the targeted corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-4",
"text": "We explore new feature spaces of a partof-speech (POS) hierarchy and relaxed for rough copy in the experiments."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-5",
"text": "These steps result in an improvement of 43.98% percent relative error reduction in F-score over an earlier best result in edited detection when punctuation is included in both training and testing data [Charniak and Johnson 2001] , and 20.44% percent relative error reduction in F-score over the latest best result where punctuation is excluded from the training and testing data [Johnson and Charniak 2004] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-8",
"text": "Repairs, hesitations, and restarts are common in spoken language, and understanding spoken language requires accurate methods for identifying such disfluent phenomena."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-9",
"text": "Processing speech repairs properly poses a challenge to spoken dialog systems."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-10",
"text": "Early work in this field is primarily based on small and proprietary corpora, which makes the comparison of the proposed methods difficult [Young and Matessa 1991 , Bear et al. 1992 , Heeman & Allen 1994 ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-11",
"text": "Because of the availability of the Switchboard corpus [Godfrey et al. 1992] and other conversational telephone speech (CTS) corpora, there has been an increasing interest in improving the performance of identifying the edited regions for parsing disfluent sentences [Charniak and Johnson 2001 , Johnson and Charniak 2004 , Liu et al. 2005 ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-12",
"text": "In this paper we describe our effort towards the task of edited region identification with the intention of parsing disfluent sentences in the Switchboard corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-13",
"text": "A clear benefit of having accurate edited regions for parsing has been demonstrated by a concurrent effort on parsing conversational speech [Kahn et al 2005] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-14",
"text": "Since different machine learning methods provide similar performances on many NLP tasks, in this paper, we focus our attention on exploring feature spaces and selecting good features for identifying edited regions."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-15",
"text": "We start by analyzing the distributions of the edited regions and their components in the targeted corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-16",
"text": "We then design several feature spaces to cover the disfluent regions in the training data."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-17",
"text": "In addition, we also explore new feature spaces of a part-of-speech hierarchy and extend candidate pools in the experiments."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-18",
"text": "These steps result in a significant improvement in F-score over the earlier best result reported in [Charniak and Johnson 2001] , where punctuation is included in both the training and testing data of the Switchboard corpus, and a significant error reduction in F-score over the latest best result [Johnson and Charniak 2004] , where punctuation is ignored in both the training and testing data of the Switchboard corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-19",
"text": "In this paper, we follow the definition of [Shriberg 1994 ] and others for speech repairs: A speech repair is divided into three parts: the reparandum, the part that is repaired; the interregnum, the part that can be either empty or fillers; and the repair/repeat, the part that replaces or repeats the reparandum."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-20",
"text": "The definition can also be exemplified via the following utterance: repeat reparanda int erregnum , , this is a big problem."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-23",
"text": "This paper is organized as follows."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-24",
"text": "In section 2, we examine the distributions of the editing regions in Switchboard data."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-25",
"text": "Section 3, then, presents the Boosting method, the baseline system and the feature spaces we want to explore."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-26",
"text": "Section 4 describes, step by step, a set of experiments that lead to a large performance improvement."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-27",
"text": "Section 5 concludes with discussion and future work."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-29",
"text": "**REPAIR DISTRIBUTIONS IN SWITCHBOARD**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-30",
"text": "We start by analyzing the speech repairs in the Switchboard corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-31",
"text": "Switchboard has over one million words, with telephone conversations on prescribed topics [Godfrey et al. 1992] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-32",
"text": "It is full of disfluent utterances, and [Shriberg 1994 , Shriberg 1996 gives a thorough analysis and categorization of them."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-33",
"text": "[Engel et al. 2002 ] also showed detailed distributions of the interregnum, including interjections and parentheticals."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-34",
"text": "Since the majority of the disfluencies involve all the three parts (reparandum, interregnum, and repair/repeat), the distributions of all three parts will be very helpful in constructing patterns that are used to identify edited regions."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-35",
"text": "For the reparandum and repair types, we include their distributions with and without punctuation."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-36",
"text": "We include the distributions with punctuation is to match with the baseline system reported in [Charniak and Johnson 2001] , where punctuation is included to identify the edited regions."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-37",
"text": "Resent research showed that certain punctuation/prosody marks can be produced when speech signals are available [Liu et al. 2003 ]."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-38",
"text": "The interregnum type, by definition, does not include punctuation."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-39",
"text": "The length distributions of the reparanda in the training part of the Switchboard data with and without punctuation are given in Fig. 1 ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-40",
"text": "The reparanda with lengths of less than 7 words make up 95.98% of such edited regions in the training data."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-41",
"text": "When we remove the punctuation marks, those with lengths of less than 6 words reach roughly 96%."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-42",
"text": "Thus, the patterns that consider only reparanda of length 6 or less will have very good coverage."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-43",
"text": "The two repair/repeat part distributions in the training part of the Switchboard are given in Fig. 2 ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-44",
"text": "The repairs/repeats with lengths less than 7 words make 98.86% of such instances in the training data."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-45",
"text": "This gives us an excellent coverage if we use 7 as the threshold for constructing repair/repeat patterns."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-47",
"text": "**LENGTH DISTRIBUTION OF REPARANDA**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-48",
"text": "The length distribution of the interregna of the training part of the Switchboard corpus is shown in Fig. 3 ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-49",
"text": "We see that the overwhelming majority has the length of one, which are mostly words such as \"uh\", \"yeah\", or \"uh-huh\"."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-50",
"text": "In examining the Switchboard data, we noticed that a large number of reparanda and repair/repeat pairs differ on less than two words, i.e. \"as to, you know, when to\" 1 , and the amount of the pairs differing on less than two POS tags is even bigger."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-51",
"text": "There are also cases where some of the pairs have different lengths."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-52",
"text": "These findings provide a good base for our feature space."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-54",
"text": "**FEATURE SPACE SELECTION FOR BOOSTING**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-55",
"text": "We take as our baseline system the work by [Charniak and Johnson 2001] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-56",
"text": "In their approach, rough copy is defined to produce candidates for any potential pairs of reparanda and repairs."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-57",
"text": "A boosting algorithm [Schapire and Singer 1999 ] is used to detect whether a word is edited."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-58",
"text": "A total of 18 variables are used in the algorithm."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-59",
"text": "In the rest of the section, we first briefly introduce the boosting algorithm, then describe the method used in [Charniak and Johnson 2001] , and finally we contrast our improvements with the baseline system."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-61",
"text": "**BOOSTING ALGORITHM**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-62",
"text": "Intuitively, the boosting algorithm is to combine a set of simple learners iteratively based on their classification results on a set of training data."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-63",
"text": "Different parts of the training data are scaled at each iteration so that the parts of the data previous classifiers performed poorly on are weighted higher."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-64",
"text": "The weighting factors of the learners are adjusted accordingly."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-65",
"text": "We re-implement the boosting algorithm reported by [Charniak and Johnson 2001]"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-66",
"text": "where \u03b1 i is the weight to be estimated for feature \u03c6 i ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-67",
"text": "\u03c6 i is a set of variable-value pairs, and each F i has the form of:"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-68",
"text": "with X's being conditioning variables and x's being values."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-69",
"text": "Each component in the production for F i is defined as:"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-71",
"text": "**CHARNIAK-JOHNSON APPROACH**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-72",
"text": "In [Charniak and Johnson 2001] , identifying edited regions is considered as a classification problem, where each word is classified either as edited or normal."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-73",
"text": "The approach takes two steps."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-74",
"text": "The first step is to find rough copy."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-75",
"text": "Then, a number of variables are extracted for the boosting algorithm."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-76",
"text": "In particular, a total of 18 different conditioning variables are used to predict whether the current word is an edited word or a non-edited word."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-77",
"text": "The 18 different variables listed in Table 1 The set of free final words includes all partial words and a small set of conjunctions, adverbs and miscellanea."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-78",
"text": "The set of interregnum strings consists of a small set of expressions such as uh, you know, I guess, I mean, etc."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-80",
"text": "**NEW IMPROVEMENTS**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-81",
"text": "Our improvements to the Charniak-Johnson method can be classified into three categories with the first two corresponding to the twp steps in their method."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-82",
"text": "The three categories of improvements are described in details in the following subsections."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-83",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-84",
"text": "**RELAXING ROUGH COPY**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-85",
"text": "We relax the definition for rough copy, because more than 94% of all edits have both reparandum and repair, while the rough copy defined in [Charniak and Johnson 2001] only covers 77.66% of such instances."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-86",
"text": "Two methods are used to relax the rough copy definition."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-87",
"text": "The first one is to adopt a hierarchical POS tag set: all the Switchboard POS tags are further classified into four major categories: N (noun related), V (verb related), Adj (noun modifiers), Adv (verb modifiers)."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-88",
"text": "Instead of requiring the exact match of two POS tag sequences, we also consider two sequences as a Variables Name Short description X 1 W 0"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-89",
"text": "The current orthographic word."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-90",
"text": "X 2 -X 5 P 0 ,P 1 ,P 2 ,P f Partial word flags for the current position, the next two to the right, and the first one in a sequence of free-final words (partial, conjunctions, etc.) to the right of the current position."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-91",
"text": "The first non-punctuation word to the right of the current position X 18 T i"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-92",
"text": "The tag of the first word right after the interregnum that is right after the current word."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-93",
"text": "The second is to allow one mismatch in the two POS sequences."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-94",
"text": "The mismatches can be an addition, deletion, or substitution."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-95",
"text": "This relaxation improves the coverage from 77.66% to 85.45%."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-96",
"text": "Subsequently, the combination of the two relaxations leads to a significantly higher coverage of 87.70%."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-97",
"text": "Additional relaxation leads to excessive candidates and worse performance in the development set."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-99",
"text": "**ADDING NEW FEATURES**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-100",
"text": "We also include new features in the feature set: one is the shortest distance (the number of words) between the current word and a word of the same orthographic form to the right, if that repeated word exists; another is the words around the current position."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-101",
"text": "Based on the distributional analysis in section 2, we also increase the window sizes for POS tags ( 5 5 ,..., T T \u2212 ) and words"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-103",
"text": "**POST PROCESSING STEP**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-104",
"text": "In addition to the two categories, we try to use contextual patterns to address the independency of variables in the features."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-105",
"text": "The patterns have been extracted from development and training data, to deal with certain sequence-related errors, e.g., E N E E E E, which means that if the neighbors on both sides of a word are classified into EDITED, it should be classified into EDITED as well."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-106",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-107",
"text": "**EXPERIMENTAL RESULTS**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-108",
"text": "We conducted a number of experiments to test the effectiveness of our feature space exploration."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-109",
"text": "Since the original code from [Charniak and Johnson 2001] is not available, we conducted our first experiment to replicate the result of their baseline system described in section 3."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-110",
"text": "We used the exactly same training and testing data from the Switchboard corpus as in [Charniak and Johnson 2001] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-111",
"text": "The training subset consists of all files in the sections 2 and 3 of the Switchboard corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-112",
"text": "Section 4 is split into three approximately equal size subsets."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-113",
"text": "The first of the three, i.e., files sw4004.mrg to sw4153.mrg, is the testing corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-114",
"text": "The files sw4519.mrg to sw4936.mrg are the development corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-115",
"text": "The rest files are reserved for other purposes."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-116",
"text": "When punctuation is included in both training and testing, the re-established baseline has the precision, recall, and F-score of 94.73%, 68.71% and 79.65%, respectively."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-117",
"text": "These results are comparable with the results from [Charniak & Johnson 2001] , i.e., 95.2%, 67.8%, and 79.2% for precision, recall, and f-score, correspondingly."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-118",
"text": "In the subsequent experiments, the set of additional feature spaces described in section 3 are added, step-by-step."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-119",
"text": "The first addition includes the shortest distance to the same word and window size increases."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-120",
"text": "This step gives a 2.27% improvement on F-score over the baseline."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-121",
"text": "The next addition is the introduction of the POS hierarchy in finding rough copies."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-122",
"text": "This also gives more than 3% absolute improvement over the baseline and 1.19% over the expanded feature set model."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-123",
"text": "The addition of the feature spaces of relaxed matches for words, POS tags, and POS hierarchy tags all give additive improvements, which leads to an overall of 8.95% absolute improvement over the re-implemented baseline, or 43.98% relative error reduction on F-score."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-124",
"text": "When compared with the latest results from [Johnson and Charniak 2004] , where no punctuations are used for either training or testing data, we also observe the same trend of the improved results."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-125",
"text": "Our best result gives 4.15% absolute improvement over their best result, or 20.44% relative error reduction in f-scores."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-126",
"text": "As a sanity check, when evaluated on the training data as a cheating experiment, we show a remarkable consistency with the results for testing data."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-127",
"text": "For error analysis, we randomly selected 100 sentences with 1673 words total from the test sentences that have at least one mistake."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-128",
"text": "Errors can be divided into two types, miss (should be edited) and false alarm (should be noraml)."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-129",
"text": "Among the 207 misses, about 70% of them require some phrase level analysis or acoustic cues for phrases."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-130",
"text": "For example, one miss is \"because of the friends because of many other things\", an error we would have a much better chance of correct identification, if we were able to identify prepositional phrases reliably."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-131",
"text": "Another example is \"most of all my family\"."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-132",
"text": "Since it is grammatical by itself, certain prosodic information in between \"most of\" and \"all my family\" may help the identification. reported that interruption point could help parsers to improve results."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-133",
"text": "[Kahn et al. 2005] also showed that prosody information could help parse disfluent sentences."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-134",
"text": "The second major class of the misses is certain short words that are not labeled consistently in the corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-135",
"text": "For example, \"so\", \"and\", and \"or\", when they occur in the beginning of a sentence, are sometimes labeled as edited, and sometimes just as normal."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-136",
"text": "The last category of the misses, about 5.3%, contains the ones where the distances between reparanda and repairs are often more than 10 words."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-137",
"text": "Among the 95 false alarms, more than three quarters of misclassified ones are related to certain grammatical constructions."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-138",
"text": "Examples include cases like, \"the more \u2026 the more\" and \"I think I should \u2026\"."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-139",
"text": "These cases may be fixable if more elaborated grammar-based features are used."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-140",
"text": "----------------------------------"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-141",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-142",
"text": "This paper reports our work on identifying edited regions in the Switchboard corpus."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-143",
"text": "In addition to a distributional analysis for the edited regions, a number of feature spaces have been explored and tested to show their effectiveness."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-144",
"text": "We observed a 43.98% relative error reduction on F-scores for the baseline with punctuation in both training and testing [Charniak and Johnson 2001] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-145",
"text": "Compared with the reported best result, the same approach produced a 20.44% of relative error reduction on F-scores when punctuation is ignored in training and testing data [Johnson and Charniak 2004] ."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-146",
"text": "The inclusion of both hierarchical POS tags and the relaxation for rough copy definition gives large additive improvements, and their combination has contributed to nearly half of the gain for the test set with punctuation and about 60% of the gain for the data without punctuation."
},
{
"sent_id": "e831e058f208542af16c1ea236d2c9-C001-147",
"text": "Future research would include the use of other features, such as prosody, and the integration of the edited region identification with parsing."
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"e831e058f208542af16c1ea236d2c9-C001-5"
],
[
"e831e058f208542af16c1ea236d2c9-C001-18"
],
[
"e831e058f208542af16c1ea236d2c9-C001-85"
]
],
"cite_sentences": [
"e831e058f208542af16c1ea236d2c9-C001-5",
"e831e058f208542af16c1ea236d2c9-C001-18",
"e831e058f208542af16c1ea236d2c9-C001-85"
]
},
"@BACK@": {
"gold_contexts": [
[
"e831e058f208542af16c1ea236d2c9-C001-11"
],
[
"e831e058f208542af16c1ea236d2c9-C001-72"
]
],
"cite_sentences": [
"e831e058f208542af16c1ea236d2c9-C001-11",
"e831e058f208542af16c1ea236d2c9-C001-72"
]
},
"@SIM@": {
"gold_contexts": [
[
"e831e058f208542af16c1ea236d2c9-C001-36"
],
[
"e831e058f208542af16c1ea236d2c9-C001-110"
],
[
"e831e058f208542af16c1ea236d2c9-C001-117"
]
],
"cite_sentences": [
"e831e058f208542af16c1ea236d2c9-C001-36",
"e831e058f208542af16c1ea236d2c9-C001-110",
"e831e058f208542af16c1ea236d2c9-C001-117"
]
},
"@MOT@": {
"gold_contexts": [
[
"e831e058f208542af16c1ea236d2c9-C001-36"
],
[
"e831e058f208542af16c1ea236d2c9-C001-109"
]
],
"cite_sentences": [
"e831e058f208542af16c1ea236d2c9-C001-36",
"e831e058f208542af16c1ea236d2c9-C001-109"
]
},
"@USE@": {
"gold_contexts": [
[
"e831e058f208542af16c1ea236d2c9-C001-55"
],
[
"e831e058f208542af16c1ea236d2c9-C001-65"
],
[
"e831e058f208542af16c1ea236d2c9-C001-110"
]
],
"cite_sentences": [
"e831e058f208542af16c1ea236d2c9-C001-55",
"e831e058f208542af16c1ea236d2c9-C001-65",
"e831e058f208542af16c1ea236d2c9-C001-110"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"e831e058f208542af16c1ea236d2c9-C001-59"
],
[
"e831e058f208542af16c1ea236d2c9-C001-144"
]
],
"cite_sentences": [
"e831e058f208542af16c1ea236d2c9-C001-59",
"e831e058f208542af16c1ea236d2c9-C001-144"
]
}
}
},
"ABC_b1c9b8e24916b136948610383f8ea2_10": {
"x": [
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-261",
"text": "There are also many differences between MNMT and domain adaptation for NMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-2",
"text": "We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in the recent years."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-3",
"text": "MNMT has been useful in improving translation quality as a result of knowledge transfer."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-4",
"text": "MNMT is more promising and interesting than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-5",
"text": "Many approaches have been proposed in order to exploit multilingual parallel corpora for improving translation quality."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-6",
"text": "However, the lack of a comprehensive survey makes it difficult to determine which approaches are promising and hence deserve further exploration."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-7",
"text": "In this paper, we present an in-depth survey of existing literature on MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-8",
"text": "We categorize various approaches based on the resource scenarios as well as underlying modeling principles."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-9",
"text": "We hope this paper will serve as a starting point for researchers and engineers interested in MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-10",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-11",
"text": "**INTRODUCTION**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-12",
"text": "Neural machine translation (NMT) (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) has become the dominant paradigm for MT in academic research as well as commercial use (Wu et al., 2016) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-13",
"text": "NMT has shown state-of-the-art performance for many language pairs (Bojar et al., 2017 (Bojar et al., , 2018 ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-14",
"text": "Its success can be mainly attributed to the use of distributed representations of language, enabling end-to-end training of an MT system."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-15",
"text": "Unlike statistical machine translation (SMT) systems (Koehn et al., 2007) , separate lossy components like word aligners, translation rule extractors and other feature extractors are not required."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-16",
"text": "The dominant NMT approach is the Embed -Encode -Attend -Decode paradigm."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-17",
"text": "Recurrent neu- * equal contribution ral network (RNN) (Bahdanau et al., 2015) , convolutional neural network (CNN) (Gehring et al., 2017) and self-attention (Vaswani et al., 2017) architectures are popular approaches based on this paradigm."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-18",
"text": "For a more detailed exposition of NMT, we refer readers to some prominent tutorials (Neubig, 2017; Koehn, 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-41",
"text": "Our goal is to shed light on various MNMT scenarios, fundamental questions in MNMT, basic principles, architectures, and datasets of MNMT systems."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-19",
"text": "While initial research on NMT started with building translation systems between two languages, researchers discovered that the NMT framework can naturally incorporate multiple languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-20",
"text": "Hence, there has been a massive increase in work on MT systems that involve more than two languages (Dong et al., 2015; Firat et al., 2016a; Cheng et al., 2017; Johnson et al., 2017; Chen et al., 2017 Neubig and Hu, 2018) etc."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-21",
"text": "We refer to NMT systems handling translation between more than one language pair as multilingual NMT (MNMT) systems."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-22",
"text": "The ultimate goal MNMT research is to develop one model for translation between all possible languages by effective use of available linguistic resources."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-23",
"text": "MNMT systems are desirable because training models with data from many language pairs might help acquire knowledge from multiple sources ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-24",
"text": "Moreover, MNMT systems tend to generalize better due to exposure to diverse languages, leading to improved translation quality."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-25",
"text": "This particular phenomenon is known as knowledge transfer (Pan and Yang, 2010) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-26",
"text": "Knowledge transfer has been strongly observed for translation between low-resource languages, which have scarce parallel corpora or other linguistic resources but have benefited from data in other languages ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-27",
"text": "In addition, MNMT systems will be compact, because a single model handles translations for multiple languages (Johnson et al., 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-28",
"text": "This can reduce the deployment footprint, which is crucial for con- There are multiple MNMT scenarios based on available resources and studies have been conducted for the following scenarios ( Figure 1 1 ) : Multiway Translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-29",
"text": "The goal is constructing a single NMT system for one-to-many (Dong et al., 2015) , many-to-one (Lee et al., 2017) or many-tomany (Firat et al., 2016a ) translation using parallel corpora for more than one language pair."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-30",
"text": "Low or Zero-Resource Translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-31",
"text": "For most of the language pairs in the world, there are small or no parallel corpora, and three main directions have been studied for this scenario."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-32",
"text": "Transfer learning: Transferring translation knowledge from a high-resource language pair to improve the translation of a low-resource language pair ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-33",
"text": "Pivot translation: Using a high-resource language (usually English) as a pivot to translate between a language pair (Firat et al., 2016a) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-34",
"text": "Zeroshot translation: Translating between language pairs without parallel corpora (Johnson et al., 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-35",
"text": "Multi-Source Translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-36",
"text": "Documents that have been translated into more than one language might, in the future, be required to be translated 1 Please see the supplementary material for papers related to each category."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-37",
"text": "into another language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-38",
"text": "In this scenario, existing multilingual redundancy in the source side can be exploited for multi-source translation ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-39",
"text": "Given these benefits, scenarios and the tremendous increase in the work on MNMT in recent years, we undertake this survey paper on MNMT to systematically organize the work in this area."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-40",
"text": "To the best of our knowledge, no such comprehensive survey on MNMT exists."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-42",
"text": "The remainder of this paper is structured as follows: We present a systematic categorization of different approaches to MNMT in each of the above mentioned scenarios to help understand the array of design choices available while building MNMT systems (Sections 2, 3, and 4)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-43",
"text": "We put the work in MNMT into a historical perspective with respect to multilingual MT in older MT paradigms (Section 5)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-44",
"text": "We also describe popular multilingual datasets and the shared tasks that focus on multilingualism (Section 6)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-45",
"text": "In addition, we compare MNMT with domain adaptation for NMT, which tackles the problem of improving low-resource in-domain translation (Section 7)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-46",
"text": "Finally, we share our opinions on future research directions in MNMT (Section 8) and conclude this paper (Section 9)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-48",
"text": "**MULTIWAY NMT**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-49",
"text": "The goal is learning a single model for l language pairs (s i , t i ) \u2208 L (i = 1 to l), where L \u2282 S \u00d7 T , and S, T are sets of source and target languages respectively."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-50",
"text": "S and T need not be mutually exclusive."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-51",
"text": "Parallel corpora are available for these l language pairs."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-52",
"text": "One-many, many-one and manymany NMT models have been explored in this framework."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-53",
"text": "Multiway translation systems follow the standard paradigm in popular NMT systems."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-54",
"text": "However, this architecture is adapted to support multiple languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-55",
"text": "The wide ranges of possible architectural choices is exemplified by two highly contrasting prototypical approaches."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-57",
"text": "**PROTOTYPICAL APPROACHES**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-58",
"text": "Complete Sharing."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-59",
"text": "Johnson et al. (2017) proposed a highly compact model where all languages share the same embeddings, encoder, decoder, and attention mechanism."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-60",
"text": "A common vocabulary, typically subword-level like byte pair encoding (BPE) (Sennrich et al., 2016b) , is defined across all languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-61",
"text": "The input sequence includes a special token (called the language tag) to indicate the target language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-62",
"text": "This enables the decoder to correctly generate the target language, though all target languages share the same decoder parameters."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-63",
"text": "The model has minimal parameter size as all languages share the same parameters; and achieves comparable/better results w.r.t."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-64",
"text": "bilingual systems."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-65",
"text": "But, a massively multilingual system can run into capacity bottlenecks (Aharoni et al., 2019) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-66",
"text": "This is a black-box model, which can use an off-theshelf NMT system to train a multilingual system."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-67",
"text": "Ha et al. (2016) proposed a similar model, but they maintained different vocabularies for each language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-68",
"text": "This architecture is particularly useful for related languages, because they have high degree of lexical and syntactic similarity ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-69",
"text": "Lexical similarity can be further utilized by (a) representing all languages in a common script using script conversion Lee et al., 2017) or transliteration (Nakov and Ng (2009) for multilingual SMT), (b) using a common subword-vocabulary across all languages e.g. character (Lee et al., 2017) and BPE (Nguyen and Chiang, 2017) , (c) representing words by both character encoding and a latent embedding space shared by all languages (Wang et al., 2019) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-70",
"text": "Pinnis et al. (2018) and Lakew et al. (2018a) have compared RNN, CNN and the self-attention based architectures for MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-71",
"text": "They show that self-attention based architectures outperform the other architectures in many cases."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-72",
"text": "Minimal Sharing."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-73",
"text": "On the other hand, Firat et al. (2016a) proposed a model comprised of separate embeddings, encoders and decoders for each language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-74",
"text": "By sharing attention across languages, they show improvements over bilingual models."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-75",
"text": "However, this model has a large number of parameters."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-76",
"text": "Nevertheless, the number of parameters only grows linearly with the number of languages, while it grows quadratically for bilingual systems spanning all the language pairs in the multiway system."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-78",
"text": "**CONTROLLING PARAMETER SHARING**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-79",
"text": "In between the extremities of parameter sharing exemplified by the above mentioned models, lies an array of choices."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-80",
"text": "The degree of parameter sharing depends on the divergence between the languages involved and can be controlled at various layers of the MNMT system."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-81",
"text": "Sharing encoders among multiple languages is very effective and is widely used (Lee et al., 2017; ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-82",
"text": "Blackwood et al. (2018) explored target language, source language and pair specific attention parameters."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-83",
"text": "They showed that target language specific attention performs better than other attention sharing configurations."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-84",
"text": "For self-attention based NMT models, explored various parameter sharing strategies."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-85",
"text": "They showed that sharing the decoder self-attention and encoder-decoder inter-attention parameters is useful for linguistically dissimilar languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-86",
"text": "Zaremoodi et al. (2018) further proposed a routing network to dynamically control parameter sharing learned from the data."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-87",
"text": "Designing the right sharing strategy is important to maintaining a balance between model compactness and translation accuracy."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-88",
"text": "Dynamic Parameter or Representation Generation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-89",
"text": "Instead of defining the parameter sharing protocol a priori, Platanios et al. (2018) learned the degree of parameter sharing from the data."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-90",
"text": "This is achieved by defining the language specific model parameters as a function of global parameters and language embeddings."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-91",
"text": "This approach also reduces the number of language specific parameters (only language embeddings), while still allowing each language to have its own unique parameters for different network layers."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-188",
"text": "Most studies assume that the same sentence is available in multiple languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-92",
"text": "In fact, the number of parameters is only a small multiple of the compact model (the multiplication factor accounts for the language embedding size) (Johnson et al., 2017) , but the language embeddings can directly impact the model parameters instead of the weak influence that language tags have."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-93",
"text": "Universal Encoder Representation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-94",
"text": "Ideally, multiway systems should generate encoder representations that are language agnostic."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-95",
"text": "However, the attention mechanism sees a variable number of encoder representations depending on the sentence length (this could vary for translations of the same sentence)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-96",
"text": "To overcome this, an attention bridge network generates a fixed number of contextual representations that are input to the attention network (Lu et al., 2018; V\u00e1zquez et al., 2018) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-97",
"text": "Murthy et al. (2018) pointed out that the contextualized embeddings are word order dependent, hence not language agnostic."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-98",
"text": "Multiple Target Languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-99",
"text": "This is a challenging scenario because parameter sharing has to be balanced with the capability to generate sentences in each target language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-100",
"text": "Blackwood et al. (2018) added the language tag to the beginning as well as end of sequence to avoid its attenuation in a leftto-right encoder."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-101",
"text": "explored multiple methods for supporting target languages: (a) target language tag at beginning of the decoder, (b) target language dependent positional embeddings, and (c) divide hidden units of each decoder layer into shared and language-dependent ones."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-102",
"text": "Each of these methods provide gains over Johnson et al. (2017) , and combining all gave the best results."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-103",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-104",
"text": "**TRAINING PROTOCOLS**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-105",
"text": "Joint Training."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-106",
"text": "All the available languages pairs are trained jointly to minimize the mean negative log-likelihood for each language pair."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-107",
"text": "As some language pairs would have more data than other languages, the model may be biased."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-108",
"text": "To avoid this, sentence pairs from different language pairs are sampled to maintain a healthy balance."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-109",
"text": "Mini-batches can be comprised of a mix of samples from different language pairs (Johnson et al., 2017) or the training schedule can cycle through mini-batches consisting of a language pair only (Firat et al., 2016a) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-110",
"text": "For architectures with language specific layers, the latter approach is convenient to implement."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-111",
"text": "Knowledge Distillation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-112",
"text": "In this approach suggested by Tan et al. (2019) , bilingual models are first trained for all language pairs involved."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-113",
"text": "These bilingual models are used as teacher models to train a single student model for all language pairs."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-114",
"text": "The student model is trained using a linear interpolation of the standard likelihood loss as well as distillation loss that captures the distance between the output distributions of the student and teacher models."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-115",
"text": "The distillation loss is applied for a language pair only if the teacher model shows better translation accuracy than the student model on the validation set."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-116",
"text": "This approach shows better results than joint training of a black-box model, but training time increases significantly because bilingual models also have to be trained."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-117",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-118",
"text": "**LOW OR ZERO-RESOURCE MNMT**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-119",
"text": "An important motivation for MNMT is to improve or support translation for language pairs with scarce or no parallel corpora, by utilizing training data from high-resource language pairs."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-120",
"text": "In this section, we will discuss the MNMT approaches that specifically address the low or zeroresource scenario."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-122",
"text": "**TRANSFER LEARNING**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-123",
"text": "Transfer learning (Pan and Yang, 2010) has been widely explored to address low-resource translation, where knowledge learned from a highresource language pair is used to improve the NMT performance on a low-resource pair."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-124",
"text": "Training."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-125",
"text": "Most studies have explored the following setting: the high-resource and low-resource language pairs share the same target language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-126",
"text": "first showed that transfer learning can benefit low-resource language pairs."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-127",
"text": "First, they trained a parent model on a high-resource language pair."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-128",
"text": "The child model is initialized with the parent's parameters wherever possible and trained on the small parallel corpus for the low-resource pair."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-129",
"text": "This process is known as fine-tuning."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-130",
"text": "They also studied the effect of fine-tuning only a subset of the child model's parameters (source and target embeddings, RNN layers and attention)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-131",
"text": "The initialization has a strong regularization effect in training the child model."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-132",
"text": "Gu et al. (2018b) used the model agnostic meta learning (MAML) framework (Finn et al., 2017) to learn appropriate parameter initialization from the parent pair(s) by taking the child pair into consideration."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-133",
"text": "Instead of fine-tuning, both language pairs can also be jointly trained (Gu et al., 2018a) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-134",
"text": "Language Relatedness. and Dabre et al. (2017b) have empirically shown that language relatedness between the parent and child source languages has a big impact on the possible gains from transfer learning."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-135",
"text": "Kocmi and Bojar (2018) showed that transfer learning improves low-resource language translation, even when neither the source nor the target languages are shared between the resource-rich and poor language pairs."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-136",
"text": "Further investigation is needed to understand the gains in translation quality in this scenario."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-137",
"text": "Neubig and Hu (2018) used language relatedness to prevent overfitting when rapidly adapting pre-trained MNMT model for low-resource scenarios."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-138",
"text": "Chaudhary et al. (2019) used this approach to translate 1,095 languages to English."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-139",
"text": "Lexical Transfer. randomly initialized the word embeddings of the child source language, because those could not be transferred from the parent."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-140",
"text": "Gu et al. (2018a) improved on this simple initialization by mapping pre-trained monolingual embeddings of the parent and child sources to a common vector space."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-141",
"text": "On the other hand, Nguyen and Chiang (2017) utilized the lexical similarity between related source languages using a small subword vocabulary."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-142",
"text": "Lakew et al. (2018b) dynamically updated the vocabulary of the parent model with the low-resource language pair before transferring parameters."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-143",
"text": "Syntactic Transfer."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-144",
"text": "Gu et al. (2018a) proposed to encourage better transfer of contextual representations from parents using a mixture of language experts network."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-145",
"text": "Murthy et al. (2018) showed that reducing the word order divergence between source languages via pre-ordering is beneficial in extremely low-resource scenarios."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-147",
"text": "**PIVOTING**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-148",
"text": "Zero-resource NMT was first explored by Firat et al. (2016a) , where a multiway NMT model was used to translate from Spanish to French using English as a pivot language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-149",
"text": "This pivoting was done either at run time or during pre-training."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-150",
"text": "Run-Time Pivoting."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-151",
"text": "Firat et al. (2016a) involved a pipeline through paths in the multiway model, which first translates from French to English and then from English to Spanish."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-152",
"text": "They also experimented with using the intermediate English translation as an additional source for the second stage."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-153",
"text": "Pivoting during Pre-Training."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-154",
"text": "Firat et al. (2016b) used the MNMT model to first translate the Spanish side of the training corpus to English which in turn is translated into French."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-155",
"text": "This gives a pseudo-parallel French-Spanish corpus where the source is synthetic and the target is original."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-156",
"text": "The MNMT model is fine tuned on this synthetic data and this enables direct French to Spanish translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-157",
"text": "Firat et al. (2016b) also showed that a small clean parallel corpus between French and Spanish can be used for fine tuning and can have the same effect as a pseduo-parallel corpus which is two orders of magnitude larger."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-158",
"text": "Pivoting models can be improved if they are jointly trained as shown by Cheng et al. (2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-159",
"text": "Joint training was achieved by either forcing the pivot language's embeddings to be similar or maximizing the likelihood of the cascaded model on a small sourcetarget parallel corpus."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-160",
"text": "Chen et al. (2017) proposed teacher-student learning for pivoting where they first trained a pivot-target NMT model and used it as a teacher to guide the behaviour of a sourcetarget NMT model."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-161",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-162",
"text": "**ZERO-SHOT**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-163",
"text": "The approaches proposed so far involve pivoting or synthetic corpus generation, which is a slow process due to its two-step nature."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-164",
"text": "It is more interesting, and challenging, to enable translation between a zero-resource pair without explicitly involving a pivot language during decoding or for generating pseudo-parallel corpora."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-165",
"text": "This scenario is known as zero-shot NMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-166",
"text": "Zero-shot NMT also requires a pivot language but it is only used during training without the need to generate pseudoparallel corpora."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-167",
"text": "Training."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-168",
"text": "Zero-shot NMT was first demonstrated by Johnson et al. (2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-169",
"text": "However, this zero-shot translation method is inferior to pivoting."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-170",
"text": "They showed that the context vectors (from attention) for unseen language pairs differ from the seen language pairs, possibly explaining the degradation in translation quality."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-171",
"text": "Lakew et al. (2017) tried to overcome this limitation by augmenting the training data with the pseudo-parallel unseen pairs generated by iterative application of the same zeroshot translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-172",
"text": "Arivazhagan et al. (2018) included explicit language invariance losses in the optimization function to encourage parallel sentences to have the same representation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-173",
"text": "Reinforcement learning for zero-shot learning was explored by Sestorain et al. (2018) where the dual learning framework was combined with rewards from language models."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-174",
"text": "Corpus Size."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-175",
"text": "Work on translation for Indian languages showed that zero-shot works well only when the training corpora are extremely large (Mattoni et al., 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-176",
"text": "As the corpora for most Indian languages contain fewer than 100k sentences, the zero-shot approach is rather infeasible despite linguistic similarity."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-177",
"text": "Lakew et al. (2017) confirmed this in the case of European languages where small training corpora were used."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-178",
"text": "Mattoni et al. (2017) also showed that zero-shot translation works well only when the training corpora are large, while Aharoni et al. (2019) show that massively multilingual models are beneficial for zeroshot translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-179",
"text": "Language Control."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-180",
"text": "Zero-shot NMT tends to translate into the wrong language at times and Ha et al. (2017) proposed to filter the output of the softmax so as to force the model to translate into the desired language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-181",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-182",
"text": "**MULTI-SOURCE NMT**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-183",
"text": "If the same source sentence is available in multiple languages then these sentences can be used together to improve the translation into the target language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-184",
"text": "This technique is known as multisource MT (Och and Ney, 2001 )."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-185",
"text": "Approaches for multi-source NMT can be extremely useful for creating N-lingual (N > 3) corpora such as Europarl (Koehn, 2005) and UN (Ziemski et al., 2016b) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-186",
"text": "The underlying principle is to leverage redundancy in terms of source side linguistic phenomena expressed in multiple languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-187",
"text": "Multi-Source Available."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-189",
"text": "showed that a multi-source NMT model using separate encoders and attention networks for each source language outperforms single source models."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-190",
"text": "A simpler approach concatenated multiple source sentences and fed them to a standard NMT model Dabre et al. (2017a) , with performance comparable to ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-191",
"text": "Interestingly, this model could automatically identify the boundaries between different source languages and simplify the training process for multi-source NMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-192",
"text": "Dabre et al. (2017a) also showed that it is better to use linguistically similar source languages, especially in low-resource scenarios."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-193",
"text": "Ensembling of individual source-target models is another beneficial approach, for which Garmash and Monz (2016) proposed several methods with different degrees of parameterization."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-194",
"text": "Missing Source Sentences."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-195",
"text": "There can be missing source sentences in multi-source corpora."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-196",
"text": "Nishimura et al. (2018b) extended by representing each \"missing\" source language with a dummy token."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-197",
"text": "Choi et al. (2018) and Nishimura et al. (2018a) further proposed to use MT generated synthetic sentences, instead of a dummy token for the missing source languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-198",
"text": "Post-Editing."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-199",
"text": "Instead of having a translator translate from scratch, multi-source NMT can be used to generate high quality translations."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-200",
"text": "The translations can then be post-edited, a process that is less labor intensive and cheaper compared to translating from scratch."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-201",
"text": "Multi-source NMT has been used for post-editing where the translated sentence is used as an additional source, leading to improvements (Chatterjee et al., 2017)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-202",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-203",
"text": "**MULTILINGUALISM IN OLDER PARADIGMS**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-204",
"text": "One of the long term goals of the MT community is the development of architectures that can handle more than two languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-205",
"text": "RBMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-206",
"text": "To this end, rule-based systems (RBMT) using an interlingua were explored widely in the past."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-207",
"text": "The interlingua is a symbolic semantic, language-independent representation for natural language text (Sgall and Panevov\u00e1, 1987) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-208",
"text": "Two popular interlinguas are UNL (Uchida, 1996) and AMR (Banarescu et al., 2013) Different interlinguas have been proposed in various systems like KANT (E. H. Nyberg and Carbonell, 1997), UNL, UNITRAN (Dorr, 1987) and DLT (Witkam, 2006) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-209",
"text": "Language specific analyzers converted language input to interlingua, while language specific decoders converted the interlingua into another language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-210",
"text": "To achieve an unambiguous semantic representation, a lot of linguistic analysis had to be performed and many linguistic resources were required."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-211",
"text": "Hence, in practice, most interlingua systems were limited to research systems or translation in specific domains and could not scale to many languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-212",
"text": "Over time most MT research focused on building bilingual systems."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-213",
"text": "SMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-214",
"text": "Phrase-based SMT (PBSMT) systems (Koehn et al., 2003) , a very successful MT paradigm, were also bilingual for the most part."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-215",
"text": "Compared to RBMT, PBSMT requires less linguistic resources and instead requires parallel corpora."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-216",
"text": "However, like RBMT, they work with symbolic, discrete representations making multilingual representation difficult."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-217",
"text": "Moreover, the central unit in PBSMT is the phrase, an ordered sequence of words (not in the linguistic sense)."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-218",
"text": "Given its arbitrary structure, it is not clear how to build a common symbolic representation for phrases across languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-219",
"text": "Nevertheless, some shallow forms of multilingualism have been explored in the context of: (a) pivot-based SMT, (b) multi-source PBSMT, and (c) SMT involving related languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-220",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-221",
"text": "**PIVOTING.**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-222",
"text": "Popular solutions are: chaining source-pivot and pivot-target systems at decoding (Utiyama and Isahara, 2007) , training a sourcetarget system using synthetic data generated using target-pivot and pivot-source systems (Gispert and Marino, 2006) , and phrase-table triangulation pivoting source-pivot and pivot-target phrase tables (Utiyama and Isahara, 2007; Wu and Wang, 2007) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-223",
"text": "Multi-source."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-224",
"text": "Typical approaches are: re-ranking outputs from independent source-target systems (Och and Ney, 2001) , composing a new output from independent source-target outputs (Matusov et al., 2006) , and translating a combined input representation of multiple sources using lattice networks over multiple phrase tables (Schroeder et al., 2009 )."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-225",
"text": "Related languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-226",
"text": "For multilingual translation with multiple related source languages, the typical approaches involved script unification by mapping to a common script such as Devanagari (Banerjee et al., 2018) or transliteration (Nakov and Ng, 2009 )."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-227",
"text": "Lexical similarity was utilized using subword-level translation models (Vilar et al., 2007; Tiedemann, 2012a; Bhattacharyya, 2016, 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-228",
"text": "Combining subword-level representation and pivoting for translation among related languages has been explored (Henr\u00edquez et al., 2011; Tiedemann, 2012a; Kunchukuttan et al., 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-229",
"text": "Most of the above mentioned multilingual systems involved either decoding-time operations, chaining black-box systems or composing new phrase-tables from existing ones."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-230",
"text": "Comparison with MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-231",
"text": "While symbolic representations constrain a unified multilingual representation, distributed universal language representation using real-valued vector spaces makes multilingualism easier to implement in NMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-232",
"text": "As no language specific feature engineering is required for NMT, making it possible to scale to multiple languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-233",
"text": "Neural networks provide flexibility in experimenting with a wide variety of architectures, while advances in optimization techniques and availability of deep learning toolkits make prototyping faster."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-234",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-235",
"text": "**DATASETS AND RESOURCES**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-236",
"text": "MNMT requires parallel corpora in similar domains across multiple languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-237",
"text": "Multiway."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-238",
"text": "Commonly used publicly available multilingual parallel corpora are the TED corpus (Mauro et al., 2012) , UN Corpus (Ziemski et al., 2016a) and those from the European Union like Europarl, JRC-Aquis, DGT-Aquis, DGT-TM, ECDC-TM, EAC-TM (Steinberger et al., 2014) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-239",
"text": "While these sources are primarily comprised of European languages, parallel corpora for some Asian languages is accessible through the WAT shared task (Nakazawa et al., 2018) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-240",
"text": "Only small amount of parallel corpora are available for many languages, primarily from movie subtitles and software localization strings (Tiedemann, 2012b) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-241",
"text": "Low or Zero-Resource."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-242",
"text": "For low or zero-resource NMT translation tasks, good test sets are required for evaluating translation quality."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-243",
"text": "The above mentioned multilingual parallel corpora can be a source for such test sets."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-244",
"text": "In addition, there are other small parallel datasets like the FLORES dataset for English-{Nepali,Sinhala} (Guzm\u00e1n et al., 2019) , the XNLI test set spanning 15 languages (Conneau et al., 2018b ) and the Indic parallel corpus (Birch et al., 2011) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-245",
"text": "The WMT shared tasks (Bojar et al., 2018 ) also provide test sets for some low-resource language pairs."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-246",
"text": "Multi-Source."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-247",
"text": "The corpora for multi-source NMT have to be aligned across languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-248",
"text": "Multi-source corpora can be extracted from some of the above mentioned sources."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-249",
"text": "The following are widely used for evaluation in the literature: Europarl (Koehn, 2005) , TED (Tiedemann, 2012b) , UN (Ziemski et al., 2016b) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-250",
"text": "The Indian Language Corpora Initiative (ILCI) corpus (Jha, 2010) is a 11-way parallel corpus of Indian languages along with English."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-251",
"text": "The Asian Language Treebank (Thu et al., 2016 ) is a 9-way parallel corpus of South-East Asian languages along with English, Japanese and Bengali."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-252",
"text": "The MMCR4NLP project compiles language family grouped multisource corpora and provides standard splits."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-253",
"text": "Shared Tasks."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-254",
"text": "Recently, shared tasks with a focus on multilingual translation have been conducted at IWSLT (Cettolo et al., 2017) , WAT (Nakazawa et al., 2018) and WMT (Bojar et al., 2018) ; so common benchmarks are available."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-255",
"text": "High quality parallel corpora are limited to specific domains."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-256",
"text": "Both, vanilla SMT and NMT perform poorly for domain specific translation in low-resource scenarios (Duh et al., 2013; Koehn and Knowles, 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-257",
"text": "Leveraging out-of-domain parallel corpora and in-domain monolingual corpora for in-domain translation is known as domain adaptation for MT (Chu and Wang, 2018) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-258",
"text": "As we can treat each domain as a language, there are many similarities and common approaches between MNMT and domain adaptation for NMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-259",
"text": "Therefore, similar to MNMT, when using out-of-domain parallel corpora for domain adaptation, multi-domain NMT and transfer learning based approaches (Chu et al., 2017) have been proposed for domain adaptation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-260",
"text": "When using in-domain monolingual corpora, a typical way of doing domain adaptation is generating a pseduo-parallel corpus by back-translating target in-domain monolingual corpora (Sennrich et al., 2016a) , which is similar to the pseduo-parallel corpus generation in MNMT (Firat et al., 2016b) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-262",
"text": "While pivoting is a popular approach for MNMT (Cheng et al., 2017) , it is unsuitable for domain adaptation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-263",
"text": "As there are always vocabulary overlaps between different domains, there are no zero-shot translation (Johnson et al., 2017) settings in domain adaptation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-264",
"text": "In addition, it not uncommon to write domain specific sentences in different styles and so multi-source approaches are not applicable either."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-265",
"text": "On the other hand, data selection approaches in domain adaptation that select out-of-domain sentences which are similar to in-domain sentences (2017a) have not been applied to MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-266",
"text": "In addition, instance weighting approaches (Wang et al., 2017b ) that interpolate in-domain and out-of-domain models have not been studied for MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-267",
"text": "However, with the development of cross-lingual sentence embeddings, data selection and instance weighting approaches might be applicable for MNMT in the near future."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-268",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-269",
"text": "**FUTURE RESEARCH DIRECTIONS**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-270",
"text": "While exciting advances have been made in MNMT in recent years, there are still many interesting directions for exploration."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-271",
"text": "Language Agnostic Representation Learning."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-272",
"text": "A core question that needs further investigation is: how do we build encoder and decoder representations that are language agnostic?"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-273",
"text": "Particularly, the questions of word-order divergence between the source languages and variable length encoder representations have received little attention."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-274",
"text": "Multiple Target Language MNMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-275",
"text": "Most current efforts address multiple source languages."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-276",
"text": "Multiway systems for multiple low-resource target languages need more attention."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-277",
"text": "The right balance between sharing representations vs. maintaining the distinctiveness of the target language for generation needs exploring."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-278",
"text": "Explore Pre-training Models."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-279",
"text": "Pre-training embeddings, encoders and decoders have been shown to be useful for NMT (Ramachandran et al., 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-280",
"text": "How pre-training can be incorporated into different MNMT architectures, is an important as well."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-281",
"text": "Recent advances in cross-lingual word (Klementiev et al., 2012; Mikolov et al., 2013; Chandar et al., 2014; Artetxe et al., 2016; Conneau et al., 2018a; Jawanpuria et al., 2019) and sentence embeddings (Conneau et al., 2018b; Chen et al., 2018a; Artetxe and Schwenk, 2018) could provide directions for this line of investigation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-282",
"text": "Related Languages, Language Registers and Dialects."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-283",
"text": "Translation involving related languages, language registers and dialects can be further explored given the importance of this use case."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-284",
"text": "Code-Mixed Language."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-285",
"text": "Addressing intrasentence multilingualism i.e. code mixed input and output, creoles and pidgins is an interesting research direction."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-286",
"text": "The compact MNMT models can handle code-mixed input, but code-mixed output remains an open problem (Johnson et al., 2017) ."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-287",
"text": "Multilingual and Multi-Domain NMT."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-288",
"text": "Jointly tackling multilingual and multi-domain translation is an interesting direction with many practical use cases."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-289",
"text": "When extending an NMT system to a new language, the parallel corpus in the domain of interest may not be available."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-290",
"text": "Transfer learning in this case has to span languages and domains."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-291",
"text": "----------------------------------"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-292",
"text": "**CONCLUSION**"
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-293",
"text": "MNMT has made rapid progress in the recent past."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-294",
"text": "In this survey, we have covered literature pertaining to the major scenarios we identified for multilingual NMT: multiway, low or zeroresource (transfer learning, pivoting, and zeroshot approaches) and multi-source translation."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-295",
"text": "We have systematically compiled the principal design approaches and their variants, central MNMT issues and their proposed solutions along with their strengths and weaknesses."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-296",
"text": "We have put MNMT in a historical perspective w.r.t work on multilingual RBMT and SMT systems."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-297",
"text": "We suggest promising and important directions for future work."
},
{
"sent_id": "b1c9b8e24916b136948610383f8ea2-C001-298",
"text": "We hope that this survey paper could significantly promote and accelerate MNMT research."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"b1c9b8e24916b136948610383f8ea2-C001-20"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-27"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-34"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-92"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-101",
"b1c9b8e24916b136948610383f8ea2-C001-102"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-109"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-168"
],
[
"b1c9b8e24916b136948610383f8ea2-C001-263"
]
],
"cite_sentences": [
"b1c9b8e24916b136948610383f8ea2-C001-20",
"b1c9b8e24916b136948610383f8ea2-C001-27",
"b1c9b8e24916b136948610383f8ea2-C001-34",
"b1c9b8e24916b136948610383f8ea2-C001-92",
"b1c9b8e24916b136948610383f8ea2-C001-102",
"b1c9b8e24916b136948610383f8ea2-C001-109",
"b1c9b8e24916b136948610383f8ea2-C001-168",
"b1c9b8e24916b136948610383f8ea2-C001-263"
]
},
"@FUT@": {
"gold_contexts": [
[
"b1c9b8e24916b136948610383f8ea2-C001-286"
]
],
"cite_sentences": [
"b1c9b8e24916b136948610383f8ea2-C001-286"
]
}
}
},
"ABC_ffd65a1a02c852a2670b471fb4b110_10": {
"x": [
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-120",
"text": "**CONCLUSION**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-2",
"text": "Modeling semantic plausibility requires commonsense knowledge about the world and has been used as a testbed for exploring various knowledge representations."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-46",
"text": "**TASK**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-3",
"text": "Previous work has focused specifically on modeling physical plausibility and shown that distributional methods fail when tested in a supervised setting."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-4",
"text": "At the same time, distributional models, namely large pretrained language models, have led to improved results for many natural language understanding tasks."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-5",
"text": "In this work, we show that these pretrained language models are in fact effective at modeling physical plausibility in the supervised setting."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-6",
"text": "We therefore present the more difficult problem of learning to model physical plausibility directly from text."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-7",
"text": "We create a training set by extracting attested events from a large corpus, and we provide a baseline for training on these attested events in a self-supervised manner and testing on a physical plausibility task."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-8",
"text": "We believe results could be further improved by injecting explicit commonsense knowledge into a distributional model."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-11",
"text": "A person riding a camel is a common event, and one would expect the subject-verb-object (s-v-o) triple person-ride-camel to be attested in a large corpus."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-12",
"text": "In contrast, gorilla-ride-camel is uncommon, likely unattested, and yet still semantically plausible."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-13",
"text": "Modeling semantic plausibility then requires distinguishing these plausible events from the semantically nonsensical, e.g. lake-ridecamel."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-14",
"text": "Semantic plausibility is a necessary part of many natural language understanding (NLU) tasks including narrative interpolation (Bowman et al., 2016) , story understanding (Mostafazadeh et al., 2016) , paragraph reconstruction (Li and Jurafsky, 2017) , and hard coreference resolution (Peng Event Plausible?"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-15",
"text": "bird-construct-nest et al., 2015) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-16",
"text": "Furthermore, the problem of modeling semantic plausibility has itself been used as a testbed for exploring various knowledge representations."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-17",
"text": "In this work, we focus specifically on modeling physical plausibility as presented by Wang et al. (2018) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-18",
"text": "This is the problem of determining if a given event, represented as an s-v-o triple, is physically plausible (Table 1) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-19",
"text": "We show that in the original supervised setting a distributional model, namely a novel application of BERT (Devlin et al., 2019) , significantly outperforms the best existing method which has access to manually labeled physical features (Wang et al., 2018) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-20",
"text": "Still, the generalization ability of supervised models is limited by the coverage of the training set."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-21",
"text": "We therefore present the more difficult problem of learning physical plausibility directly from text."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-22",
"text": "We create a training set by parsing and extracting attested s-v-o triples from English Wikipedia, and we provide a baseline for training on this dataset and evaluating on Wang et al. (2018) 's physical plausibility task."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-23",
"text": "We also experiment training on a large set of s-v-o triples extracted from the web as part of the NELL project (Carlson et al., 2010) , and find that Wikipedia triples result in better performance."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-44",
"text": "Selectional preference is one factor in plausibility and thus the two should correlate."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-24",
"text": "arXiv:1911.05689v1 [cs.CL] 13 Nov 2019 Wang et al. (2018) present the semantic plausibility dataset that we use for evaluation in this work, and they show that distributional methods fail on this dataset."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-25",
"text": "This conclusion aligns with other work showing that GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013) embeddings do not encode some salient features of objects (Li and Gauthier, 2017) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-26",
"text": "More recent work has similarly concluded that large pretrained language models only learn attested physical knowledge (Forbes et al., 2019) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-27",
"text": "Other datasets which include plausibility ratings are smaller in size and missing atypical but plausible events (Keller and Lapata, 2003) , or concern the more complicated problem of multi-event inference in natural language (Zhang et al., 2017; Sap et al., 2019) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-28",
"text": "Complementary to our work are methods of extracting physical features from a text corpus (Wang et al., 2017; Forbes and Choi, 2017; Bagherinezhad et al., 2016) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-30",
"text": "**DISTRIBUTIONAL MODELS**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-31",
"text": "Motivated by the distributional hypothesis that words in similar contexts have similar meanings (Harris, 1954) , distributional methods learn the representation of a word based on the distribution of its context."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-32",
"text": "The occurrence counts of bigrams in a corpus are correlated with human plausibility ratings (Lapata et al., 1999, 2001), so one might expect that with a large enough corpus, a distributional model would learn to distinguish plausible but atypical events from implausible ones."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-33",
"text": "As a counterexample, \u00d3 S\u00e9aghdha (2010) has shown that the subject-verb bigram carrot-laugh occurs 855 times in a web corpus, while manservant-laugh occurs zero times."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-34",
"text": "Not everything that is physically plausible occurs, and not everything that occurs is attested, due to reporting bias (Gordon and Van Durme, 2013); therefore, modeling semantic plausibility requires systematic inference beyond a distributional cue."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-35",
"text": "We focus on the masked language model BERT as a distributional model."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-36",
"text": "BERT has led to improved results across a variety of NLU benchmarks (Rajpurkar et al., 2018; Wang et al., 2019), including tasks that require explicit commonsense reasoning such as the Winograd Schema Challenge (Sakaguchi et al., 2019)."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-38",
"text": "**SELECTIONAL PREFERENCE**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-39",
"text": "Closely related to semantic plausibility is selectional preference (Resnik, 1996) which concerns the semantic preference of a predicate for its arguments."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-40",
"text": "Here, preference refers to the typicality of arguments: while it is plausible that a gorilla rides a camel, it is not preferred."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-41",
"text": "Current approaches to selectional preference are distributional (Erk et al., 2010; Van de Cruys, 2014) and have shown limited performance in capturing semantic plausibility (Wang et al., 2018) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-42",
"text": "\u00d3 S\u00e9aghdha and Korhonen (2012) have investigated combining a lexical hierarchy with a distributional approach, and there have been related attempts at grounding selectional preference in visual perception (Bergsma and Goebel, 2011; Shutova et al., 2015)."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-43",
"text": "Models of selectional preference are either evaluated on a pseudo-disambiguation task, where attested predicate-argument tuples must be disambiguated from pseudo-negative random tuples, or evaluated on their correlation with human plausibility judgments."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-47",
"text": "Following existing work, we focus on the task of single-event, physical plausibility."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-48",
"text": "This is the problem of determining if a given event, represented as an s-v-o triple, is physically plausible."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-49",
"text": "We use Wang et al. (2018) 's physical plausibility dataset for evaluation."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-50",
"text": "This dataset consists of 3,062 s-v-o triples, built from a vocabulary of 150 verbs and 450 nouns, and containing a diverse combination of both typical and atypical events balanced between the plausible and implausible categories."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-51",
"text": "The set of events and ground truth labels were manually curated."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-53",
"text": "**SUPERVISED**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-54",
"text": "In the supervised setting, a model is trained and tested on labelled events from the same distribution."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-55",
"text": "Therefore, both the training and test set capture typical and atypical plausibility."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-56",
"text": "We follow the same evaluation procedure as previous work and perform cross validation on the 3,062 labeled triples (Wang et al., 2018) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-58",
"text": "**LEARNING FROM TEXT**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-59",
"text": "We also present the problem of learning to model physical plausibility directly from text."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-60",
"text": "In this new setting, a model is trained on events extracted from a large corpus and evaluated on a physical plausibility task."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-61",
"text": "Therefore, only the test set covers both typical and atypical plausibility."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-62",
"text": "We create two training sets based on separate corpora: first, we parse English Wikipedia using the StanfordNLP neural pipeline (Qi et al., 2018) and extract attested s-v-o triples."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-63",
"text": "Wikipedia has led to relatively good results for selectional preference (Zhang et al., 2019) , and in total we extract 6 million unique triples with a cumulative 10 million occurrences."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-64",
"text": "Second, we use the NELL (Carlson et al., 2010) dataset of 604 million s-v-o triples extracted from the dependency parsed ClueWeb09 dataset."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-65",
"text": "For NELL, we filter out triples with non-alphabetic characters or fewer than 5 occurrences, resulting in a total of 2.5 million unique triples with a cumulative 112 million occurrences."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-66",
"text": "For evaluation, we split Wang et al. (2018)'s 3,062 triples into equal sized validation and test sets."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-67",
"text": "Each set thus consists of 1,531 triples."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-69",
"text": "**METHODS**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-70",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-71",
"text": "**NN**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-72",
"text": "As a baseline, we consider the performance of a neural method for selectional preference (Van de Cruys, 2014)."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-73",
"text": "This method is a two-layer artificial neural network (NN) over static embeddings."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-74",
"text": "Supervised."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-75",
"text": "We reproduce the results of Wang et al. (2018) using GloVe embeddings and the same hyperparameter settings."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-76",
"text": "Self-Supervised."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-77",
"text": "We use this same method for learning from text (Subsection 3.2)."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-78",
"text": "To do so, we turn the training data into a self-supervised training set: attested events are considered to be plausible, and pseudo-implausible events are created by sampling each word in an s-v-o triple independently by occurrence frequency."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-79",
"text": "We do hyperparameter search on the validation set over learning rates in {1e-3, 1e-4, 1e-5, 2e-5}, batch sizes in {16, 32, 64, 128}, and epochs in {0.5, 1, 2}."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-81",
"text": "**BERT**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-82",
"text": "We use BERT for modeling semantic plausibility by simply treating this as a sequence classification task."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-83",
"text": "We tokenize the input s-v-o triple and introduce new entity marker tokens to separate each word."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-84",
"text": "We then add a single-layer NN to classify the input based on the final-layer representation of the [CLS] token."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-85",
"text": "We use BERT-large and fine-tune the entire model during training."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-87",
"text": "Supervised."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-88",
"text": "We do no hyperparameter search and simply use the default hyperparameter configuration which has been shown to work well for other commonsense reasoning tasks (Ruan et al., 2019) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-89",
"text": "BERT-large sometimes fails to train on small datasets (Devlin et al., 2019; Niven and Kao, 2019); therefore, we restart training with a new random seed when the training loss fails to decrease by more than 10%."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-90",
"text": "Self-Supervised."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-91",
"text": "We perform learning from text (Subsection 3.2) by creating a self-supervised training set in exactly the same way as for the NN method."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-92",
"text": "The hyperparameter configuration is determined by grid search on the validation set over learning rates in {1e-5, 2e-5, 3e-5}, batch sizes in {8, 16}, and epochs in {0.5, 1, 2}."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-94",
"text": "**RESULTS**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-96",
"text": "**SUPERVISED**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-118",
"text": "The baseline NN method in particular seems to learn very little from training on the NELL dataset."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-97",
"text": "For the supervised setting, we follow the same evaluation procedure as Wang et al. (2018): we perform 10-fold cross validation on the dataset of 3,062 s-v-o triples, and report the mean accuracy of running this procedure 20 times, all with the same model initialization (Table 3)."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-98",
"text": "BERT outperforms existing methods by a large margin, including those with access to manually labeled physical features."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-99",
"text": "We conclude from these results that distributional data does provide a strong cue for semantic plausibility in the supervised setting of Wang et al. (2018). Table 3 (model accuracy): Random 0.50; NN (Van de Cruys, 2014) 0.68; NN+WK (Wang et al., 2018) 0.76; fine-tuned BERT 0.89."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-100",
"text": "Examples of positive and negative results for BERT are presented in Table 4 ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-101",
"text": "There is no immediately obvious pattern in the cases where BERT misclassifies an event."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-102",
"text": "We therefore consider events for which BERT gave a consistent estimate across all 20 runs of cross-validation."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-103",
"text": "Of these, we present the event for which BERT was most confident."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-104",
"text": "We note that due to the limited vocabulary size of the dataset, the training set always covers the test set vocabulary when performing 10-fold cross validation."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-105",
"text": "That is to say, every word in the test set has been seen in a different triple in the training set."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-106",
"text": "For example, every verb occurs in 20 triples; therefore, on average, a verb in the test set has already been seen in 18 training triples."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-107",
"text": "Supervised performance is dependent on the coverage of the training set vocabulary (Moosavi and Strube, 2017), and it is prohibitively expensive to have a high coverage of plausibility labels across all English verbs and nouns."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-108",
"text": "Furthermore, supervised models are susceptible to annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018) and do not necessarily even learn the desired relation, or in fact any relation, between words (Levy et al., 2015) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-109",
"text": "This is our motivation for reframing semantic plausibility as a task to be learned directly from text, a new setting in which the training set vocabulary is independent of the test set."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-111",
"text": "**LEARNING FROM TEXT**"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-112",
"text": "For learning from text (Subsection 3.2), we report both the validation and test accuracies of classifying physically plausible events (Table 5) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-113",
"text": "BERT fine-tuned on Wikipedia performs best, although it only partially captures semantic plausibility, with a test set accuracy of 63%."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-114",
"text": "Performance may benefit from injecting explicit commonsense knowledge into the model, an approach which has previously been used in the supervised setting (Wang et al., 2018) ."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-115",
"text": "Interestingly, BERT is biased towards labelling events as plausible."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-116",
"text": "For the best performing model, for example, 78% of errors are false positives."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-117",
"text": "Models trained on Wikipedia events consistently outperform those trained on NELL, which is consistent with our subjective assessment of the cleanliness of these datasets."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-121",
"text": "We show that large, pretrained language models are effective at modeling semantic plausibility in the supervised setting."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-122",
"text": "However, supervised models are limited by the coverage of the training set; thus, we reframe modeling semantic plausibility as a self-supervised task and present a baseline based on a novel application of BERT."
},
{
"sent_id": "ffd65a1a02c852a2670b471fb4b110-C001-123",
"text": "We believe that self-supervised results could be further improved by incorporating explicit commonsense knowledge, as well as further incidental signals (Roth, 2017) from text."
}
],
"y": {
"@UNSURE@": {
"gold_contexts": [
[
"ffd65a1a02c852a2670b471fb4b110-C001-17"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-99"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-114"
]
],
"cite_sentences": [
"ffd65a1a02c852a2670b471fb4b110-C001-17",
"ffd65a1a02c852a2670b471fb4b110-C001-99",
"ffd65a1a02c852a2670b471fb4b110-C001-114"
]
},
"@DIF@": {
"gold_contexts": [
[
"ffd65a1a02c852a2670b471fb4b110-C001-19"
]
],
"cite_sentences": [
"ffd65a1a02c852a2670b471fb4b110-C001-19"
]
},
"@USE@": {
"gold_contexts": [
[
"ffd65a1a02c852a2670b471fb4b110-C001-22"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-24"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-49"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-56"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-97"
]
],
"cite_sentences": [
"ffd65a1a02c852a2670b471fb4b110-C001-22",
"ffd65a1a02c852a2670b471fb4b110-C001-24",
"ffd65a1a02c852a2670b471fb4b110-C001-49",
"ffd65a1a02c852a2670b471fb4b110-C001-56",
"ffd65a1a02c852a2670b471fb4b110-C001-97"
]
},
"@BACK@": {
"gold_contexts": [
[
"ffd65a1a02c852a2670b471fb4b110-C001-24"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-41"
]
],
"cite_sentences": [
"ffd65a1a02c852a2670b471fb4b110-C001-24",
"ffd65a1a02c852a2670b471fb4b110-C001-41"
]
},
"@SIM@": {
"gold_contexts": [
[
"ffd65a1a02c852a2670b471fb4b110-C001-56"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-75"
],
[
"ffd65a1a02c852a2670b471fb4b110-C001-97"
]
],
"cite_sentences": [
"ffd65a1a02c852a2670b471fb4b110-C001-56",
"ffd65a1a02c852a2670b471fb4b110-C001-75",
"ffd65a1a02c852a2670b471fb4b110-C001-97"
]
},
"@EXT@": {
"gold_contexts": [
[
"ffd65a1a02c852a2670b471fb4b110-C001-66"
]
],
"cite_sentences": [
"ffd65a1a02c852a2670b471fb4b110-C001-66"
]
}
}
},
"ABC_183cf87042a3ad2180ead67555d247_10": {
"x": [
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-60",
"text": "**TASK DEFINITION AND MODEL OVERVIEW**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-2",
"text": "Most of the existing pre-trained language representation models neglect to consider the linguistic knowledge of texts, whereas we argue that such knowledge can promote language understanding in various NLP tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-3",
"text": "In this paper, we propose a novel language representation model called SentiLR, which introduces word-level linguistic knowledge including part-of-speech tag and prior sentiment polarity from SentiWordNet to benefit the downstream tasks in sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-4",
"text": "During pre-training, we first acquire the prior sentiment polarity of each word by querying the SentiWordNet dictionary with its partof-speech tag."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-5",
"text": "Then, we devise a new pretraining task called label-aware masked language model (LA-MLM) consisting of two subtasks: 1) word knowledge recovering given the sentence-level label; 2) sentence-level label prediction with linguistic knowledge enhanced context."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-6",
"text": "Experiments show that Sen-tiLR achieves state-of-the-art performance on several sentence-level / aspect-level sentiment analysis tasks by fine-tuning, and also obtain comparative results on general language understanding tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-9",
"text": "Recently, pre-trained language representation models such as GPT (Radford et al., 2018 (Radford et al., , 2019 , ELMo (Peters et al., 2018) , BERT (Devlin et al., 2019) and XLNet have achieved promising results in NLP tasks, including reading comprehension (Rajpurkar et al., 2016) , natural language inference (Bowman et al., 2015; Williams et al., 2018) and sentiment classification (Socher et al., 2013) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-10",
"text": "These models capture contextual information from large-scale unlabelled corpora via well-designed pre-training * Equal contribution \u2020 Corresponding author: Minlie Huang tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-11",
"text": "The literature has commonly reported that pre-trained models can be used as effective feature extractors and achieve state-of-the-art performance on various downstream tasks ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-12",
"text": "Although pre-trained language representation models have achieved transformative performance, the pre-training tasks like masked language model and next sentence prediction (Devlin et al., 2019) neglect to consider the linguistic knowledge."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-13",
"text": "We argue that such knowledge is important for some NLP tasks, particularly for sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-14",
"text": "For instance, existing work has shown that linguistic knowledge including part-ofspeech tag (Qian et al., 2015; and prior sentiment polarity of each word is closely related to the sentiment of longer texts such as sentences and paragraphs."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-15",
"text": "We argue that pre-trained models enriched with the linguistic knowledge of words will benefit the understanding of the sentiment of the whole texts, thereby resulting in better performance on sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-16",
"text": "Although directly introducing the linguistic knowledge from external linguistic resources is feasible, it remains a challenge for the model to learn beneficial knowledge-aware representation that promotes the downstream tasks in sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-17",
"text": "The linguistic knowledge roughly reflects different impacts of individual words on the sentiment of a whole sentence."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-18",
"text": "Some of these words may act as sentiment shifters."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-19",
"text": "For example, negation words constantly change the sentiment to the opposite polarity (Zhu et al., 2014) , while intensity words modify the valence degree, i.e., sentiment intensity of the text ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-58",
"text": "**MODEL**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-20",
"text": "However, the sentiment labels of sentences are commonly derived from multiple sentiment shifts induced by words, and modeling the complex relationship between the sentence-level senti-ment labels and word-level sentiment shifts is still underexplored."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-21",
"text": "Thus, the goal of our research is to fully employ the linguistic knowledge to get language representation entailing the connection between high-level labels and words, which improves the performance in the tasks of sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-22",
"text": "In this paper, we propose a novel pre-trained language representation model called SentiLR to deal with this challenge."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-23",
"text": "First, to acquire the linguistic knowledge of each word, we utilize Senti-WordNet 3.0 (Baccianella et al., 2010) as our linguistic resource."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-24",
"text": "Specifically, we look up the sentiment scores of words with corresponding partof-speech tags in SentiWordNet."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-25",
"text": "Since we can not accurately match the meaning of each word with the sense in SentiWordNet, we compute a weighted sum of the sentiment score of all the senses as the prior sentiment polarity for each word (Guerini et al., 2013) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-26",
"text": "Then, to capture the relationship between sentence-level labels and word-level sentiment shifts using linguistic knowledge, we devise a novel pre-training task called label-aware masked language model."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-27",
"text": "This task contains two sub-tasks: 1) predicting a masked word, part-of-speech tag, and sentiment polarity at masked positions given the sentence-level sentiment label; 2) predicting the sentence-level label, the masked word and its linguistic knowledge including part-of-speech tag and sentiment polarity simultaneously."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-28",
"text": "These two sub-tasks are expected to encourage the model to utilize linguistic knowledge to build the connection between high-level sentiment labels and low-level sentiment shifts."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-29",
"text": "Our contributions are three folds:"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-30",
"text": "\u2022 We analyze the importance of incorporating linguistic knowledge into pre-trained language representation models, and we observe that effectively leveraging linguistic knowledge benefits the sentiment analysis tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-31",
"text": "\u2022 We propose a novel pre-trained language representation model called SentiLR, which acquires word-level sentiment polarity from SentiWordNet and adopts label-aware masked language model to capture the relationship between sentence-level sentiment labels and word-level sentiment shifts."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-32",
"text": "\u2022 We conduct experiments on sentence-level / aspect-level sentiment classification tasks and show that SentiLR can outperform stateof-the-art pre-trained language representation models such as BERT and XLNet."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-33",
"text": "2 Related Work"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-35",
"text": "**PRE-TRAINED LANGUAGE REPRESENTATION MODEL**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-36",
"text": "Early work on pre-trained language representation models mainly focuses on distributed word representations, such as word2vec (Mikolov et al., 2013) and Glove (Pennington et al., 2014) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-37",
"text": "Since the distributed word representation is independent of context, it's challenging for such representation to model the complex word characteristics under different contexts."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-38",
"text": "Thus contextual language representation based on pre-trained models including CoVe (McCann et al., 2017) , ELMo (Peters et al., 2018) , GPT (Radford et al., 2018 (Radford et al., , 2019 and BERT (Devlin et al., 2019) becomes prevalent recently."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-59",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-39",
"text": "These models use deep LSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017) as the encoder to acquire contextual language representation."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-40",
"text": "Various pre-training tasks were explored including traditional NLP tasks like machine translation (Mc-Cann et al., 2017) and language model (Peters et al., 2018; Radford et al., 2018 Radford et al., , 2019 , or other tasks such as masked language model and next sentence prediction (Devlin et al., 2019) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-41",
"text": "With the advent of BERT (Devlin et al., 2019) achieving state-of-the-art performances on various NLP tasks, many variants of BERT have been proposed."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-42",
"text": "Due to the important role of entities in language understanding, two heuristic ways have been studied to make the pre-trained model aware of entities, i.e. introducing knowledge graph (Zhang et al., 2019) / knowledge base (Peters et al., 2019) explicitly and designing entity-specific masking strategies during pretraining (Sun et al., 2019a,b) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-43",
"text": "Considering the implicit relationship among different NLP tasks, post-training approaches Li et al., 2019) conduct supervised training on the pretrained BERT with transfer tasks which are related to target tasks, in order to get a better initialization for target tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-44",
"text": "The model structure and the pre-training tasks of BERT are also worth exploring."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-45",
"text": "Some researchers measure the impact of key hyper-parameters to improve the undertrained BERT , and others im- prove the masked language model with masking contiguous random spans or decomposing the training objective into autoregressive language model )."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-46",
"text": "Other work propose task-specific pre-training strategies to acquire task-specific language representation applied to the corresponding tasks such as data augmentation , crosslingual analysis (Lample and Conneau, 2019) , relation extraction (Alt et al., 2019; Soares et al., 2019) and language generation (Song et al., 2019; Dong et al., 2019) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-47",
"text": "To the best of our knowledge, SentiLR is the first work to explore sentimentspecific pre-trained language representation model for downstream sentiment analysis tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-48",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-49",
"text": "**LINGUISTIC KNOWLEDGE FOR SENTIMENT ANALYSIS**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-50",
"text": "Linguistic knowledge such as part of speech and word-level sentiment polarity is commonly used as external features in sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-51",
"text": "Part of speech can facilitate the understanding of the syntactic structure of texts by improving the parsing performance (Socher et al., 2013) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-52",
"text": "It can also be incorporated into all layers of RNN as tag embeddings (Qian et al., 2015) . shows that part of speech can help to learn sentiment-favorable representations."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-53",
"text": "Word-level sentiment polarity is mostly derived from sentiment lexicons (Hu and Liu, 2004; Wilson et al., 2005) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-54",
"text": "Guerini et al. (2013) obtain the prior sentiment polarity by weighting the sentiment scores over all the senses of words in SentiWordNet (Esuli and Sebastiani, 2006; Baccianella et al., 2010)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-55",
"text": "Teng et al. (2016) propose a lexicon-based weighted-sum model, which weights the prior sentiment scores of sentiment words to obtain the sentiment label of the whole sentence."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-56",
"text": "Other work models the linguistic roles of sentiment, negation and intensity words via linguistic regularizers in the training objective."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-61",
"text": "Our task is formulated as follows: given a text sequence X = (x_1, x_2, \u00b7\u00b7\u00b7, x_n) of length n, our goal is to acquire a representation of the whole sequence H = (h_1, h_2, \u00b7\u00b7\u00b7, h_n) \u2208 R^{n\u00d7d} that captures the contextual information and the linguistic knowledge simultaneously."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-62",
"text": "In this formulation, d indicates the dimension of the representation vector."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-63",
"text": "Figure 1 shows the overview of our model pipeline which contains three stages: 1) Acquiring the prior sentiment polarity for each word with its corresponding part-of-speech tag; 2) Conducting pre-training via two tasks i.e. label-aware masked language modeling and next sentence prediction; 3) Fine-tuning on sentiment analysis tasks with different settings."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-64",
"text": "Compared with the vanilla pretrained models like BERT (Devlin et al., 2019) , our model enriches the input sequence with its linguistic knowledge including part-of-speech tags and sentiment polarity labels, and utilizes a modified masked language model to capture the relationship between sentence-level sentiment labels and word-level knowledge in addition to context dependency."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-66",
"text": "**LINGUISTIC KNOWLEDGE ACQUISITION**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-67",
"text": "This module obtains the sentiment polarity for each word with its part-of-speech tag."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-68",
"text": "The input of this module is a sequence of tuples X = ((x_1, pos_1), (x_2, pos_2), \u00b7\u00b7\u00b7, (x_n, pos_n)) containing words and part-of-speech labels tagged by external tools such as NLTK 1 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-69",
"text": "Assume that for each tuple (x_i, pos_i), 1 \u2264 i \u2264 n, we can find m different senses with their sense numbers and positive / negative scores (SN_j, PosScore_j, NegScore_j), 1 \u2264 j \u2264 m,"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-70",
"text": "in SentiWordNet due to word-sense ambiguity, where SN indicates the order of different senses and PosScore / NegScore is the positive / negative score assigned by SentiWordNet."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-71",
"text": "Since we cannot accurately match the meaning of each word in the sequence with a specific sense in SentiWordNet, we follow Guerini et al. (2013) to convert the scores of all the senses into a prior sentiment label:"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-72",
"text": "score(x_i, pos_i) = \u03a3_{j=1}^{m} (1/SN_j) \u00d7 (PosScore_j \u2212 NegScore_j) (2), where the reciprocal of the SN of each sense weights the respective score, since in SentiWordNet a smaller sense number indicates more frequent use of that sense in natural language."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-73",
"text": "Note that if we cannot find any sense for (x_i, pos_i) in SentiWordNet, the label Neutral is assigned."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-75",
"text": "**PRE-TRAINING TASKS**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-76",
"text": "During pre-training, the label-aware masked language model (LA-MLM) and next sentence prediction (NSP) are adopted as the pre-training tasks, where the setting of NSP is identical to the one proposed by Devlin et al. (2019)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-77",
"text": "Label-aware masked language model is designed to utilize the linguistic knowledge to grasp the implicit dependency between sentence-level sentiment labels and words in addition to context dependency."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-78",
"text": "It contains two separate sub-tasks, both of which take the position embedding, token embedding and segment embedding as the input."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-79",
"text": "The goal of sub-task#1 is to recover the masked sequence conditioned on the sentence-level label, as shown in Figure 2 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-80",
"text": "In this setting, we add the sentence-level sentiment embedding to the inputs, and the model is required to predict the word, part-of-speech tag and word-level sentiment polarity individually using the hidden states at the masked positions."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-81",
"text": "This sub-task explicitly exerts the impact of the high-level sentiment label on the words and the linguistic knowledge of words, enhancing the ability of our model to explore the complex connection among them."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-82",
"text": "Figure 3 : Sub-task#2 of label-aware masked language model."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-83",
"text": "Our model must predict the sentence-level sentiment label (negative) and recover the word information (word: good, part-of-speech tag: JJ, word-level sentiment polarity: positive) simultaneously."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-84",
"text": "The purpose of sub-task#2 is to predict the sentence-level label and the word information based on the hidden states at [CLS] and masked positions respectively."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-85",
"text": "From Figure 3 , we can see that the label is used as the supervision signal, which is different from sub-task#1."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-86",
"text": "The simultaneous prediction of labels, words and linguistic knowledge of words enables our model to capture the implicit relationship among them."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-87",
"text": "Since the two sub-tasks are separate, we empirically set the proportion of pre-training data provided for the two sub-tasks to be 4:1."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-88",
"text": "As for the masking strategy, we increase the masking probability of the words with positive / negative sentiment polarity from 15% in the setting of BERT to 30%, because they are more likely to cause sentiment shifts in the whole sentence."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-89",
"text": "Figure 4 : Fine-tuning settings of SentiLR on sentence-level / aspect-level sentiment classification."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-90",
"text": "In both classification tasks, (x_1, x_2, \u00b7\u00b7\u00b7, x_n) indicates the text sequence to be classified."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-91",
"text": "In the aspect-level sentiment classification task, an additional aspect term / aspect category sequence (a_1, a_2, \u00b7\u00b7\u00b7, a_l) is concatenated with the text sequence."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-93",
"text": "**FINE-TUNING SETTING**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-94",
"text": "Equipped with the ability to utilize linguistic knowledge via pre-training, our model can be fine-tuned for different sentiment analysis tasks, including sentence-level / aspect-level sentiment classification."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-95",
"text": "We follow the fine-tuning setting of existing work (Devlin et al., 2019): Sentence-level Sentiment Classification: The input of this task is a text sequence ([CLS], x_1, x_2, \u00b7\u00b7\u00b7, x_n, [SEP])."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-96",
"text": "The sentiment label is obtained based on the hidden state of [CLS]."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-97",
"text": "Aspect-level Sentiment Classification: In addition to the text sequence, the input also contains an aspect term / aspect category sequence (a_1, \u00b7\u00b7\u00b7, a_l)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-98",
"text": "The sentiment label is also acquired based on the hidden state of [CLS]."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-99",
"text": "Figure 4 illustrates the fine-tuning settings."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-101",
"text": "**EXPERIMENT**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-125",
"text": "Since MR, IMDB and Yelp-2/5 don't have validation sets, we randomly sampled subsets from the training sets as the validation sets, and tested all the models with the same data split."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-126",
"text": "The results are shown in Table 3 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-127",
"text": "We can observe that SentiLR performs better than or on par with the other baselines on MR, Yelp-2 and Yelp-5."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-128",
"text": "As for SST and IMDB, our model clearly surpasses BERT and shows comparable performance to XLNet."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-129",
"text": "This demonstrates that our model can derive the sentence-level labels based on the sentiment shifts in the sentences and get a better understanding of the sentiment in the whole text."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-130",
"text": "Aspect-level sentiment classification is an important task in sentiment analysis."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-131",
"text": "Given the aspect term / aspect category and the corresponding review, this task aims to predict the sentiment of the aspect based on the review, which evaluates the ability to capture the sentiment of specific content."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-132",
"text": "The difference between aspect term and aspect category is that the former is a specific term (e.g. fish) while the latter is a coarse-grained category (e.g. food)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-133",
"text": "For aspect term sentiment classification, we chose SemEval2014 Task 4 (laptop and restaurant domains) as the benchmarks, while for aspect category sentiment classification, we used SemEval2014 Task 4 (restaurant domain) and SemEval2016 Task 5 (restaurant domain)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-134",
"text": "The statistics of these benchmarks, including the sizes of the training / validation / test sets, the number of classes and the number of aspect terms / aspect categories, are shown in Table 4 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-135",
"text": "We followed the existing work to leave 150 examples from the training sets for validation."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-137",
"text": "**ASPECT-LEVEL SENTIMENT CLASSIFICATION**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-138",
"text": "We present the results of aspect-level sentiment classification in Table 5 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-139",
"text": "We can see that SentiLR outperforms the baselines in both accuracy and Macro-F1 on these datasets, indicating that our model can successfully grasp the sentiment of the given aspects."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-140",
"text": "Since the improvement in Macro-F1 is more notable than that in accuracy, this suggests that our model performs better across all three sentiment classes."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-141",
"text": "Due to the sparsity of aspect terms compared with aspect categories, our model achieves a larger improvement over BERT (Devlin et al., 2019) on aspect category sentiment classification."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-142",
"text": "To explore whether the performance of SentiLR on common NLP tasks will improve or degrade, we evaluated our model on the General Language Understanding Evaluation (GLUE) benchmark, which collects diverse language understanding tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-143",
"text": "We fine-tuned SentiLR on each task of GLUE respectively, and compared its performance with vanilla BERT."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-144",
"text": "Since the test sets of GLUE are not publicly available, we reported the results on development sets in Table 6 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-145",
"text": "Note that we directly used the results of BERT on SST-2, MNLI, QNLI and MRPC reported by Devlin et al. (2019), and fine-tuned the BERT model on the rest of the tasks ourselves."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-146",
"text": "From Table 6 , SentiLR clearly achieves better results on sentiment analysis tasks like SST-2."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-147",
"text": "We also observe that our model outperforms BERT on the CoLA, MRPC and QNLI tasks, and achieves comparable results on the other datasets."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-148",
"text": "Among these datasets, CoLA requires fine-grained grammaticality distinction for complex syntactic structures (Warstadt and Bowman, 2019) , which may be aided by part-of-speech tag information."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-149",
"text": "Similarly, QNLI has also been reported to be improved with external part-of-speech features (Rajpurkar et al., 2016) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-150",
"text": "Our model, which is able to utilize such linguistic knowledge, thereby achieves better performance."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-151",
"text": "Table 7 : Ablation study on sentence-level sentiment classification (MR), aspect term sentiment classification (SemEval14-Laptop) and aspect category sentiment classification (SemEval14-Restaurant), where accuracy (Acc.) and Macro-F1 (MF1.) are reported."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-153",
"text": "**ABLATION STUDY**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-154",
"text": "To study the effectiveness and significance of the introduced linguistic knowledge and the label-aware masked language model, we remove the linguistic knowledge and the two sub-tasks of the label-aware masked language model respectively, and present the results in Table 7 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-155",
"text": "Since the two subtasks are separate, the setting of -subtask#1/2 in Table 7 indicates that the pre-training data are all fed into the other sub-task."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-156",
"text": "Additionally, the -knowledge setting means that we remove the part-of-speech and sentiment polarity embeddings in the input as well as the supervision signals of linguistic knowledge in the two sub-tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-157",
"text": "The results in Table 7 show that both the linguistic knowledge and the pre-training task contribute to the final performance."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-158",
"text": "In terms of the different effects of the two sub-tasks, they perform comparably on sentence-level classification and aspect term sentiment classification."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-159",
"text": "Nevertheless, sub-task#2 seems more important to the aspect category sentiment classification as the performance degrades severely on SemEval14 (Restaurant) when sub-task#2 is ablated."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-160",
"text": "Considering the impact of knowledge, the performance of SentiLR does not degrade much compared with the setting that removes the pre-training task."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-161",
"text": "This result indicates that SentiLR doesn't only depend on the external knowledge from SentiWordNet."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-162",
"text": "The well-designed pre-training task helps the model explore the information within contexts even without the explicit knowledge and build a deep connection between the labels and words."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-163",
"text": "Table 8 : Generated sentences at the [MASK] positions given sentence-level labels 0-4. 0: The movie is of poor quality with no good comments about it. 1: The movie is of low quality with few good comments about it. 2: The movie is of decent quality with some good comments about it. 3: The movie is of good quality with several good comments about it. 4: The movie is of excellent quality with many good comments about it."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-172",
"text": "Table 9 : Statistics of the prediction at the [MASK] position given the same input sentence with different sentence-level labels."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-173",
"text": "The weighted sentiment score is computed as a weighted sum of the probability over the vocabulary and the weight for each word is obtained from the SentiWordNet via Equation (1)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-174",
"text": "The visualized generation probabilities of different sentiments are obtained by accumulating the probability of words with the respective prior sentiment polarities."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-175",
"text": "Firstly, we show that our pre-trained model can capture the deep relationship between sentence-level labels and sentiment words."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-176",
"text": "Given the same input sentence with one masked word and different sentence-level labels in the form of sentence-level embeddings, our model can recover the masked word with respect to the global sentiment."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-177",
"text": "We calculated the weighted sentiment score via \u03a3_{w \u2208 V} score(w, pos_[MASK]) \u00d7 p_[MASK](w), where p_[MASK](w) is the probability of word w at the [MASK] position computed by SentiLR, pos_[MASK] indicates the part-of-speech tag predicted by SentiLR, and score(w, pos_[MASK]) is obtained from SentiWordNet via Equation (1)."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-178",
"text": "As this weighted score reveals the sentiment polarity of the model's prediction, we can see from Table 9 that it gradually shifts from negative to positive as the sentence-level label goes from 0 to 4."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-179",
"text": "We also calculated the accumulated generation probabilities of negative, neutral and positive words defined by the prior sentiment labels to show the changes of word usage in fine-grained sentiment settings."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-180",
"text": "Secondly, we demonstrate that our model can simultaneously capture the context dependency and the sentiment-related linguistic knowledge."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-181",
"text": "We can see from Table 8 that our model chooses different words at the first [MASK] to satisfy the fine-grained sentence-level labels."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-182",
"text": "Then, our model infers the relationship between the amount of positive comments and the quality of the movie via context dependency and fills the second [MASK] with reasonable quantifiers."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-183",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-184",
"text": "**CONCLUSION**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-185",
"text": "In this paper, we propose a novel pre-trained language representation model called SentiLR, which captures not only the context dependency but also the linguistic knowledge of each word."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-186",
"text": "We introduce linguistic knowledge from SentiWordNet and design a label-aware masked language model to enable our model to utilize this knowledge in sentiment analysis tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-187",
"text": "Experiments show that our model outperforms several state-of-the-art pre-trained language representations on sentiment analysis tasks."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-103",
"text": "**PRE-TRAINING DATASET AND IMPLEMENTATION**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-104",
"text": "We adopted the Yelp Dataset Challenge 2019 2 as our pre-training dataset."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-105",
"text": "This dataset contains 6,685,900 reviews with 5-class review-level sentiment labels."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-106",
"text": "Each review consists of 8.1 sentences and 127.8 words on average."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-107",
"text": "Since our method can adapt to any BERT-style pre-training model, we used vanilla BERT (Devlin et al., 2019) as the base framework to construct Transformer blocks in this paper and leave the exploration of other models like RoBERTa as future work."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-108",
"text": "The hyperparameters of the Transformer blocks were set to be the same as BERT-Base due to the limited computational power."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-109",
"text": "Considering the large cost of training from scratch, we utilized the parameters of pre-trained BERT 3 to initialize our model."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-110",
"text": "We also followed BERT to use WordPiece vocabulary (Wu et al., 2016) with a vocabulary size of 30,522."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-111",
"text": "The maximum sequence length in the pre-training phase was 128, while the batch size was 512."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-112",
"text": "We took Adam (Kingma and Ba, 2015) as the optimizer and set the learning rate to be 5e-5."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-113",
"text": "Our model was pre-trained on Yelp Dataset Challenge 2019 for 1 epoch with label-aware masked language model and next sentence prediction."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-114",
"text": "Note that we will release all the data, code and model parameters."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-116",
"text": "**BASELINES**"
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-117",
"text": "We compared SentiLR with several state-of-the-art pre-trained language representation models: BERT: The pre-trained model based on masked language model and next sentence prediction (Devlin et al., 2019) ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-118",
"text": "XLNet: The variant of BERT which autoregressively recovers the masked tokens with a permutation language model."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-119",
"text": "For fair comparison, all the baselines in this paper were set to the base version."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-120",
"text": "The number of parameters in each model is listed in Table 1 ."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-121",
"text": "Since SentiLR adopts the same architecture of Transformer blocks as BERT, the number of parameters in these two models is almost the same and smaller than that of XLNet."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-122",
"text": "Table 1 : Number of parameters: BERT 109.486M, XLNet 117.313M, SentiLR 109.495M. The goal of sentence-level sentiment classification is to predict the sentiment labels of sentences or paragraphs, which examines the model's ability to understand the whole text."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-123",
"text": "We evaluated our model on several sentence-level sentiment classification benchmarks including Stanford Sentiment Treebank (SST) (Socher et al., 2013) , Movie Review (MR) (Pang and Lee, 2005) , IMDB (Maas et al., 2011) and Yelp-2/5 (Zhang et al., 2015) which cover widely used datasets at different scales."
},
{
"sent_id": "183cf87042a3ad2180ead67555d247-C001-124",
"text": "We report the statistics of the datasets in Table 2 , including the sizes of the training / validation / test sets, the average length and the number of classes."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"183cf87042a3ad2180ead67555d247-C001-9"
],
[
"183cf87042a3ad2180ead67555d247-C001-38"
],
[
"183cf87042a3ad2180ead67555d247-C001-40"
],
[
"183cf87042a3ad2180ead67555d247-C001-41"
]
],
"cite_sentences": [
"183cf87042a3ad2180ead67555d247-C001-9",
"183cf87042a3ad2180ead67555d247-C001-38",
"183cf87042a3ad2180ead67555d247-C001-40",
"183cf87042a3ad2180ead67555d247-C001-41"
]
},
"@MOT@": {
"gold_contexts": [
[
"183cf87042a3ad2180ead67555d247-C001-12",
"183cf87042a3ad2180ead67555d247-C001-13"
]
],
"cite_sentences": [
"183cf87042a3ad2180ead67555d247-C001-12"
]
},
"@DIF@": {
"gold_contexts": [
[
"183cf87042a3ad2180ead67555d247-C001-64"
],
[
"183cf87042a3ad2180ead67555d247-C001-141"
]
],
"cite_sentences": [
"183cf87042a3ad2180ead67555d247-C001-64",
"183cf87042a3ad2180ead67555d247-C001-141"
]
},
"@USE@": {
"gold_contexts": [
[
"183cf87042a3ad2180ead67555d247-C001-76"
],
[
"183cf87042a3ad2180ead67555d247-C001-95"
],
[
"183cf87042a3ad2180ead67555d247-C001-107"
],
[
"183cf87042a3ad2180ead67555d247-C001-145"
]
],
"cite_sentences": [
"183cf87042a3ad2180ead67555d247-C001-76",
"183cf87042a3ad2180ead67555d247-C001-95",
"183cf87042a3ad2180ead67555d247-C001-107",
"183cf87042a3ad2180ead67555d247-C001-145"
]
},
"@SIM@": {
"gold_contexts": [
[
"183cf87042a3ad2180ead67555d247-C001-76"
]
],
"cite_sentences": [
"183cf87042a3ad2180ead67555d247-C001-76"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"183cf87042a3ad2180ead67555d247-C001-117"
]
],
"cite_sentences": [
"183cf87042a3ad2180ead67555d247-C001-117"
]
}
}
},
"ABC_e3b9c00d792bcddb6eea449179e61e_10": {
"x": [
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-112",
"text": "**DATASET**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-137",
"text": "Table 3 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-138",
"text": "Standard text metric results for single-best-caption models."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-2",
"text": "Abstract."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-3",
"text": "Image Captioning is a task that requires models to acquire a multimodal understanding of the world and to express this understanding in natural language text."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-4",
"text": "While the state-of-the-art for this task has rapidly improved in terms of n-gram metrics, these models tend to output the same generic captions for similar images."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-5",
"text": "In this work, we address this limitation and train a model that generates more diverse and specific captions through an unsupervised training approach that incorporates a learning signal from an Image Retrieval model."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-6",
"text": "We summarize previous results and improve the state-of-the-art on caption diversity and novelty."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-7",
"text": "We make our source code publicly available online 1 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-10",
"text": "Image Captioning is a task that requires models to acquire a multimodal understanding of the world and to express this understanding in natural language text, making it relevant to a variety of fields from human-machine interaction to data management."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-11",
"text": "The practical goal is to automatically generate a natural language caption that describes the most relevant aspects of an image."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-12",
"text": "Most state-of-the-art neural models are built on an encoder-decoder architecture where a Convolutional Neural Network (CNN) acts as the encoder for the image features that are fed to a Recurrent Neural Network (RNN) which generates a caption by acting as a decoder."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-13",
"text": "It is also common to include one or more attention layers to focus the captions on the most salient parts of an image."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-14",
"text": "The standard way of training is through Maximum Likelihood Estimation (MLE) by using a crossentropy loss to replicate ground-truth human-written captions for corresponding images."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-15",
"text": "Recent Image Captioning models of this kind [1, 11, 12, 28] have shown impressive results, much thanks to the powerful language modelling capabilities of Long Short-Term Memory (LSTM) [15] RNNs."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-16",
"text": "However, although MLE training enables models to confidently generate captions that have a high likelihood in the training set, it limits their capacity to generate novel descriptions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-17",
"text": "Their output exhibits a disproportionate replication of common n-grams and full captions seen in the training set [9, 11, 26] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-18",
"text": "Contributing to this problem is a combination of biased datasets and insufficient quality metrics."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-19",
"text": "While the main benchmarking dataset for Image Captioning, MS COCO, makes available over 120k images with 5 human-annotated captions each [6] , the selection process for the images suggests a lack of diversity in both content and composition [11, 20] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-20",
"text": "Furthermore, the standard benchmarking metrics, based on ngram level overlap between generated captions and ground-truth captions, reward models with a bias towards common n-grams."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-21",
"text": "This leads to the (indirect and unwanted) consequence of incentivizing models that output generic captions that are likely to fit a range of similar images, despite missing the goal of describing the relevant aspects specific to each image."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-22",
"text": "In this paper, we propose a model that produces more diverse and specific captions by integrating a Natural Language Understanding (NLU) component in our training which optimizes the specificity of our Natural Language Generation (NLG) component."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-23",
"text": "Our main contribution is an unsupervised specificity-guided training approach that improves the diversity and semantic accuracy of the generated captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-24",
"text": "This approach can be applied to neural models of any multimodal NLG task (e.g. Image Captioning) where a corresponding NLU component can be made available."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-25",
"text": "We begin with an analysis of metrics for measuring caption quality in Section 2, where we define what we believe to be an informative set of metrics for our target."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-26",
"text": "Following this, in Section 3 we describe our novel training approach along with the technical details of the NLG (our Image Captioning model) and NLU components for our experiments."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-27",
"text": "In Section 4 we outline the experiments we undertook to evaluate our approach, followed by a discussion of our quantitative and qualitative results in Section 5."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-28",
"text": "We review related work in Section 6 before presenting our conclusions and suggestions for future work in Section 7."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-30",
"text": "**MEASURING CAPTION QUALITY**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-31",
"text": "The subjectivity in what defines a good caption, has made it difficult to identify a single metric for the overall quality of Image Captioning models [5, 26] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-32",
"text": "Benchmarking methods from Machine Translation [3, 19, 23] have been appropriated, while other somewhat similar methods such as CIDEr [27] have been proposed specifically for assessing the quality of image captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-33",
"text": "All these approaches unfortunately have a strong focus on replicating common n-grams from the ground-truth captions [5] and do not take into account the richness and diversity of human expression [9, 26] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-34",
"text": "Moreover, it has been found that this class of metrics suffers from poor correlations with human evaluation, with CIDEr and METEOR having the highest correlations among them [5] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-35",
"text": "With the recognition of these limitations, there has been a growing interest in developing metrics that measure other desirable qualities in captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-36",
"text": "SPICE [2] is a recent addition which measures the overlap of content by comparing automatically generated scene-graphs from the ground-truth and generated captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-37",
"text": "While being a relevant addition, it does not solve the problem of generic captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-38",
"text": "Rare occurrences and more detailed descriptions are more likely to incur a penalty than common concepts; e.g. correctly specifying a purple flower where the ground-truth text omits its color would register a false positive for the color."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-39",
"text": "This, again, encourages the ''safe'' generic captions that we want to move away from."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-41",
"text": "**DIVERSITY METRICS**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-42",
"text": "In an effort to measure the amount of generic captions produced by various Image Captioning models, [11] explores the concept of caption diversity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-43",
"text": "More recently, this concept has been employed as the focus for training and evaluation [26, 29] , and it has been proposed that improving caption diversity leads to more human-like captions [26] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-44",
"text": "This research direction is still new and lacks clear benchmarks and standardized metrics."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-45",
"text": "We propose the following set of metrics to evaluate the diversity of a model:"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-46",
"text": "\u2500 novelty -percentage of generated captions where exact duplicates are not found in the training set [11, 26, 29 ] \u2500 diversity -percentage of distinct captions (where duplicates count as a single distinct caption) out of the total number of generated captions [11] \u2500 vocabulary size -number of unique words used in generated captions [26]"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-48",
"text": "**MEANINGFUL DIVERSITY THROUGH SPECIFICITY**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-49",
"text": "The diversity metrics alone do not tell us if a diverse model is more meaningful or if it simply introduced more noise."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-50",
"text": "We argue that improving the specificity of the captions is essential to producing a meaningful increase in diversity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-51",
"text": "Our hypothesis is that by directly increasing the specificity, we will also achieve a higher diversity since diversity is a necessity for specificity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-52",
"text": "By improving both the specificity and diversity, we expect to generate qualitatively better captions that are less generic."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-53",
"text": "For this purpose, we propose a training architecture where a specificity loss is inferred by a separately trained Image Retrieval model."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-54",
"text": "Specificity is measured by two standard Image Retrieval metrics:"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-55",
"text": "\u2500 recall at k -percentage of generated captions resulting in the original image being found in the top k candidates retrieved by the Image Retrieval model \u2500 mean rank -mean rank given by the Image Retriever to the correct image based on its generated caption 3"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-57",
"text": "**OPTIMIZING FOR SPECIFICITY**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-58",
"text": "To train a model that produces more diverse and meaningful captions, we propose to use an Image Retrieval model to improve the caption specificity of an Image Captioning model."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-59",
"text": "In Image Retrieval tasks, a given query must be specific enough to retrieve the correct image among other, possibly similar, images."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-60",
"text": "In this paper, we investigate whether the error signal from an Image Retrieval model can improve caption specificity in an Image Captioning model, and whether these more specific captions are also more diverse."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-85",
"text": "The first two improve the individual similarity of a caption to its corresponding image, while the latter two implement contrastive pairwise versions of the first two."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-61",
"text": "The training process is inspired by [22] where the task is to generate Referring Expressions that unambiguously refer to a region of an image; their solution is to introduce a Region Discriminator that measures the quality of their generated expressions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-62",
"text": "Their method is in turn inspired by Generative Adversarial Networks (GANs) in which a Generator and a Discriminator are in constant competition -the Discriminator aims to distinguish between real and generated data, while the Generator aims to generate data that the Discriminator cannot tell apart from the real data [13] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-63",
"text": "In [22] , the training is cooperative rather than competitive; both systems adjust to the other to provide the best joint results."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-64",
"text": "We take a slightly different approach from both the joint training in [22] and recent applications of GAN training in Image Captioning [9, 26] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-65",
"text": "Instead of allowing both systems to learn from each other, we freeze the NLU side and allow only the NLG to learn from the NLU; the NLU model is pre-trained on ground-truth captions, without any input from the NLG."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-66",
"text": "Consequently, we avoid one of the problems observed in [22] where both systems adapt to each other and develop their own protocol of communication which gradually degrades the resemblance to human language."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-67",
"text": "We also avoid the instability in training and difficulty in loss monitoring commonly seen in GANs."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-69",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-70",
"text": "To demonstrate our training approach, we practically apply it to a neural Image Captioning model proposed in [1] which uses an encoder-decoder architecture with regionbased attention."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-71",
"text": "For our experiments, we use a publicly available re-implementation [21] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-72",
"text": "To leverage the fluency gained from MLE training, the model is pre-trained to minimize the cross-entropy loss for each ground truth sequence 1: when conditioned on an image and the attended image features 1: :"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-73",
"text": "The pre-trained model also provides a strong baseline to compare to."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-74",
"text": "The model architecture, illustrated in Fig. 1 , consists of a ResNet-101 [14] CNN pre-trained on the ImageNet [25] dataset, followed by an LSTM for attention modelling, and a second LSTM that generates the captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-75",
"text": "(Unlike [1] , the attention-regions are 14x14 regions over the final convolutional layer instead of using a region proposal network.) During our specificity training, the CNN layers remain frozen while we update the weights of the two LSTMs."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-77",
"text": "**FIG. 2. INTERACTIONS BETWEEN THE IMAGE**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-78",
"text": "Captioning and Image Retrieval models during training."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-79",
"text": "For our NLU component, we use the neural Image Retrieval model from the SentEval toolkit [8] ; the NLU is pre-trained on ground-truth data and remains frozen during our specificity training."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-80",
"text": "Given an image-caption pair, it produces the loss and gradients for our Image Captioning model by projecting the image and caption into the same space to estimate their similarity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-81",
"text": "The image embeddings are acquired by a ResNet-101 trained on ImageNet, and the captions are embedded using InferSent [7] with GloVe [24] word embeddings."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-83",
"text": "**SPECIFICITY LOSS FUNCTIONS**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-84",
"text": "We define four different loss functions to be calculated by our NLU component, each used in one of the model variations."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-86",
"text": "Let be the projected caption embedding and let be the projected image embedding, both acquired by passing the generated caption and its original corresponding image through the Image Retrieval model."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-87",
"text": "For the contrastive loss functions, let be a contrastive image chosen at random from the top 1% most similar images to based on its activations from the final convolutional layer of the encoder CNN."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-88",
"text": "We can now define the dot product similarity loss , the cosine similarity loss , the contrastive dot product loss and the contrastive cosine loss ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-89",
"text": "Equations 2 -5 define the loss functions in terms of a single example; the final loss is the mean loss over all examples."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-91",
"text": "**TRAINING**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-92",
"text": "The interactions between the NLU and NLG components are illustrated in Fig. 2 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-93",
"text": "At each iteration, the Image Captioning model generates a full caption for a given image (or a set of captions for a batch of images)."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-94",
"text": "This involves a non-differentiable sampling step to convert the word-level probabilities into a sequence of discrete words represented by 1-hot encoded vectors."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-95",
"text": "The caption is then fed to the Image Retriever along with its corresponding image, where both are passed through the embedding and projection steps."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-96",
"text": "The Image Retriever calculates one of the specificity losses defined in Section 3.2."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-97",
"text": "To minimize this loss, we need to backpropagate the gradients through the Image Retrieval model's (frozen) layers and then back through the Image Captioning model's layers that we wish to update."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-98",
"text": "This is not trivial since our forward pass includes a nondifferentiable sampling step."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-99",
"text": "To overcome this, we apply the Straight-Through method [4] and use the gradients with respect to the 1-hot encoding as an approximation for the gradients with respect to the probabilities before sampling."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-100",
"text": "We empirically validate this approach by observing that our loss decreases smoothly."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-101",
"text": "We also experimented with the similar Gumbel Straight-Through method [16] but observed no empirical benefit."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-102",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-103",
"text": "**EXPERIMENT DESIGN**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-104",
"text": "All experiments are conducted in PyTorch 2 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-105",
"text": "Our implementation extends the code of the baseline Image Captioning model by replacing the MLE training with our specificity training."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-106",
"text": "The Image Retrieval code is modified to calculate our specificity losses defined in Section 3.2."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-107",
"text": "We use the Adam [18] optimizer with an initial learning rate of 1 \u00d7 10 \u22126 for the contrastive models and 1 \u00d7 10 \u22127 for the other two models."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-108",
"text": "Early stopping is used based on the lowest mean rank on the validation set."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-109",
"text": "The contrastive models trained for about 190k iterations on the randomly shuffled training set, while the non-contrastive models trained for about 250k iterations, all using a batch-size of 2."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-110",
"text": "When sampling from the final models on the test set, any tokens that are duplicates of the immediately previous token are automatically removed since such duplicates were an issue in our non-contrastive models; we do the same for all our models, including the baseline, for a fair comparison."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-113",
"text": "We use the MS COCO dataset [20] with the Karpathy 5k splits [17] , containing 113k images for training and 5k each for validation and test, with 5 captions for each image."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-114",
"text": "The same splits were used for both the NLG and the NLU, including pre-training, ensuring that we have no overlap between training, validation and test data and that our improvements do not come from bridging a gap between different datasets."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-115",
"text": "Note that the specificity training does not require any extra data in addition to that used during pre-training."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-116",
"text": "Furthermore, since the labels are not used during our specificity training, one could also make use of unlabeled data."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-117",
"text": "All splits were pre-processed by lowercasing all words and removing punctuation."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-118",
"text": "Any words appearing less than 5 times in the training set were replaced by the UNK token, resulting in a vocabulary size of 9487 (including the UNK token)."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-120",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-121",
"text": "The models we compare to are the best models in terms of diversity from [11, 26] , using the single best caption after re-ranking for the latter."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-122",
"text": "We also report the specificity metrics used for our training goals."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-123",
"text": "The results for specificity would not be directly comparable to models using other external systems, but they are relevant when assessing our own models and verifying that our increase in diversity follows from an increase in specificity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-124",
"text": "Results from our contrastive models are averaged over 3 runs each."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-125",
"text": "The non-contrastive models are based on single runs."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-126",
"text": "As can be seen in Table 1 , our models demonstrate increased diversity and novelty, outperforming previously reported results."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-127",
"text": "The vocabulary size also increases but is lower than in [26] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-128",
"text": "When it comes to the specificity metrics, our contrastive models have the advantage over our non-contrastive ones."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-129",
"text": "They all improve the overall mean rank, but the latter do not show the increase in smaller k recalls that the contrastive models do."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-130",
"text": "This is not surprising since the contrastive models specifically minimize their loss in comparison to similar images, while the non-contrastive ones increase their semantic similarity in isolation."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-131",
"text": "The higher specificity of the contrastive models is also accompanied by higher values in diversity and novelty."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-132",
"text": "Table 2 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-133",
"text": "Novelty and diversity per image with up to 10 candidates; novelty and diversity was not reported for the single-best-caption output."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-134",
"text": "Diversity metrics for multi-candidate models Diversity within candidates Novelty within candidates CVAE [29] 11.8 82.0 GMM-CVAE [29] 59.4 80.9 AG-CVAE [29] 76.4 79.5"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-135",
"text": "For completeness, we include the best models from [29] in Table 2 ; however, they only report diversity results on multiple (up to 10) candidates per image (where duplicates of a novel caption are counted as multiple novel captions), so they are not directly comparable to the single-best-caption models."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-136",
"text": "Note that [12, 29] use different data splits, while our models and [26] use the Karpathy 5k splits [17] ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-139",
"text": "All metrics are n-gram based except for SPICE which is based on scene graphs automatically inferred from the captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-140",
"text": "[29] 0.698 0.521 0.372 0.265 0.506 0.225 0.834 0.158 GMM-CVAE [29] 0.718 0.538 0.388 0.278 0.516 0.238 0.932 0.170 AG-CVAE [29] 0 In Table 3 , we report results on the standard text metrics."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-141",
"text": "As expected, we see a slight decrease in these metrics when moving away from safer generic captions."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-142",
"text": "They are, however, still in line with our state-of-the-art baseline and slightly stronger than previous diversity-focused models."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-143",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-144",
"text": "**STANDARD TEXT METRICS**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-145",
"text": "B-n = BLEU-n R-L = ROUGE-L M = METEOR C = CIDEr S = SPICE B-1 B-2 B-3 B-4 R-L M C S D-ME+DMSM [12] - - - 0.257 - 0.236 - - Adv-samp [26] - - - - - 0.236 - 0.166 CVAE"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-147",
"text": "**QUALITATIVE ANALYSIS**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-148",
"text": "Our contrastive models tend to generate more specific (and accurate) captions while the baseline model prefers common patterns from the training data, as can be seen in the leftmost images in Fig. 3 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-149",
"text": "As is particularly evident in the bottom image, our contrastive models pay more attention to the image content (i.e. mentioning the dog) while the baseline model pays more attention to the language priors (i.e. assuming the presence of a surfboard on the beach)."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-150",
"text": "The rightmost image shows a failure case where our contrastive models focus on the wooden structure (which is more unique in this context) while omitting the skateboard (which is more common, but also more relevant)."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-151",
"text": "The improvement in diversity and specificity is not achieved by simply producing longer captions; the average caption length for the baseline, contrastive and non-contrastive models were 9.6, 9.4 and 8.9 words respectively."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-152",
"text": "Fig. 3 ."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-153",
"text": "Examples of generated captions and human annotations."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-154",
"text": "The rightmost image shows a failure case where specificity took precedence over relevance."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-155",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-156",
"text": "**RELATED WORK**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-157",
"text": "While Image Captioning has received a lot of attention, the focus has mainly been on n-gram metric results."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-158",
"text": "[11] provides some insight into the problems that follow from the standard training and metrics, noting the lack of diversity observed in captions from state-of-the-art neural models."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-159",
"text": "More recently, this has led to some initial attempts at improving caption diversity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-160",
"text": "In [9] , a GAN model conditioned on the image is proposed."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-161",
"text": "The authors do not report any quantitative results for diversity, but they show qualitative examples after manually adjusting the variance of the input to the GAN."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-162",
"text": "This demonstrates the ability of LSTMs to produce fluent captions under noisy conditions, leading to some variation in the output."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-163",
"text": "We observed a similar effect in experiments with noise-based gradients."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-164",
"text": "However, such methods are not constrained to produce meaningful diversity (as discussed in Section 3) and the level of noise that is appropriate for one caption might be too high for another."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-165",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-166",
"text": "**GENERATED CAPTIONS**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-167",
"text": "baseline: a man riding a snowboard down a snow covered slope DP: a snowboarder doing a trick in the air Cos: a snowboarder doing a trick in the air CDP: a snowboarder is jumping in the air on a snowboard CCos: a snowboarder is jumping in the air on a snowboard HUMAN CAPTIONS a picture of a man in the air on a snowboard a man doing tricks on a snowboard a man riding a snowboard through the air on a ski slope a snowboarder flies into the air under a chair lift a snowboarder does a trick while jumping through the air GENERATED CAPTIONS baseline: a man walking on the beach with a surfboard DP: a person walking on the beach with a surfboard Cos: a person walking on a beach with a surfboard CDP: a person walking on a beach with a dog CCos: a man walking on the beach with a dog HUMAN CAPTIONS a person walking their dog on the beach a man on a beach holding something while walking along it a single person walking the beach with a dog a person walking their dog on the beach a person walking their dog along the shoreline GENERATED CAPTIONS baseline: a man is doing a trick on a skateboard DP: a man doing a trick on a skateboard Cos: a man doing a trick on a skateboard CDP: a man is doing a trick on a wooden structure CCos: a man is doing a trick on a wooden structure HUMAN CAPTIONS a man on a skateboard performing a trick a man flying through the air on top of a skateboard a person on a skateboard in the air at a skate park a male skateboarder skateboards on a wall in an enclosed area a male on a skateboard performing a trick on a halfpipe"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-168",
"text": "Another example of GAN training is [26] where the Discriminator classifies whether a multi-sample set of captions are human-written or generated."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-169",
"text": "In contrast, our evaluator only requires a single caption and uses a much simpler loss function."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-170",
"text": "Furthermore, we let the NLU remain frozen during training, making the training stable and producing more informative learning curves."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-171",
"text": "A similar approach can be found in [10] where Contrastive Learning is used in a GAN-like setting."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-172",
"text": "In contrast to our approach which is unsupervised after pre-training, theirs require image-caption pairs both during and after pre-training."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-173",
"text": "Similar to our work, they are motivated by a specificity goal; unfortunately, they do not report results on any diversity metrics."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-174",
"text": "----------------------------------"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-175",
"text": "**CONCLUSION**"
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-176",
"text": "With this work, we have highlighted an important limitation in current Image Captioning research."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-177",
"text": "We provided a discussion on the limitations of current evaluation metrics and proposed a set of metrics related to diversity while emphasizing the importance of meaningful diversity."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-178",
"text": "Our work summarizes previously reported results and contributes a new state-of-the-art in this area in terms of diversity and novelty."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-179",
"text": "The code for our model and training approach is made publicly available online to encourage further research."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-180",
"text": "To conclude, we believe that the standard MLE training has both benefits and drawbacks for Image Captioning and that much can be gained by combining it with additional optimization terms."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-181",
"text": "By adding an Image Retrieval learning signal, we introduced an additional dimension to our model's training: text-to-image understanding in addition to its original image-to-text objective."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-182",
"text": "We suggest further research into training approaches that incentivize multimodal models to build a more complete, bi-directional understanding of their modalities."
},
{
"sent_id": "e3b9c00d792bcddb6eea449179e61e-C001-183",
"text": "Additionally, we encourage further exploration of evaluation methods that assess additional desirable qualities in automatically generated captions."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"e3b9c00d792bcddb6eea449179e61e-C001-17"
],
[
"e3b9c00d792bcddb6eea449179e61e-C001-31"
],
[
"e3b9c00d792bcddb6eea449179e61e-C001-33"
],
[
"e3b9c00d792bcddb6eea449179e61e-C001-43"
]
],
"cite_sentences": [
"e3b9c00d792bcddb6eea449179e61e-C001-17",
"e3b9c00d792bcddb6eea449179e61e-C001-31",
"e3b9c00d792bcddb6eea449179e61e-C001-33",
"e3b9c00d792bcddb6eea449179e61e-C001-43"
]
},
"@USE@": {
"gold_contexts": [
[
"e3b9c00d792bcddb6eea449179e61e-C001-46"
],
[
"e3b9c00d792bcddb6eea449179e61e-C001-136"
]
],
"cite_sentences": [
"e3b9c00d792bcddb6eea449179e61e-C001-46",
"e3b9c00d792bcddb6eea449179e61e-C001-136"
]
},
"@DIF@": {
"gold_contexts": [
[
"e3b9c00d792bcddb6eea449179e61e-C001-64"
],
[
"e3b9c00d792bcddb6eea449179e61e-C001-127"
],
[
"e3b9c00d792bcddb6eea449179e61e-C001-168",
"e3b9c00d792bcddb6eea449179e61e-C001-169"
]
],
"cite_sentences": [
"e3b9c00d792bcddb6eea449179e61e-C001-64",
"e3b9c00d792bcddb6eea449179e61e-C001-127",
"e3b9c00d792bcddb6eea449179e61e-C001-168"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"e3b9c00d792bcddb6eea449179e61e-C001-121"
]
],
"cite_sentences": [
"e3b9c00d792bcddb6eea449179e61e-C001-121"
]
},
"@SIM@": {
"gold_contexts": [
[
"e3b9c00d792bcddb6eea449179e61e-C001-136"
]
],
"cite_sentences": [
"e3b9c00d792bcddb6eea449179e61e-C001-136"
]
}
}
},
"ABC_5596207b89d917db38c04af49c08aa_10": {
"x": [
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-2",
"text": "Attention models have been intensively studied to improve NLP tasks such as machine comprehension, via both question-aware passage attention models and self-matching attention models."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-3",
"text": "Our research proposes the phase conductor (PhaseCond), which advances attention models in two meaningful ways."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-4",
"text": "First, PhaseCond, an architecture of multi-layered attention models, consists of multiple phases each implementing a stack of attention layers producing passage representations and a stack of inner or outer fusion layers regulating the information flow."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-5",
"text": "Second, we extend and improve the dot-product attention function for PhaseCond by simultaneously encoding multiple question and passage embedding layers from different perspectives."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-6",
"text": "We demonstrate the effectiveness of our proposed model PhaseCond on the SQuAD dataset, showing that it significantly outperforms both state-of-the-art single-layered and multi-layered attention models."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-7",
"text": "We deepen our results with new findings via both detailed qualitative analysis and visualized examples showing the dynamic changes through multi-layered attention models."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-10",
"text": "Attention-based neural networks have demonstrated success in a wide range of NLP tasks, including neural machine translation, image captioning (Xu et al., 2015), and speech recognition (Chorowski et al., 2015)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-11",
"text": "Benefiting from the availability of large-scale benchmark datasets such as SQuAD (Rajpurkar et al., 2016), attention-based neural networks have spread to machine comprehension and question answering tasks, allowing the model to attend over past output vectors (Wang & Jiang, 2017; Seo et al., 2017; Xiong et al., 2017; Hu et al., 2017; Pan et al., 2017)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-12",
"text": "Wang & Jiang (2017) uses an attention mechanism in a Pointer Network to detect the answer boundary by predicting the start and end indices in the passage."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-13",
"text": "Seo et al. (2017) introduces a bi-directional attention flow network in which attention models are decoupled from the recurrent neural networks."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-14",
"text": "Xiong et al. (2017) employs a coattention mechanism that attends to the question and document together. RNET uses a gated attention network that includes both question-passage matching and self-matching attentions."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-15",
"text": "Both Pan et al. (2017) and Hu et al. (2017) employ a multi-hop or iterative aligner structure to repeatedly fuse the passage representation with the question representation as well as with the passage representation itself."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-16",
"text": "Inspired by the above-mentioned works, we propose PhaseCond, a general framework for the use of multiple attention layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-17",
"text": "There are two motivations."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-18",
"text": "First, previous research uses the self-attention model purely to capture long-distance dependencies (Vaswani et al., 2017), and therefore a multi-hop architecture (Hu et al., 2017; Pan et al., 2017) is used to alternately capture question-aware passage representations and refine the results with a self-attention model."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-19",
"text": "In contrast to the multi-hop and iterative architectures, our motivation for using the self-attention model in machine comprehension is to propagate answer evidence derived from the preceding question-passage representation layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-20",
"text": "This perspective leads to a different attention-based architecture containing two sequential phases: a question-aware passage representation phase and an evidence propagation phase."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-21",
"text": "Table 1 compares BiDAF (Seo et al., 2017), RNET, MReader (Hu et al., 2017), and PhaseCond (our proposed model)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-22",
"text": "Second, unlike domains such as machine translation, which jointly aligns and translates words, question-passage attention models for machine comprehension and question answering calculate an alignment matrix over all question and passage word pairs (Wang & Jiang, 2017; Seo et al., 2017)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-23",
"text": "Despite the attention models' success on the machine comprehension task, no prior work has explored learning to encode multiple representations of the question or passage from different perspectives for different parts of the attention function."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-24",
"text": "More specifically, most approaches use the same question representation U twice in the question-passage attention model \u03b1(H, U)U, where H is the passage representation."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-25",
"text": "Our hypothesis is that attention models can be more effective by learning different encoders that produce a question representation U and a question representation V capturing different aspects."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-26",
"text": "The key differences between our proposed model and competing approaches are summarized in Table 1."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-27",
"text": "Our contributions are threefold: 1) we propose a phase conductor for attention models containing multiple phases, each with a stack of attention layers producing passage representations and a stack of inner or outer fusion layers regulating the information flow, 2) we present an improved attention function for question-passage attention based on two kinds of encoders, an independent question encoder and a weight-sharing encoder that jointly encodes the question and the passage, in contrast to most previous works, which use only a single encoder per attention model, and 3) we provide both detailed qualitative analysis and visualized examples showing the dynamic changes through multi-layered attention models."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-28",
"text": "Experimental results show that our proposed PhaseCond leads to significant performance improvements over state-of-the-art single-layered and multi-layered attention models."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-29",
"text": "Moreover, we observe several meaningful trends: a) during the question-passage attention phase, repeatedly attending the passage with the same question representation \"forces\" each passage word to become increasingly closer to the original question representation, so increasing the number of layers risks degrading the network performance, and b) during the self-attention phase, the self-attention alignment weights of the second layer become noticeably \"sharper\" than those of the first layer, suggesting the importance of fully propagating evidence through the passage itself."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-30",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-31",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-32",
"text": "We propose the phase conductor model (PhaseCond), which consists of multiple phases; each phase has two parts, a stack of attention layers L and a stack of fusion layers F controlling information flow."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-33",
"text": "In our model, a fusion layer F can be an inner fusion layer F_inner inside a stack of attention layers, or an outer fusion layer F_outer immediately following a stack of attention layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-34",
"text": "Without loss of generality, PhaseCond's configurable computational path for two phases, a question-passage attention phase containing N question-passage attention layers L_Q and a self-attention phase containing K self-attention layers L_S, can be defined as follows. Figure 1 gives a concrete example of building a PhaseCond-based network for the machine comprehension task."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-35",
"text": "The network contains encoding layers, question-passage attention layers, self-attention layers and output layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-36",
"text": "The encoding layer maps various groups of features, such as character features and word features, to their corresponding embeddings."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-37",
"text": "These raw embeddings are then fed into an outer fusion layer to encode them as passage or question representations, as described in Section 2.1."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-38",
"text": "Next, the representations are sent to the question-passage attention layers, which align and represent the passage with the whole question representation, as described in Section 2.3."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-39",
"text": "The output of each layer is concatenated and regularized by a stack of fusion layers in Section 2.2.1."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-40",
"text": "After that, the question-attended passage representation is directly matched against itself, for the purpose of propagating information through the whole passage, as detailed in Section 2.3."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-41",
"text": "For each self-attention layer, we configure an inner fusion layer to obtain a gated representation that learns how much of the current output is fused with the input from the previous layer, as detailed in Section 2.3.1."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-42",
"text": "Finally, the fused vectors are sent to the output layer to predict the boundary of the answer span described in Section 2.4."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-44",
"text": "**ENCODER LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-45",
"text": "The concatenated raw input features are processed by fusion layers followed by encoder layers to form more abstract representations."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-46",
"text": "Here we choose a bi-directional Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) to obtain more abstract representations for words in passages and questions."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-47",
"text": "Unlike commonly used approaches in which every model has exactly one question and passage encoder (Seo et al., 2017; Hu et al., 2017), our encoder layers simultaneously calculate multiple question and passage representations to serve different parts of the attention functions in different phases."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-48",
"text": "We use two types of encoders, independent encoder and shared encoder."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-49",
"text": "For the independent encoder, a bi-directional LSTM is used to produce the new representation v"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-50",
"text": "where v_j^Q \u2208 R^{2d} are the concatenated hidden states of the independent BiLSTM for the j-th question word and d is the hidden size."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-51",
"text": "In terms of shared encoder, we jointly produce new representation h"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-52",
"text": "where h_i^P \u2208 R^{2d} and u_j^Q \u2208 R^{2d} are the concatenated BiLSTM hidden states for the i-th passage word and the j-th question word, sharing the same trainable BiLSTM parameters."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-54",
"text": "**QUESTION-PASSAGE ATTENTION LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-55",
"text": "The process of representing a passage with a question essentially includes two sub-tasks: 1) calculating the similarity between the question and different parts of the passage, and 2) representing the passage part with the given question depending on how similar they are."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-56",
"text": "A single question-passage attention layer is illustrated in Figure 2 ."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-57",
"text": "In this model, at the t-th layer an alignment matrix A^t \u2208 R^{n\u00d7m}, whose shape equals the number of words n in the passage by the number of words m in the question, is derived by aligning the passage representation at layer t\u22121 with the shared-weight question representation,"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-58",
"text": "where h_i^{t\u22121} is the i-th passage word representation at layer t\u22121, h_i^0 equals h_i^P calculated from Eq. 2, u_j^Q calculated from Eq. 3 is the same for all layers, and the alignment matrix element A^t(i, j) is a scalar denoting the similarity between the i-th passage word and the j-th question word, computed as the dot product of the passage word vector and the question word vector."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-59",
"text": "Given the alignment matrix elements as weights, we compute the new passage representation h_i^t for the t-th layer as a weighted average over all the independent question representations v^Q calculated from Eq. 1, as shown in the following."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-61",
"text": "**OUTER FUSION LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-62",
"text": "For each question-passage attention layer, its output h_i^t, where t \u2208 {1, \u2026, N}, is concatenated to form the final output vector representing the i-th passage word"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-63",
"text": "Increasing the number of layers N allows an increasingly more complex representation for a passage word."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-64",
"text": "In order to regulate the flow of the N question-passage attention layers and to prevent over-fitting, we use fusion layers, which are highway networks (Srivastava et al., 2015) with GRU-like gating units taking C_i^0 as input:"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-65",
"text": "where t \u2208 {1, \u2026, K}, K is the number of fusion layers, and W"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-67",
"text": "**SELF-ATTENTION LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-68",
"text": "Following the question-passage attention layers, self-attention layers propagate evidence through the passage context."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-69",
"text": "This process is similar in spirit to exploring similarity or redundancy between answer candidates (e.g., \"J.F.K\" and \"Kennedy\" can, in fact, be equivalent despite their different surface forms), which has been shown to be very effective during the answer merging stage (Ferrucci et al., 2010)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-70",
"text": "More generally, propagating evidence among the passage words allows correct answers to accumulate better evidence for the question than the rest of the passage."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-71",
"text": "For a single self-attention layer, we first compute a self-alignment matrix S^t \u2208 R^{n\u00d7n} by comparing the passage representation with itself, where h_i^{t\u22121} is the i-th passage word input to the t-th self-attention layer and the initial value h_i^0 is defined as the final fused result C_i^N from the question-passage attention model in Section 2.2.1."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-72",
"text": "Given the alignment matrix elements as weights, evidence is propagated from the previous layer to the next to produce the new passage representation h_i^t, using a weighted average over all passage representations at layer t\u22121:"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-73",
"text": "where h_k^{t\u22121} is the passage representation for the k-th word at the (t\u22121)-th self-attention layer, and B_i^t \u2208 R^{2Nd} is the output of the self-attention layer; it is sent to a fusion layer, described in Section 2.3.1, to obtain the t-th layer passage representation h_i^t."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-74",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-75",
"text": "**INNER FUSION LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-76",
"text": "To efficiently propagate evidence through the passage, we refine the self-attended representations by using multiple layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-77",
"text": "At the end of each self-attention layer, a GRU-like gating mechanism (Hu et al., 2017) is used to decide what information to store and send to the next self-attention layer, by merging the newly produced representation of the current layer with the input representation from the previous layer:"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-78",
"text": "where W denotes trainable parameters; the result is then sent to the next self-attention layer as input to calculate Eq. 9 and Eq. 10."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-79",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-80",
"text": "**OUTPUT LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-81",
"text": "We directly follow Hu et al. (2017) and use a memory-based answer pointer network to predict the boundary of the answer."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-82",
"text": "The memory-based answer pointer network contains multiple hops."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-83",
"text": "For the t-th hop, the pointer network produces the probability distributions of the start index p_s^t and the end index p_e^t, respectively, using a pointer network (Vinyals et al., 2015)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-84",
"text": "If the t-th hop is not the last hop, then the hidden states for the start and end indices are transformed and fed into the next-hop prediction."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-85",
"text": "The training loss is defined as the sum of the negative log probabilities of the last hop's start and end indices, averaged over all examples."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-87",
"text": "**EXPERIMENTS AND ANALYSIS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-88",
"text": "This paper focuses on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) to train and evaluate our model."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-89",
"text": "SQuAD, which has gained significant attention recently, is a large-scale dataset consisting of more than 100,000 questions manually created through crowdsourcing on 536 Wikipedia articles."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-90",
"text": "The dataset is randomly partitioned into a training set (80%), a development set (10%), and a blinded test set (10%)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-91",
"text": "Two metrics are used for evaluation: Exact Match (EM), which calculates the ratio of questions answered correctly by exact string match, and F1, which calculates the harmonic mean of precision and recall between predicted answers and ground-truth answers at the character level."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-93",
"text": "**TRAINING DETAILS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-94",
"text": "Our input for the encoding layer in Section 2.1 includes a list of commonly used features."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-95",
"text": "We use pre-trained GloVe 100-dimensional word vectors (Pennington et al., 2014), parts-of-speech tag features, named-entity tag features, and binary exact-match features, which indicate whether a passage word can be exactly matched to any question word and vice versa."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-96",
"text": "Following Hu et al. (2017) , we also use question type (what, how, who, when, which, where, why, be, and other) features (Zhang et al., 2017) where each type is represented by a trainable embedding."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-97",
"text": "We use a CNN with 100 one-dimensional filters of width 5 to encode character-level embeddings."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-98",
"text": "The hidden size is set to 128 for all the LSTM layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-99",
"text": "Dropout (Srivastava et al., 2014) is used for all the learnable parameters with a ratio of 0.2."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-100",
"text": "We use the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 0.0006, which is halved when a bad checkpoint is met."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-102",
"text": "**MAIN RESULTS OF MODEL COMPARISON**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-103",
"text": "We compare our proposed model PhaseCond with a multi-layered attention model, the Iterative Aligner, as well as various other recently published systems, including a single-layered model, BiDAF (Seo et al., 2017), and a single-layered model containing both question-passage attention and self-attention, RNET."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-104",
"text": "We first compare our proposed model PhaseCond with the Iterative Aligner, which is employed by two top-ranked systems, MEMEN (Pan et al., 2017) and MReader (Hu et al., 2017), on the SQuAD leaderboard."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-105",
"text": "Since our goal is to show the effectiveness of our proposed model PhaseCond, we use a baseline system implementing MReader for the direct comparison."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-106",
"text": "All the experiment settings are the same for PhaseCond and the Iterative Aligner, including the number of attention layers, input features, optimizer, learning rate, number of training steps, etc."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-107",
"text": "As shown in Table 2, which summarizes the performance of single models, we achieve steady improvements when 1) additional question encoders are used to extend the passage-question attention function (denoted QPAtt+), as detailed in Section 2.1 and Section 2.2, and 2) on top of that, PhaseCond is used, making our model better than the Iterative Aligner."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-108",
"text": "Specifically, PhaseCond's computational path for two question-aware passage attention layers"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-109",
"text": "On the other hand, the Iterative Aligner builds its path in turn through different kinds of attention layers. (Table caption: The performance of our models and published results of competing attention-based architectures.)"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-110",
"text": "To perform a fair comparison as much as possible, we collect the results of BiDAF (Seo et al., 2017) and RNET from their recently published papers instead of using the up-to-date performance scores posted on the SQuAD Leaderboard."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-111",
"text": "Our directly available baseline is one implementation of MReader, re-named the Iterative Aligner, which has very similar results to those of MReader (Hu et al., 2017). As shown in Table 3, in the single model setting, our model PhaseCond is clearly more effective than all the single-layered models (BiDAF and RNET) and multi-layered models (MReader and Iterative Aligner)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-112",
"text": "We draw the same conclusion in the ensemble model setting, although RNET works better on the Dev EM measure."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-113",
"text": "The EM result of our baseline Iterative Aligner is also lower than RNET's, confirming that the gap is not caused by our proposed model."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-114",
"text": "Our explanation is that 1) RNET uses a different feature set (e.g., GloVe 300-dimensional word vectors) and different encoding steps (e.g., three GRU layers for encoding question and passage representations), and 2) RNET uses a different ensemble method from our implementation."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-115",
"text": "Table 4 shows the performance with different numbers of layers for both the question-passage attention phase and the self-attention phase."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-116",
"text": "We change the layer number separately to compare the performance."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-117",
"text": "For the question-passage attention phase, using a single layer doesn't degrade performance significantly compared with the default setting of two layers, a different conclusion from Hu et al. (2017) and Xiong et al. (2017)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-118",
"text": "Intuitively, this is largely expected because representing the passage repeatedly with the same question doesn't continually add more information."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-119",
"text": "In contrast, multiple stacked layers are needed to allow the evidence to fully propagate through the passage."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-120",
"text": "This is exactly what we observed in the two-layer stacked self-attention phase."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-122",
"text": "**ANALYSIS ON ATTENTION LAYERS**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-123",
"text": "In Figure 3 , we visualize the attention matrices for each layer to show dynamic attention changes."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-124",
"text": "The model is based on the main setting which has two question-passage layers and two self-attention layers."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-125",
"text": "We observed several critical trends."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-126",
"text": "First, the first layer of the question-passage attention phase can successfully align question keywords with the corresponding passage keywords, as shown in Figure 3a ."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-127",
"text": "For example, the question keyword \"represented\" has been successfully aligned with the related passage keywords \"champion\", \"defeated\", and \"earned\"."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-128",
"text": "Second, patterns of striped color in Figure 3a indicate similar weights among all the passage words, meaning that passage words become indistinguishable from one another, and therefore adding another layer of the question-passage attention model degrades the alignment quality dramatically."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-129",
"text": "This observation is meaningful. (Figure 3c shows the first layer of self-attention.)"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-130",
"text": "Generally, the darker the color, the higher the weight (the only exception is Figure 3b, which contains negative values)."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-131",
"text": "Given the question \"Which NFL team represented the AFC at Super Bowl 50?\", the system correctly detects the answer \"Denver Broncos\" from the passage part \"The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title.\""
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-132",
"text": "shows that repeatedly representing a passage word regarding the same question representation can make the passage embedding become closer to the original question representation."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-133",
"text": "Third, when comparing Figure 3c and Figure 3d , we observed that the color is diluted for most of the weights in the second layer of self-attention phase, meanwhile a small portion of weights is strengthened, suggesting that information propagation is converging."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-134",
"text": "For example, in Figure 3d as the last attention layer, the phrase \"Denver Broncos\" becomes more concentrated on the phrase \"Carolina Panthers\" than that of Figure 3c ."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-135",
"text": "In contrast, \"Denver Broncos\" becomes less focused on the other keywords (e.g., \"champion\" and \"title\") of the same passage."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-137",
"text": "**CONCLUSION**"
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-138",
"text": "In this paper, we introduce a general framework PhaseCond, on multi-layered attention models with two phases including a question-aware passage representation phase and an evidence propagation phase."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-139",
"text": "The question-aware passage representation phase has a stack of question-aware passage attention models, followed by outer fusion layers that regularize concatenated passage representations."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-140",
"text": "The evidence propagation phase has a stack of self-attention layers, each of which is followed by inner fusion layers that control the information to propagate and output."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-141",
"text": "Also, an improved attention mechanism for PhaseCond is proposed based on a popular dot-product attention function by simultaneously encoding both the independent question embedding layers, the weight-sharing question embedding layer and weight-sharing passage embedding layer."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-142",
"text": "The experimental results show that our model significantly outperforms single-layered or multiple-layered attention networks on blinded test data of SQuAD."
},
{
"sent_id": "5596207b89d917db38c04af49c08aa-C001-143",
"text": "Moreover, our in-depth quantitative analysis and visualizations provide meaningful findings for both question-aware passage attention mechanism and self-matching attention mechanism."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"5596207b89d917db38c04af49c08aa-C001-11"
]
],
"cite_sentences": [
"5596207b89d917db38c04af49c08aa-C001-11"
]
},
"@DIF@": {
"gold_contexts": [
[
"5596207b89d917db38c04af49c08aa-C001-47"
]
],
"cite_sentences": [
"5596207b89d917db38c04af49c08aa-C001-47"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"5596207b89d917db38c04af49c08aa-C001-104"
]
],
"cite_sentences": [
"5596207b89d917db38c04af49c08aa-C001-104"
]
},
"@SIM@": {
"gold_contexts": [
[
"5596207b89d917db38c04af49c08aa-C001-111"
]
],
"cite_sentences": [
"5596207b89d917db38c04af49c08aa-C001-111"
]
}
}
},
"ABC_9e0a44722390d0508fbe56785701e6_10": {
"x": [
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-26",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-27",
"text": "**RELATED WORK**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-188",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-189",
"text": "**USING SATELLITE NODES**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-2",
"text": "Knowledge-based question answering relies on the availability of facts, the majority of which cannot be found in structured sources (e.g. Wikipedia info-boxes, Wikidata)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-3",
"text": "One of the major components of extracting facts from unstructured text is Relation Extraction (RE)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-4",
"text": "In this paper we propose a novel method for creating distant (weak) supervision labels for training a large-scale RE system."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-5",
"text": "We also provide new evidence about the effectiveness of neural network approaches by decoupling the model architecture from the feature design of a state-of-the-art neural network system."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-6",
"text": "Surprisingly, a much simpler classifier trained on similar features performs on par with the highly complex neural network system (at 75x reduction to the training time), suggesting that the features are a bigger contributor to the final performance."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-9",
"text": "Knowledge-based question answering relies on the availability of facts -usually in the form of triples, stored in large-scale knowledge bases (KBs) e.g. Freebase (Bollacker et al., 2008) , DBPedia (Auer et al., 2007) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-10",
"text": "There are two main sources of facts for such a KB: structured data (e.g. Wikipedia info-boxes, Wikidata) or unstructured text."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-11",
"text": "Undeniably, the former type of knowledge extraction is very accurate and has been the main source of knowledge behind the major industrial knowledge bases."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-12",
"text": "However, the facts extracted from structured sources cover a limited set of high-importance relations, leaving a large number of them implicitly (or explicitly) mentioned in unstructured text (McCallum, 2005) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-13",
"text": "In order to ground the following presentation, we will present a typical problem from the factual knowledge extraction domain with the following unstructured text from a Wikipedia page:"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-14",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-15",
"text": "**\"CARRIE**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-16",
"text": "Fisher wrote several semiautobiographical novels, including Postcards from the Edge.\""
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-17",
"text": "The purpose of a fact extraction system is to extract the following facts of the form of predicate (subject, object): instance of (postcards from the edge, novel), and author of (postcards from the edge, carrie fisher), where the first part is a relation, and the other parts are the left and right entities participating in that relation."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-18",
"text": "Typically three tasks are involved in generating facts: Entity Recognition, Entity Resolution (or Entity Linking), and Relation Extraction (RE)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-19",
"text": "Entity Recognition and Resolution deal with the task of translating surface strings to KB entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-20",
"text": "This includes nominal or pronominal coreference resolution: we should be able to extract the same entity even if the text stated that 'Fisher wrote. . . ' (instead of resolving e.g. to Bobby Fisher) or 'She wrote. . . ' (provided that Carrie Fisher's name was mentioned in a previous sentence)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-21",
"text": "Relation Extraction extracts relation triples (or facts) involving those entities with appropriate relations (also part of the KB schema)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-22",
"text": "Each of these components could be built and operated in isolation, but they affect the performance of each other."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-23",
"text": "In this paper, we examine the task of RE focusing on extracting knowledge to enrich a large-scale KB (\u223cbillions of facts)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-24",
"text": "We consider a state-of-the-art model that has been applied to hyponymy detection and present a thorough analysis of its application to datasets derived from Wikidata and Alexa KB, a proprietary large-scale triple KB that powers Amazon's Alexa."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-25",
"text": "We also present a new way of generating distant supervision for relation extraction with a simple yet effective way of reducing the noise for the entity resolution."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-28",
"text": "Relation Extraction is the NLP task of extracting structured semantic relations between entities from natural (unstructured) text."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-29",
"text": "Formally, it can be defined as identifying semantic relations between (resolved) entities and normalise these relations by mapping them to a predefined KB schema."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-30",
"text": "In the NLP community, the RE task evolved out of the Information Extraction projects like MUC in the 1990s (see Chinchor et al. (1993) for an overview) and ACE in the 2000s (Doddington et al., 2004) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-31",
"text": "In both projects the main focus was the automatic extraction of events rather than relations (the main difference being that an event is a special type of fact that involves actor entities and occurs at a specific time point) in a limited set of domains (e.g. bombings, company mergers, etc.)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-32",
"text": "This meant that in both projects the number of relations marked for extraction was very limited (3 relations in MUC and 24 in ACE with 7k relation instances for 40k entity mentions)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-33",
"text": "Starting with those projects, the task of RE was thought of as a pipeline, where the entities were first detected, resolved to a standard schema, and then the RE system would determine which of the possible relations was expressed (if any) between any given pair of entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-34",
"text": "Much of the earlier work explored a variety of different features, such as syntactic phrase chunking and constituency parsing (Bunescu and Mooney, 2005; Jiang and Zhai, 2007; Qian et al., 2008) , and semantic knowledge like WordNet (Zhou et al., 2005) , although Jiang and Zhai (2007) showed that the more complex features might actually hurt the performance of an SVM-based RE system."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-35",
"text": "The work of Shwartz et al. (2016) , that we closely follow, is also using both semantic and syntactic features, by combining the dependency paths between entities, with word embedding representations of both the entities and the lemmas in the dependency paths."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-36",
"text": "Another related area is relation extraction for Open Information Extraction (OpenIE)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-37",
"text": "Some of the more representative projects in the area, like Reverb (Fader et al., 2011) and more recently ClauseIE (Del Corro and Gemulla, 2013) use syntactic information (PoS tagging / chunking, and dependency parsing respectively) to extract entity and relation phrases."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-38",
"text": "However, unlike OpenIE, we are interested in normalized entities and relations (i.e. that map to a knowledge base)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-39",
"text": "In this work, we follow a common way of producing training examples for RE is to use distant supervision (Craven et al., 1999; Mintz et al., 2009 ): the assumption is that if any sentence mentions two entities which we know (from a KB) participate in a specific relation, that sentence must be evidence for that relation."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-40",
"text": "In the area of distant supervision, there are two relevant research directions."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-41",
"text": "The first is to use it for directly enriching KBs from unstructured text, as well as leverage the KBs to generate the distant supervision labels Parikh et al., 2015) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-42",
"text": "The second direction attempts to reduce the noise in distant supervision labels."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-43",
"text": "A first line of approaches, starting with Data Programming (Ratner et al., 2016) , uses generative models to combine multiple sources of weak supervision (e.g. automatically extracted from a KB, rules generated by experts etc.) in order to predict disagreements and overlaps between them and create a noise-aware posterior distribution of predictions."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-44",
"text": "An extension of this approach is Socratic Learning (Varma et al., 2017) which uses the differences in the predictions of the generative model and the main classification system to discover discriminating features and add them back to the generative model."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-45",
"text": "As these approaches require multiple sources of weak supervision, we examine another line of projects which works by aggregating the support sentences 1 for each entity pair (Riedel et al., 2010; Hoffmann et al., 2011) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-115",
"text": "**USING THE FASTTEXT MODEL**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-46",
"text": "This is the approach that Shwartz et al. (2016) and the current work follow."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-47",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-48",
"text": "**HYPENET**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-49",
"text": "A recent paper (Shwartz et al., 2016) proposed HypeNET, a new method for RE that integrated dependency path information with distributional semantic vector representation of the entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-50",
"text": "The authors applied this method to extract hyponyms (i.e. instance of relations) and also made a new version of their system publicly available."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-51",
"text": "2 The training examples used (entity/relation triples) come from a number of sources like WordNet (Miller, 1995) , Yago (Hoffart et al., 2013) , DBPedia and Wikidata, and the source of the linguistic features (part-of-speech tags, dependency paths, noun phrases) was the 2015 dump of Wikipedia, processed using the spaCy system 3 ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-52",
"text": "Their proposed system achieved by far the best results on their dataset."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-53",
"text": "Since instance of is one of the most often used relations (most of the uses are implicit, during inference), we decided to investigate HypeNET as the base of our RE system."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-54",
"text": "The training examples used by the authors of HypeNET consisted of facts about only one relation."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-55",
"text": "We wanted to build a system that works on multiple relations at a very large scale."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-56",
"text": "Hence, in this work we use two different dataset sources: Wikidata, a publicly-available large-scale KB to aid reproducibility, as well as the larger Alexa KB, built by combining a hand-curated ontology with publicly available data from Wikidata, Wikipedia, Freebase, DBPedia, and other sources."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-58",
"text": "**DISTANT SUPERVISION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-59",
"text": "Following the technique presented in Mintz et al. (2009) , and the implementation in HypeNET, we needed to generate training examples where entities X and Y are connected by a relation in the KB and also appear together in the same sentence."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-60",
"text": "When we applied the distant supervision technique presented in HypeNET to our datasets (both Wikidata and Alexa KB) we got poor annotations (see Figure 4 (top) for some examples from Alexa KB and section 6.1."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-61",
"text": "for evaluation on both datasets)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-62",
"text": "This could be attributed to the large volume of entities and their corresponding denotations in the KBs, which resulted in a number of ambiguous situations."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-63",
"text": "For instance, \"Chicago\" could denote both the city and the broadway musical show."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-64",
"text": "In the following section we present our new technique of filtering denotations used for Entity Resolution."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-65",
"text": "This method allows our RE system to scale much better than the original method."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-66",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-67",
"text": "**PAGE-SPECIFIC GAZETTEERS**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-68",
"text": "We created a new type of entity gazetteer, based on the main entity of a Wikipedia page, and the knowledge about that entity we have in the KB."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-116",
"text": "Joulin et al. (2016) recently introduced fastText: a very efficient classifier composed of a simple linear model with a rank constraint."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-69",
"text": "The new system, presented on the top dashed box in Figure 1 , starts with a Wikipedia URL, retrieves its corresponding ID from the KB for that URL (the main entity), and then extracts entities that are connected directly to the main entity (one-hop distance in the KB graph), by going through all the relations the main entity is involved in (except those involving string literals) and returning the entities on the other side of those relations."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-70",
"text": "For each of the related entities, we collect its denotational strings into a purpose-built gazetteer."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-71",
"text": "Figure 2 shows an example KB subgraph for a target entity (in this case George Springate); it contains all the entities immediately connected to it with relations such as graduate of or instance of."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-72",
"text": "Also appearing in the graph, are the denotation strings for each one of the related entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-73",
"text": "Note that this approach will reduce the number of extracted entities compared to the original method, but will dramatically improve both the coverage for non-NP entities and precision of entity resolution."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-74",
"text": "One way to increase the recall of this system would be to consider entities with a distance of >1 (entities related to entities related to the main entity)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-75",
"text": "Figure 4 (bottom) shows results obtained by performing entity resolution using page-specific gazetteers."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-76",
"text": "Those examples, as well as the results in section 6.1. show that the noise in the data is significantly reduced."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-77",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-78",
"text": "**ANNOTATION PIPELINE**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-79",
"text": "The bottom half of Figure 1 presents the distant supervision generation process adapted from Shwartz et al. (2016) to work with our data."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-80",
"text": "In the original work, the text is processed to split and tokenise the sentences, tag the parts of speech and separate the noun phrases (NPs) -these are the candidate entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-81",
"text": "They then construct the dependency path between each possible pair of entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-82",
"text": "Each noun/NP pair is checked against the KB for distant supervision."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-83",
"text": "keep only the entities/paths that appear in the list of labelled examples."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-84",
"text": "They also filter out entity pairs that have infrequent paths (occurring fewer than five times), and pairs whose path is more than five tokens long."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-85",
"text": "However, as discussed in the beginning of this section, this approach introduces a lot of noise."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-86",
"text": "To avoid this problem, we use the pagespecific gazetteer and a greedy string matching system to scan through the unstructured text and assign KB IDs to the longest-matching substring in a sentence."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-87",
"text": "The final step was to generate the annotation labels themselves."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-88",
"text": "To do that, we examine each possible pair of entities to see if they participate in the target relation."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-89",
"text": "For the Wikidata KB, we simply checked whether the target relation existed as a property in the data."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-90",
"text": "Considering the large size of Alexa KB, database lookup operations could be very expensive."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-91",
"text": "In order to speed up the lookup for each X rel Y triple, we used two methods."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-92",
"text": "First, before checking against the KB though, we ensure that the pair conforms with the class signature of the relation ('Ontological Constraints')."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-93",
"text": "For example, only a geographical location can be the left entity in the birthplace of relation."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-94",
"text": "Second, instead of relying on database queries, we used Bloom filters (Bloom, 1970 ) -a memory efficient probabilistic data structure that can be used to test if an entity is a member of a set."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-95",
"text": "The compression value of a Bloom filter is governed by the accepted false positive rate."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-96",
"text": "We set the false positive rate to 0.001 for our experiments."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-97",
"text": "Since for any given pair of entities it is much more likely that they are not going to be related, we only keep a small fraction of the negative instances."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-98",
"text": "Following Shwartz et al. (2016) , we use a 4:1 negative to positive ratio."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-99",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-100",
"text": "**ISOLATING HYPENET FEATURES**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-101",
"text": "To discover the effectiveness of the approach of Shwartz et al. (2016) , we wanted to separate HypeNET's neural architecture from its input features and use those features with different (and simpler) classifiers."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-102",
"text": "HypeNET's main advantage is that it integrated dependency path features with distributional information about the word lemmas along the path and left and right entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-103",
"text": "As our goal was to generate discrete features to be used with more traditional classifiers, we opted for using Brown clusters (Brown et al., 1992) instead of the 50-dimensional GloVe vectors (Pennington et al., 2014) used by Shwartz et al. (2016) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-104",
"text": "The Brown clusters were pre-trained on the Reuters Corpus Vol."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-105",
"text": "1 (Lewis et al., 2004 ) using 3,200 clusters."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-106",
"text": "After evaluating different feature configurations (see section 6.7.), the resulting features were as follows: for each entity pair and for each support, we extracted the dependency path between them and concatenated the lemma, 4-bit prefix of Brown cluster of the lemma, part of speech, dependency relation, and path direction information; to that we added the strings and 4-bit Brown cluster prefix of the left and right entities."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-107",
"text": "The features from different supports were concatenated into one feature list."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-108",
"text": "For example, given the following sentences containing the entity pair carrie fisher, star wars: \"In 1977, Fisher starred in George Lucas' film Star Wars\", and \"Fisher became known for playing Princess Leia in the Star Wars film series\"."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-109",
"text": "The following is the full list of discrete features extracted, where each space-separated token is a distinct feature, and X and Y are used to replace the left and right entities:"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-111",
"text": "**USING A MAXENT CLASSIFIER**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-112",
"text": "In the first set of experiments, we used a standard Maximum Entropy classifier from MALLET toolkit (McCallum, 2002 ) with the discrete features described above."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-113",
"text": "The parameters and settings were kept to their defaults (LBFGS optimizer, with a Gaussian prior variance of 1)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-117",
"text": "The architecture of the system is very similar to that of Mikolov et al. (2013) except that instead of predicting the middle word in a window, the classifier is predicting a label."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-118",
"text": "For fastText, the input features are token ngrams which are embedded into a single hidden value and fed into a hierarchical softmax classifier."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-119",
"text": "For our experiments, we used fastText's default settings, except for the number of ngrams, which we set to 4."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-121",
"text": "**USING THE HYPENET MODEL**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-122",
"text": "The original version of HypeNET (Figure 3 ) combines the dependency path-based features with the distributional information in its neural net architecture: for each entity pair, each support (dependency path) token is encoded by a set of embedding layers -one for each linguistic componentand passed into an LSTM layer."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-123",
"text": "The LSTM layers for the whole path are merged by an average pooling layer and the distributional representation of the entities (via embedding layers) is added."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-124",
"text": "Finally, a softmax layer makes a binary classification decision."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-125",
"text": "We implemented our own version of HypeNET code using Keras (Chollet, 2015) and optimized the learning objective using the Adam optimizer."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-126",
"text": "We modified the basic HypeNET model by making the following changes: i) we allowed the training of word embeddings for lemmas (after initializing them with GloVe embeddings), ii) we replaced the uni-directional LSTM with bi-directional LSTM 4 ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-128",
"text": "**EVALUATION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-129",
"text": "We want to examine a varied set of connections between the left and right entities, so in addition to the instance of relation (P31 in Wikidata) that connects objects to classes, we will examine birthplace of (P19) that connects a location entity to a person entity, and part of (P527) which links objects to their meronyms."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-130",
"text": "When evaluating against Alexa His studies were interrupted by army service and at the end of the war he was forced to return. . ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-132",
"text": "**INSTANCE OF (THE SECOND WORLD WAR, CAUSE OF DEATH)**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-133",
"text": "In the intro to the song, Fred Durst makes reference to. . ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-134",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-135",
"text": "**INSTANCE OF (INTRO 15367, SONG)**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-136",
"text": "Turner also released one album and several singles under the moniker Repeat."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-137",
"text": "instance of (the singles the 2011 album, album)"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-138",
"text": "Call Your Girlfriend was written by Robyn, Alexander Kronlund and Klas\u00c5hlund, with the latter producing the song."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-139",
"text": "instance of (call your girlfriend 3, song)"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-140",
"text": "Forget Her is a song by Jeff Buckley."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-141",
"text": "instance of (forget her, song)"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-142",
"text": "The Subei Mongol Autonomous County is an autonomous county within the prefecture-level city of Jiuquan in the northwestern Chinese province of Gansu."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-143",
"text": "instance of (subei mongol autonomous county, chinese county)"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-144",
"text": "Figure 4: Entity resolution results for the distant supervision training data using Alexa KB and the original pre-processing system of Shwartz et al. (2016) (top), and the new page-specific gazetteers (bottom)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-145",
"text": "The matched strings in the original sentences are highlighted."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-146",
"text": "KB, we replaced part of with applies to, a relation that links an attribute to an object and has no correspondence in Wikidata."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-147",
"text": "We will use the Wikidata KB as a first source of evaluation, and switch to the Alexa KB for a more in depth exploration."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-148",
"text": "We evaluate all models on a sample of 50K examples for training, 10K examples for validation and test respectively for all relations (except part of for which we could only collect 22K training examples)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-149",
"text": "Each example is the collection of all the sentences supporting a X rel Y triple that have been annotated by the distant supervision system of section 3.. We examine the effect of grouping supports in section 6.4.."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-151",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-152",
"text": "We ran each of the following experiments three times (with random initialization) to obtain a measure of variance for their results."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-153",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-154",
"text": "**DISTANT SUPERVISION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-155",
"text": "The goal of the method presented in section 3.1. was to reduce the number of false positives at the cost of introducing some amount of false negatives (due to missing entities, missing denotations, or missing KB facts)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-156",
"text": "In order to quantify the effect of the new method, we manually annotated 1,000 instance of distant supervision examples produced by our new method and the original method used by Shwartz et al. (2016) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-157",
"text": "The original method yielded 67% false positive and 3% false negative examples; the pagespecific gazetteer solution returned only 1% false positives and 39% false negatives."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-158",
"text": "After more analysis, 62% of the false negatives (or 24% of the total examples) were cases were the KB contained the subclass of relation, which we consider a separate relation (although in the data collected by Shwartz et al. (2016) from Yago and Wikidata it is conflated with instance of)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-159",
"text": "The results are similar when using the Wikidata KB: around 1% false positives but only 5% false negatives 5 of which 89% were cases where the KB contained similar relations (like occupation for people, or taxon for species)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-160",
"text": "Figure 4 presents a qualitative comparison of the two methods on our KB."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-161",
"text": "We can see that the two problems of spuri- ous entity matching (e.g. \"end\" to cause of death) and non-standard noun-phrase entities (e.g. \"call your girlfriend\") have been successfully addressed by the pagespecific gazetteer."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-163",
"text": "**MODEL COMPARISON**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-164",
"text": "The results comparing the performance and generalizability of the models (over the three relations) are shown in Table 1 ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-165",
"text": "The main takeaway from these results is that the more advanced architecture of HypeNET does not offer a significant advantage over that of fastText when used with (almost) the same input features."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-166",
"text": "As an added benefit, the fastText classifier is dramatically faster than the HypeNET model, with a reduction of training time from around 75 minutes to less than a minute."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-167",
"text": "However as the results of the MaxEnt model show, the features alone are not enough."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-168",
"text": "It is fastText's (and HypeNET's) ability to create higherdimensional representations of these discrete features that provide the best results."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-170",
"text": "**TRAINING DATA SIZE**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-171",
"text": "Another parameter we wanted to explore was the impact of size of the training data since we plan to target relations with fewer training examples in the future."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-172",
"text": "We evaluate the F-scores of the HypeNET, fastText, and MaxEnt models for the instance of relation on the Alexa KB dataset."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-173",
"text": "The results are shown in Figure 5 ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-174",
"text": "Note that the numbers in the figure refer to entity pairs, not individual supports (sentences)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-175",
"text": "As expected, the performance of all systems keeps increasing when more training examples are used, but there are two interesting observations to be made."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-176",
"text": "The first is the relative variance of the fastText versus the HypeNET model, especially for the case of fewer than 25k examples."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-177",
"text": "The second is that even with 1,000 training examples, the Fscore of both HypeNET and fastText models is above 90%."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-179",
"text": "**GROUPING SUPPORTS**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-180",
"text": "We also wanted to investigate the effect of grouping the supports (sentences) for each entity pair."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-181",
"text": "As mentioned earlier (Section 2.), this had been proposed as a method Table 3 : Effect of using dependency path satellite nodes for each X rel Y triple using the fastText classifier on the Alexa KB data (threshold of 0.5)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-182",
"text": "to reduce noise in the distant supervision labels (Hoffmann et al., 2011) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-183",
"text": "In Shwartz et al. (2016) , the grouping was performed by the mean pooling layer; in the case of the fastText-based system, we simply concatenate the feature tokens from all the supports and feed them into the single hidden layer."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-184",
"text": "For each of the three relations, we ran the fastText model with and without grouping each entity pair's supports, using exactly the same features in both cases."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-185",
"text": "The Table 2 presents the results."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-186",
"text": "Interestingly, the effect on instance of is much smaller than on the other two relations."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-187",
"text": "One possible explanation could be that the page-specific gazetteer method is producing fewer false positives for that relation; more likely, the supports for birthplace of and applies to are more diverse than those of instance of, making their grouping more useful to the classifier."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-190",
"text": "We also looked at the role of the dependency path satellite nodes (words to the left and right of the entities)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-191",
"text": "This type of features has also been adopted by various systems including Mintz et al. (2009) and Shwartz et al. (2016) , and we wanted to establish a basis for its effectiveness across multiple relations."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-192",
"text": "The results, shown in Table 4 : Effect of using all the supports for each X rel Y triple using the fastText classifier on the Alexa KB data (threshold of 0.5)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-193",
"text": "Table 5 : F-score results for the three relations on the Alexa KB dataset."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-194",
"text": "The baseline system (1) is the fastText classifier using the 5 most frequent supports for each X rel Y triple, (1)-dep refers to the system with both the dependency relation and direction features removed, the last system uses all the (lowercased) words in each support as features."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-195",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-196",
"text": "**USING ALL SUPPORTS**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-197",
"text": "A comparison is made for the fastText models trained using the 5 most frequent supports for each triple with the ones trained using all available supports."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-198",
"text": "As shown in Table 4 , reducing the supports to the most frequent ones slightly increases the performance (except in the case of birthplace of) even though on average more than 18K training examples, and more than 3K test examples contain more than 5 supports."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-199",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-200",
"text": "**FEATURE ABLATION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-201",
"text": "As a final step in our exploration, we wanted to measure the impact of each of the features used by the system of Shwartz et al. (2016) ."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-202",
"text": "Table 5 presents the feature ablation results on the Alexa KB data using the fastText classifier."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-203",
"text": "We compare the full set of features presented in section 4."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-204",
"text": "against feature sets without the Brown clusters, word lemmas, POS tags, dependency information, and the X and Y entities (and their Brown cluster)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-205",
"text": "We also show the results of the system using only the X and Y entities and just the words of the supporting sentences (without extracting the dependency path between entities)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-206",
"text": "The main takeaway is that for the instance of and applies to relations, the structure induced by the dependency parser is critical for the system's performance."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-207",
"text": "One explanation is that these relations are not always lexically defined (sometimes expressed with just the verb 'to be' across long subordinate clauses)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-208",
"text": "For the birthplace of relation, the system using the full sentence is on par with the best dependencysupported version suggesting that there are strong lexical cues that signify them (like 'born in', or just the presence of a city name)."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-210",
"text": "**CONCLUSION**"
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-211",
"text": "In this paper, we have explored the feature design and network architecture of the HypeNET RE system, and presented a new mechanism for extracting distant supervision data based on our large-scale KB."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-212",
"text": "We found that by replacing HypeNET network architecture with a simple fastText model similar performance is achieved."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-213",
"text": "The main difference between these two architectures is the mechanism of producing the high-dimensional representations: in HypeNET, LSTMs are used, which maintain dependency over longer contexts of dynamic length; in fastText, the window size for the ngrams is fixed."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-214",
"text": "From our experiments, we can infer that dynamic-length context modelling did not bring any gains."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-215",
"text": "Furthermore, we evaluated the effect of grouping of supports and satellites nodes features for various relations."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-216",
"text": "The results from these experiments provide a solid ground to build RE systems for more relations."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-217",
"text": "There are obvious extensions to the current approach, such as using a more sophisticated method for grouping the supports (e.g. an ensemble-based method) and we investigate these in future work."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-218",
"text": "Beyond architecture improvements, there are two main focus areas for the immediate future: generalising the system to cover very large number (\u223c1k) of relations, and reducing the sources of noise."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-219",
"text": "The former should be relatively straightforward given the existing architecture for extracting the dataset and training the system."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-220",
"text": "The main obstacle will be to combine the results of the multiple RE systems (one for each relation group) into a single classifier."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-221",
"text": "Finally, distant supervision as a method itself introduces some errors since not all sentences that mention both entities of a fact express that fact (e.g. the relation is the director of between steven spielberg and saving private ryan is not expressed in the sentence \"The level of violence in Saving Private Ryan makes sense because Spielberg is trying to show . . . \")."
},
{
"sent_id": "9e0a44722390d0508fbe56785701e6-C001-222",
"text": "Going further, we would like to expand the manual annotations to the training/validation sets to assist or replace the distant supervision."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-35"
],
[
"9e0a44722390d0508fbe56785701e6-C001-46"
],
[
"9e0a44722390d0508fbe56785701e6-C001-98"
],
[
"9e0a44722390d0508fbe56785701e6-C001-190",
"9e0a44722390d0508fbe56785701e6-C001-191"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-35",
"9e0a44722390d0508fbe56785701e6-C001-46",
"9e0a44722390d0508fbe56785701e6-C001-98",
"9e0a44722390d0508fbe56785701e6-C001-191"
]
},
"@SIM@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-35"
],
[
"9e0a44722390d0508fbe56785701e6-C001-190",
"9e0a44722390d0508fbe56785701e6-C001-191"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-35",
"9e0a44722390d0508fbe56785701e6-C001-191"
]
},
"@BACK@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-49"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-49"
]
},
"@EXT@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-79"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-79"
]
},
"@MOT@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-101"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-101"
]
},
"@DIF@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-103"
],
[
"9e0a44722390d0508fbe56785701e6-C001-183"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-103",
"9e0a44722390d0508fbe56785701e6-C001-183"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"9e0a44722390d0508fbe56785701e6-C001-156"
],
[
"9e0a44722390d0508fbe56785701e6-C001-158"
],
[
"9e0a44722390d0508fbe56785701e6-C001-201"
]
],
"cite_sentences": [
"9e0a44722390d0508fbe56785701e6-C001-156",
"9e0a44722390d0508fbe56785701e6-C001-158",
"9e0a44722390d0508fbe56785701e6-C001-201"
]
}
}
},
"ABC_f8fc3634684ff37ab3d29cee910443_10": {
"x": [
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-90",
"text": "**CONSTRUCTING A LEXEME HIERARCHY GRAPH**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-2",
"text": "\"Grounded\" language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-3",
"text": "B\u00f6rschinger et al. (2011) introduced an approach to grounded language learning based on unsupervised PCFG induction."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-4",
"text": "Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-5",
"text": "However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (2011) ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-6",
"text": "This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-7",
"text": "Experimental results on the navigation task demonstrates the effectiveness of our approach."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-10",
"text": "The ultimate goal of \"grounded\" language learning is to develop computational systems that can acquire language more like a human child."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-11",
"text": "Given only supervision in the form of sentences paired with relevant but ambiguous perceptual contexts, a system should learn to interpret and/or generate language describing situations and events in the world."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-12",
"text": "For example, systems have learned to commentate simulated robot soccer games by learning from sample sportscasts (Chen and Mooney, 2008; Liang et al., 2009; B\u00f6rschinger et al., 2011) , or understand navigation instructions by learning from action traces produced when following the directions (Chen and Mooney, 2011; Tellex et al., 2011) ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-13",
"text": "B\u00f6rschinger et al. (2011) recently introduced an approach to grounded language learning using unsupervised induction of probabilistic context free grammars (PCFGs) to learn from ambiguous contextual supervision."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-14",
"text": "Their approach first constructs a large set of production rules from sentences paired with descriptions of their ambiguous context, and then trains the parameters of this grammar using EM."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-15",
"text": "Parsing a novel sentence with this grammar gives a parse tree which contains the formal meaning representation (MR) for this sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-16",
"text": "This approach works quite well on the sportscasting task originally introduced by Chen and Mooney (2008) ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-17",
"text": "In this task, each sentence in a natural-language commentary describing activity in a simulated robot soccer game is paired with the small set of actions observed within the past 5 seconds, one of which is usually described by the sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-18",
"text": "Even with this low level of ambiguity in a constrained domain, their method constructs a PCFG with about 33,000 productions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-19",
"text": "More fundamentally, their approach is restricted to a finite set of potential meaning representations, and the grammar size grows at least linearly with the number of possible MRs, which in turn is inevitably exponential in the number of objects and actions in the domain."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-20",
"text": "The navigation task studied by Chen and Mooney (2011) provides much more ambiguous supervision."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-21",
"text": "In this task, each instructional sentence is paired with a formal landmarks plan (represented as a large graph) that includes a full description of the observed actions and world-states that result when someone follows this instruction."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-22",
"text": "An instruction generally refers to a subgraph of this large graph."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-23",
"text": "Therefore, there are a combinatorial number of possible meanings to which a given sentence can refer."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-24",
"text": "Chen and Mooney (2011) circumvent this combinatorial problem by never explicitly enumerating the exponential number of potential meanings for each sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-25",
"text": "Their system first induces a semantic lexicon that maps words and short phrases to formal representations of actions and objects in the world."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-26",
"text": "This lexicon is learned by finding words and phrases whose occurrence highly correlates with specific observed actions and objects in the simulated environment when executing the corresponding instruction."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-27",
"text": "This learned lexicon is then used to directly infer a formal MR for observed instructional sentences using a greedy covering algorithm."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-28",
"text": "These inferred MRs are then used to train a supervised semantic parser capable of mapping novel sentences to their formal meanings."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-29",
"text": "We present a novel enhancement of B\u00f6rschinger et al.'s PCFG approach that uses Chen and Mooney's lexicon learner to avoid a combinatorial explosion in the number of productions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-30",
"text": "The learned lexicon is first used to build a hierarchy of semantic lexemes (i.e. lexicon entries) called the Lexeme Hierarchy Graph (LHG) for each ambiguous landmarks plan in the training data."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-31",
"text": "The intuition behind utilizing an LHG is that the MR for each lexeme constitutes a semantic concept that corresponds to some naturallanguage (NL) word or phrase."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-32",
"text": "Therefore, the LHG represents how complex semantic concepts are composed of simpler semantic concepts and ultimately connected to NL words and phrases."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-33",
"text": "B\u00f6rschinger et al.'s approach instead produces NL groundings at the level of atomic MR constituents, which causes an explosion in the number of PCFG productions for complex MR languages."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-34",
"text": "We estimated that B\u00f6rschinger et al.'s approach would require more than 20! (> 10 18 ) productions for our navigation problem."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-35",
"text": "1 On the other hand, our method, which uses correspondences from the LHG at the semantic concept level, constructs a more focused PCFG of tractable size."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-36",
"text": "It then extracts the MR for a novel sentence from the most-probable parse tree for the resulting PCFG."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-37",
"text": "Our approach can produce a large, combinatorial number of different MRs for a wide range of novel sentences by composing relevant MR components from the resulting parse tree, whereas B\u00f6rschinger et al.'s approach is only able to output MRs that are explicitly included as a nonterminals in the original learned PCFG."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-38",
"text": "The remainder of the paper is organized as follows."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-39",
"text": "Section 2 reviews B\u00f6rschinger et al.'s PCFG approach as well as the navigation task and data."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-40",
"text": "Section 3 describes our enhanced PCFG approach and Section 4 presents an experimental evaluation of it."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-41",
"text": "Then, Section 5 discusses the unique aspects of our approach and Section 6 describes additional related work."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-42",
"text": "Finally, Section 7 presents future research directions and Section 8 gives our conclusions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-44",
"text": "**BACKGROUND**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-45",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-46",
"text": "**EXISTING PCFG APPROACH**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-47",
"text": "Our approach extends that of B\u00f6rschinger et al. (2011) , which in turn was inspired by a series of previous techniques (Lu et al., 2008; Liang et al., 2009; following the idea of constructing correspondences between NL and MR in a single probabilistic generative framework."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-48",
"text": "Particularly, their approach automatically constructs a PCFG that generates NL sentences from MRs, which indicates how atomic MR constituents are probabilistically related to NL words."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-49",
"text": "The nonterminals in the grammar correspond to complete MRs, MR constituents, and NL phrases."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-50",
"text": "The nonterminal for a composite MR generates each of its MR constituents, and each atomic MR, x, generates an NL phrase, P hrase x ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-51",
"text": "Each P hrase x then generates a sequence of W ord x 's for describing x, and each W ord x can generate each possible word in the natural language."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-52",
"text": "This allows the system to learn the words and phrases used to describe each atomic MR by properly weighting these rules."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-53",
"text": "Figure 1 shows one possible derivation tree for a sample NL-MR pair and the PCFG rules that are constructed for it."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-54",
"text": "Once a set of productions are assembled, their probabilities are learned using the Inside-Outside algorithm."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-55",
"text": "Computing the most probable parse for a novel sentence with the trained PCFG provides its Unfortunately, as discussed earlier, this approach only works for finite MR languages, and the grammar becomes intractably large even for finite but complex MRs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-56",
"text": "It effectively assumes that MRs are fairly small and includes every possible MR constituent as a nonterminal in the PCFG."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-57",
"text": "This is not tractable for more complex MRs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-58",
"text": "Therefore, our extension incorporates a learned lexicon to constrain the space of productions, thereby making the size of the PCFG tractable for complex MRs, and even giving it the ability to handle infinite MR languages."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-59",
"text": "Moreover, when processing novel sentences, our approach can produce a large space of novel MRs that were not anticipated during training, which is not the case for B\u00f6rschinger et al.'s approach."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-61",
"text": "**NAVIGATION TASK AND DATASET**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-62",
"text": "We employ the task and data introduced by Chen and Mooney (2011) whose goal is to interpret and follow NL navigation instructions in a virtual world."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-63",
"text": "Figure 2 shows a sample execution path in a particular virtual world."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-64",
"text": "The challenge is learning to perform this task by simply observing humans following instructions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-65",
"text": "Formally, given training data of the form {(e 1 , a 1 , w 1 ), ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-66",
"text": ". . , (e n , a n , w n )}, where e i is an NL instruction, a i is an observed action sequence, and w i is the current world state (patterns of floors and walls, positions of any objects, etc.), we want to produce the correct actions a j for a novel (e j , w j )."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-67",
"text": "In order to learn, their system infers the intended formal plan p i (the MR for a sentence) which produced the action sequence a i from the instruction e i ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-68",
"text": "However, there is a large space of possible plans for any given action sequence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-69",
"text": "Chen and Mooney first construct a formal landmarks plan, c i , for each a i , which is a graph representing the context of every action and the world-state encountered during the execution of the sequence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-70",
"text": "The correct plan MR, p i , is assumed to be a subgraph of c i , and this causes a combinatorial matching problem between e i and c i in order to learn the correct meaning of e i among all the possible subgraphs of c i ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-71",
"text": "The landmarks and correct plans for a sample instruction are shown in Figure 3 , illustrating the complexity of the MRs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-72",
"text": "Instead of directly solving the combinatorial correspondence problem, they first learn a semantic lex- icon that maps words and short phrases to small subgraphs representing their inferred meanings from the (e i , c i ) pairs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-73",
"text": "The lexicon is learned by evaluating pairs of n-grams, w_j, and MR graphs, m_j, and scoring them based on how much more likely m_j is a subgraph of the context c_i when w_j occurs in the corresponding instruction e_i."
},
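The lexicon-scoring step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: MR graphs are approximated as frozensets of component labels (so the subgraph test becomes a subset test), and the score is assumed to be a simple difference between the rate at which m appears in the context when the n-gram is present and its overall rate.

```python
# Sketch of cross-situational lexicon scoring (assumed scoring form).
# MR graphs are approximated as frozensets of component labels, so the
# subgraph test m <= c becomes a subset test.

def score_lexeme(ngram, mr, corpus):
    """corpus: list of (instruction_words, context_components) pairs.
    Returns how much more likely `mr` is part of the context when
    `ngram` appears in the instruction (difference of rates)."""
    with_w = [c for e, c in corpus if ngram in e]
    p_given_w = sum(mr <= c for c in with_w) / len(with_w) if with_w else 0.0
    p_overall = sum(mr <= c for _, c in corpus) / len(corpus)
    return p_given_w - p_overall

# Toy corpus; the MR component names are hypothetical.
corpus = [
    ({"turn", "left"}, frozenset({"Turn(LEFT)", "Travel()"})),
    ({"turn", "left"}, frozenset({"Turn(LEFT)"})),
    ({"walk", "forward"}, frozenset({"Travel()"})),
]
print(score_lexeme("left", frozenset({"Turn(LEFT)"}), corpus))
```

A positive score means the MR co-occurs with the n-gram more often than chance, which is the signal used to admit a lexeme into the lexicon.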
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-74",
"text": "This process is similar to other \"cross-situational\" approaches to learning word meanings (Siskind, 1996; Thompson and Mooney, 2003) ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-75",
"text": "Then, a plan refinement step estimates p_i from c_i by greedily selecting high-scoring lexemes of the form (w_j, m_j) whose words and phrases (w_j) cover the instruction e_i and introduce components (m_j) from the landmarks plan c_i."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-76",
"text": "The refined plans are used to construct supervised training data (e i , p i ) for a supervised semantic-parser learner."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-77",
"text": "The trained semantic parser can parse a novel instruction into a formal plan, which is finally executed for end-to-end evaluation."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-78",
"text": "Figure 4 illustrates the overall system."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-79",
"text": "As this figure indicates, our new PCFG method replaces the plan refinement and semantic parser components in their system with a unified model that both disambiguates the training data and learns a semantic parser."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-80",
"text": "We use the landmarks plans and the learned lexicon produced by Chen and Mooney (2011) as inputs to our system."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-81",
"text": "Like B\u00f6rschinger et al. (2011), our approach learns a semantic parser directly from ambiguous supervision, specifically NL instructions paired with their complete landmarks plans as context."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-82",
"text": "Our method incorporates the semantic lexemes as building blocks to find correspondences between NL words and semantic concepts represented by the lexeme MRs, instead of building connections between NL words and every possible MR constituent as in B\u00f6rschinger et al.'s approach."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-83",
"text": "Particularly, we utilize the hierarchical subgraph relationships between the MRs in the learned semantic lexicon to produce a smaller, more focused set of PCFG rules."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-84",
"text": "The intuition behind our approach is analogous to the hierarchical relations between nonterminals in syntactic parsing, where higher-level categories such as S, VP, or NP are further divided into smaller categories such as V, N, or Det, thereby forming a hierarchical structure."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-85",
"text": "Inspired by this idea, we introduce a directed acyclic graph called the Lexeme Hierarchy Graph (LHG) which represents the hierarchical relationships between lexeme MRs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-86",
"text": "Since complex lexeme MRs represent complicated semantic concepts while simple MRs represent simple concepts, it is natural to construct a hierarchy amongst them."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-87",
"text": "The LHGs for all of the training examples are used to construct production rules for the PCFG, which are then parametrized using EM."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-88",
"text": "Finally, a novel sentence is semantically parsed by computing its mostprobable parse using the trained PCFG, and then its MR is extracted from the resulting parse tree."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-91",
"text": "An LHG represents the hierarchy of lexical meanings relevant to a particular training instance by encoding the subgraph relations between the MRs of relevant lexemes."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-92",
"text": "Algorithm 1 describes how an LHG is constructed for an ambiguous training pair of a sentence and its corresponding context, (e i , c i )."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-93",
"text": "First, we obtain all relevant lexemes (w_ij, m_ij) in the lexicon L, where the MR m_ij is a subgraph of the context c_i (denoted m_ij \u2282 c_i)."
},
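The relevant-lexeme collection step can be sketched as below. This is an illustrative approximation under the same simplification as before: MRs are frozensets of components, the subgraph test is a subset test, and lexeme phrases are matched by simple substring containment; all names are hypothetical.

```python
# First step of Algorithm 1 (sketch): collect the lexemes relevant to a
# training pair (e_i, c_i), i.e. those whose phrase occurs in the
# instruction and whose MR is a subgraph (here: subset) of the context.

def relevant_lexemes(instruction, context, lexicon):
    """lexicon: list of (phrase, mr) pairs."""
    return [(w, m) for w, m in lexicon
            if w in instruction and m <= context]

lexicon = [
    ("turn left", frozenset({"Turn(LEFT)"})),
    ("walk", frozenset({"Travel()"})),
    ("chair", frozenset({"At(CHAIR)"})),
]
ctx = frozenset({"Turn(LEFT)", "Travel()"})
print(relevant_lexemes("turn left and walk", ctx, lexicon))
```

Only lexemes passing both checks become nodes of the LHG for this training pair.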
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-94",
"text": "These lexemes are"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-96",
"text": "**ALGORITHM 1 LEXEME HIERARCHY GRAPH (LHG)**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-97",
"text": "Input: training instance (e_i, c_i) and lexicon L. Output: lexeme hierarchy graph for (e_i, c_i). The initial LHG may contain nodes with too many children."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-98",
"text": "This is a problem, because when we subsequently extract PCFG rules, we need to add a production for every k-permutation of the children of each node (see Section 3.2)."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-99",
"text": "To reduce the branching factor in the LHG, we introduce pseudo-lexeme nodes by repeatedly combining the two most similar children of each node."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-100",
"text": "Pseudocode for the process is shown in Algorithm 2."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-101",
"text": "The MR for a pseudo-lexeme is the minimal graph, m', that is a supergraph of both of the lexeme MRs that it combines."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-103",
"text": "[Algorithm 2, continued: move T_i and T_j to be children of the new node; repeat until there are no more pairs to combine; then call RECONSTRUCTLHG(T_k) for all non-leaf children T_k of T.] The pair of most similar children, (m_i, m_j), is determined by measuring the fraction of the nodes in m_i and m_j that overlap with their minimum extension m', calculated as follows:"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-104",
"text": "where |m| is the number of nodes in the MR m."
},
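One plausible reading of this similarity (a sketch, since the formula itself is not reproduced here) is the fraction of the nodes of m_i and m_j that are shared with their minimal common supergraph m'. With MR graphs approximated as sets, m' is just the union:

```python
# Assumed similarity for choosing which children to merge into a
# pseudo-lexeme: fraction of the nodes of m_i and m_j covered by their
# minimal supergraph m' (the union, in this set approximation).

def similarity(m_i, m_j):
    m_prime = m_i | m_j  # minimal supergraph of both MRs
    return (len(m_i) + len(m_j)) / (2 * len(m_prime))

a = frozenset({"Turn(LEFT)", "Travel()"})
b = frozenset({"Turn(LEFT)"})
print(similarity(a, b))  # (2 + 1) / (2 * 2) = 0.75
```

Identical MRs score 1.0 and disjoint MRs score 0.5, so repeatedly merging the highest-scoring pair groups the most redundant children first.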
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-105",
"text": "Adding pseudo-lexemes also has another advantage."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-106",
"text": "They can be considered to be higher-level semantic concepts composed of two or more subconcepts."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-107",
"text": "These higher-level concepts will likely occur in other training examples as well, which allows for more flexible interpretations."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-108",
"text": "For example, assuming the rule A \u2192 BCD is constructed from an LHG, we will introduce a pseudo-lexeme E and build two rules A \u2192 BE and E \u2192 CD."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-109",
"text": "It is likely that E also occurs in another rule constructed from other training examples, such as E \u2192 FG. This increases the model's expressive power by supporting additional derivations such as A \u2192* BFG, providing more flexibility when parsing novel NL sentences."
},
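The A → BCD example above can be sketched as a rule rewrite. For simplicity this sketch combines the last two right-hand-side symbols, whereas the actual procedure combines the two most similar children; the nonterminal names are illustrative.

```python
# Sketch of pseudo-lexeme introduction: an n-ary rule A -> B C D is
# split into A -> B E and E -> C D under a fresh pseudo-lexeme E.
# (The paper combines the two *most similar* children; here we simply
# take the last two for illustration.)

def introduce_pseudo(lhs, rhs, fresh):
    if len(rhs) <= 2:
        return [(lhs, rhs)]
    e = fresh()
    return [(lhs, rhs[:-2] + [e]), (e, rhs[-2:])]

counter = iter(range(1000))
fresh = lambda: "PSEUDO%d" % next(counter)
rules = introduce_pseudo("A", ["B", "C", "D"], fresh)
print(rules)
```

Because the same pseudo-lexeme nonterminal can be reused across rules from different training examples, derivations such as A →* BFG become available.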
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-111",
"text": "**COMPOSING PCFG RULES**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-112",
"text": "The next step composes PCFG rules from the LHGs and is summarized in Figure 6 ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-113",
"text": "We basically follow the scheme of B\u00f6rschinger et al. (2011), but instead of generating NL words from each atomic MR, words are generated from each lexeme MR, and smaller lexeme MRs are generated from more complex ones as given by the LHGs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-114",
"text": "Figure 6 summarizes the rule generation process; NLs refers to the set of NL words in the corpus."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-115",
"text": "Lexeme rules come from the schemata of B\u00f6rschinger et al. (2011), and allow every lexeme MR to generate one or more NL words."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-116",
"text": "Note that pseudo-lexeme nodes do not produce NL words."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-118",
"text": "A nonterminal S_m is generated for the MR, m, of each LHG node."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-119",
"text": "Then, for every LHG node, T, with MR, m, we add rules of the form S_m \u2192 S_{m_i} ... S_{m_j}, where the RHS is some k-permutation of the nonterminals for the MRs of the children of node T."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-120",
"text": "B\u00f6rschinger et al. assume that every atomic MR generates at least one NL word."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-121",
"text": "However, since we do not know which subgraph of the overall context (i.e. c i , the MR of the root node) conveys the intended plan and is therefore expressed in the NL instruction, we must allow each ordered subset of the children of a node (i.e. each k-permutation) to be a possible generation."
},
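The k-permutation rule schema described above can be sketched directly with `itertools.permutations`; the nonterminal naming is illustrative:

```python
# Sketch of the production schema: for an LHG node with MR m and
# children m_1..m_k, emit S_m -> S_{m_i} ... S_{m_j} for every ordered
# non-empty subset (k-permutation) of the children.
from itertools import permutations

def child_rules(parent, children):
    rules = []
    for k in range(1, len(children) + 1):
        for perm in permutations(children, k):
            rules.append(("S_%s" % parent, ["S_%s" % c for c in perm]))
    return rules

rules = child_rules("m", ["m1", "m2"])
print(rules)
# [('S_m', ['S_m1']), ('S_m', ['S_m2']),
#  ('S_m', ['S_m1', 'S_m2']), ('S_m', ['S_m2', 'S_m1'])]
```

The rule count grows as the sum of k-permutations of the children, which is exactly why Algorithm 2 reduces the branching factor beforehand.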
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-122",
"text": "The rest of the process more closely follows B\u00f6rschinger et al.'s."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-123",
"text": "Every MR, m, of a lexeme node generates a rule S_m \u2192 Phrase_m, and every Phrase_m generates a sequence of NL words, including one or more \"content words\" (Word_m) for expressing m and zero or more \"extraneous\" words (Word_\u2205)."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-124",
"text": "While B\u00f6rschinger et al. have Word_m generate all possible NL words (each of which is subsequently weighted by EM training), in our approach, each Word_m only produces the NL phrase associated with m in the lexicon, or individual words that appear in this phrase."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-125",
"text": "The words not covered by Word_m can also be generated by Word_\u2205, which has rules for every word."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-126",
"text": "Ph_m and PhX_m ensure that Phrase_m produces at least one Word_m, where PhX_m indicates that one or more Word_m's have already been generated, and Ph_m indicates that no Word_m has yet been generated."
},
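The Ph_m / PhX_m bookkeeping can be sketched as a small set of productions. The exact rule set is an assumption: these productions merely encode the stated constraint that Phrase_m yields at least one content word among optional extraneous words.

```python
# Sketch of the phrase-level schema: Phrase_m must yield at least one
# content word Word_m, with optional extraneous words Word_NULL.  Ph_m
# means no content word has been produced yet; PhX_m means at least one
# has.  (Assumed productions, not the paper's exact rule set.)

def phrase_rules(m):
    P, Ph, PhX = "Phrase_%s" % m, "Ph_%s" % m, "PhX_%s" % m
    W, W0 = "Word_%s" % m, "Word_NULL"
    return [
        (P,   [Ph]),
        (Ph,  [W0, Ph]),   # skip an extraneous word; still owe a Word_m
        (Ph,  [W, PhX]),   # first content word produced
        (Ph,  [W]),        # phrase ends on its only content word
        (PhX, [W0, PhX]),  # any mix of words may follow
        (PhX, [W, PhX]),
        (PhX, [W0]),
        (PhX, [W]),
    ]

rules = phrase_rules("turn_left")
print(len(rules))  # 8
```

Every terminating derivation from Ph_m passes through a Word_m expansion, so the "at least one content word" invariant holds by construction.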
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-127",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-128",
"text": "**PARSING NOVEL NL SENTENCES**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-129",
"text": "To learn the parameters of the resulting PCFG, we use the Inside-Outside algorithm."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-130",
"text": "Then, the standard probabilistic CKY algorithm is used to produce the most probable parse for novel NL sentences (Jurafsky and Martin, 2000)."
},
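A minimal probabilistic CKY chart, of the kind used to score parses here, can be sketched as follows. The toy grammar and nonterminal names are illustrative (the induced PCFG is far larger), and the sketch assumes a CNF grammar with binary and unary lexical rules, returning only the best probability rather than the parse tree.

```python
# Minimal probabilistic CKY sketch for a CNF grammar.

def cky(words, lexical, binary):
    """lexical: {(A, word): prob}; binary: {(A, B, C): prob}.
    Returns the probability of the best parse of `words` rooted at 'Root'."""
    n = len(words)
    best = {}  # (i, j, A) -> prob of best derivation of words[i:j] from A
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, A)] = max(best.get((i, i + 1, A), 0.0), p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    lb = best.get((i, k, B), 0.0)
                    rc = best.get((k, j, C), 0.0)
                    if lb and rc and p * lb * rc > best.get((i, j, A), 0.0):
                        best[(i, j, A)] = p * lb * rc
    return best.get((0, n, "Root"), 0.0)

lexical = {("V", "turn"): 1.0, ("Dir", "left"): 1.0}
binary = {("Root", "V", "Dir"): 0.5}
print(cky(["turn", "left"], lexical, binary))  # 0.5
```

Keeping back-pointers alongside `best` would recover the most probable tree itself, which is what the MR-extraction step then consumes.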
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-131",
"text": "B\u00f6rschinger et al. (2011) simply read the MR, m, for a sentence off the top S_m nonterminal of the most probable parse tree."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-132",
"text": "However, in our approach, the correct MR is constructed by properly composing the appropriate subset of lexeme MRs from the most-probable parse tree."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-133",
"text": "This allows the system to produce a wide variety of novel MRs for novel sentences, as long as the correct MR is a subgraph of the complete context (c i ) for at least one of the training sentences."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-134",
"text": "First, the parse tree is pruned to remove all subtrees starting with Phrase_x nodes."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-135",
"text": "This leaves a tree consisting of the Root and a set of S m nodes."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-136",
"text": "The pruned subtrees only concern generating NL words and phrases from the selected MRs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-137",
"text": "The remaining tree shows which MR constituents were selected from the available context, from which the sentence is then generated."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-138",
"text": "Each leaf in the pruned tree represents an MR constituent that was used to generate a phrase in the sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-139",
"text": "These are the constituents we want to assemble and compose into a final MR for the sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-140",
"text": "Algorithm 3 describes the procedure for extracting the final MR from the pruned parse tree."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-141",
"text": "Figure 7 graphically depicts a sample trace of this algorithm."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-142",
"text": "The algorithm recursively traverses the parse tree."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-143",
"text": "When a leaf-node is reached, it marks all of the nodes in its MR."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-144",
"text": "After traversing all of its children, a node in the MR for the current parse-tree node is marked iff its corresponding node in any of the children's MRs was marked."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-145",
"text": "(Footnote 5: We used the implementation available at http://web.science.mq.edu.au/~mjohnson/Software.htm, which was also used by B\u00f6rschinger et al. (2011).)"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-147",
"text": "**ALGORITHM 3 CONSTRUCT PARSED MR RESULT**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-148",
"text": "Input: parse tree T for input NL, e, with all Phrase_x subtrees removed."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-149",
"text": "Output: semantic parse MR, m, for e. procedure OBTAINPARSEDOUTPUT(T): if T is a leaf then return MR(T) with all its nodes marked; for all children T_i of T do m_i \u2190 OBTAINPARSEDOUTPUT(T_i) and mark the nodes in MR(T) corresponding to the marked nodes in m_i; if T is not the root then return MR(T); otherwise return MR(T) with unmarked nodes removed."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-150",
"text": "The final output is the MR constructed by removing all of the unmarked nodes from the MR for the root node."
},
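The marking procedure of Algorithm 3 can be sketched with MRs approximated as sets of components. The correspondence between a child's MR nodes and the parent's is modeled here simply as shared component labels, and the tree encoding and component names are illustrative assumptions.

```python
# Sketch of Algorithm 3: each leaf marks its whole MR; an internal node
# marks the components that any child marked (correspondence modeled as
# shared labels); unmarked components are dropped at the root.

def extract_mr(tree):
    """tree: ('leaf', mr_set) or ('node', mr_set, [children])."""
    def walk(t):
        if t[0] == "leaf":
            return set(t[1])               # all of the leaf's nodes marked
        _, mr, children = t
        marked = set()
        for child in children:
            marked |= walk(child) & mr     # propagate marks upward
        return marked
    return walk(tree) & set(tree[1])       # remove unmarked root components

tree = ("node", {"Turn(LEFT)", "Travel()", "Verify(chair)"},
        [("leaf", {"Turn(LEFT)"}), ("leaf", {"Travel()"})])
print(sorted(extract_mr(tree)))  # ['Travel()', 'Turn(LEFT)']
```

Components of the context that no leaf lexeme expressed, such as the hypothetical `Verify(chair)` above, are pruned from the final MR.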
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-151",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-152",
"text": "**EXPERIMENTAL EVALUATION**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-153",
"text": "For evaluation, we used the same data and methodology as Chen and Mooney (2011); please see their paper for more details."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-155",
"text": "**DATA**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-156",
"text": "We used the English instructions and follower data collected by MacMahon et al. (2006) ."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-157",
"text": "This data contains 706 route instructions for three virtual worlds."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-158",
"text": "The instructions were produced by six instructors for 126 unique starting and ending location pairs spread evenly across the three worlds, and there were 1 to 15 human followers for each instruction who executed an average of 10.4 actions per instruction."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-159",
"text": "Each instruction is a paragraph consisting of an average of 5.0 sentences, each containing an average of 7.8 words."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-160",
"text": "Chen and Mooney constructed the additional single-sentence corpus by matching each sentence with the majority of human followers' actions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-161",
"text": "We use this single-sentence version for training, but use both the single-sentence and the original paragraph version for testing."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-162",
"text": "Each sentence was manually annotated with a \"gold standard\" execution plan, which is used for evaluation but not for training."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-163",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-164",
"text": "**METHODOLOGY AND RESULTS**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-165",
"text": "Experiments were conducted using \"leave one environment out\" cross-validation, training on two environments and testing on the third, averaging over all three test environments."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-166",
"text": "We perform direct comparison to the best results of Chen and Mooney (2011) (referred to as CM)."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-167",
"text": "A Wilcoxon signed-rank test is performed for statistical significance, and ' * ' denotes significant differences (p < .01) in the tables."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-168",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-169",
"text": "**SEMANTIC PARSING RESULTS**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-170",
"text": "We first evaluated how well our system learns to map novel NL sentences for new test environments into their correct MRs."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-171",
"text": "Partial semantic-parsing accuracy (Chen and Mooney, 2011) is measured by comparing the system's MR output to the hand-annotated gold standard."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-172",
"text": "Accuracy is measured in terms of precision, recall, and F1 for individual MR constituents (thereby awarding partial credit for approximately correct MRs)."
},
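The partial-credit metric described above can be sketched as set-based precision, recall, and F1 over MR constituents; the example constituents are hypothetical.

```python
# Partial-credit scoring over MR constituents: precision, recall, and
# F1 computed on the sets of constituents in predicted vs. gold MRs.

def prf1(predicted, gold):
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

pred = {"Turn(LEFT)", "Travel()"}
gold = {"Turn(LEFT)", "Travel()", "Verify(chair)"}
print(prf1(pred, gold))  # (1.0, 0.6666666666666666, 0.8)
```

An approximately correct MR (here, one missing a verification step) thus still earns partial credit rather than being scored as a full failure.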
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-173",
"text": "Table 1 demonstrates that our method outperforms CM by 6 points in F1."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-174",
"text": "Our PCFG-based approach is able to probabilistically disambiguate the training data as well as simultaneously learn a statistical semantic parser within a single framework."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-175",
"text": "This results in better overall performance compared to CM, since they lose potentially useful information, particularly during the refinement stage, due to the separate disjoint components of the system."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-176",
"text": "Table 2: Successful plan execution rates for novel test data ('*' denotes statistical significance). Single-sentence: our system 57.22%*, CM 54.40%; paragraph: our system 20.17%*, CM 16.18%."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-177",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-178",
"text": "**NAVIGATION PLAN EXECUTION RESULTS**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-179",
"text": "Next, we test the end-to-end system by executing the parsed navigation plans for test instructions in novel environments to see if they reach the exact desired destinations in the environment."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-180",
"text": "Table 2 shows the successful end-to-end navigation-task completion rate for both single-sentences and complete paragraph instructions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-181",
"text": "Again, our system outperforms CM's best results since more accurate semantic parsing produces more successful plans."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-182",
"text": "However, the difference in performance is smaller than that observed for semantic parsing."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-183",
"text": "This is because the redundancy in the human generated instructions allows an incorrect semantic parse to be successful, as long as the errors do not affect its ability to guide the system to the correct destination."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-184",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-185",
"text": "**DISCUSSION**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-186",
"text": "Our approach improves on B\u00f6rschinger et al.'s (2011) method in the following ways:"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-187",
"text": "\u2022 The building blocks for associating NL and MR are semantic lexemes instead of atomic MR constituents."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-188",
"text": "This prevents the number of constructed PCFG rules from becoming intractably large as happens with B\u00f6rschinger et al.'s approach."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-189",
"text": "As previously mentioned, lexeme MRs are intuitively analogous to syntactic categories in that complex lexeme MRs represent complicated semantic concepts whereas higher-level syntactic categories such as S, VP, or NP represent complex syntactic structures."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-190",
"text": "\u2022 Our approach has the ability to produce previously unseen MRs, whereas B\u00f6rschinger et al. can only generate an MR if it is explicitly included in the PCFG rules constructed from the training data."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-191",
"text": "Even though our MR parse is restricted to be a subgraph of some training context, c i , our model allows for exponentially many combinations."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-192",
"text": "In addition, our approach can produce a wider range of MR outputs than Chen and Mooney's (2011), even though we use their semantic lexicon as input."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-193",
"text": "Their system deterministically builds a supervised training set by greedily selecting highscoring lexemes, thus implicitly including only high-scoring lexemes during training."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-194",
"text": "On the other hand, our probabilistic approach also considers relatively low-scoring but useful lexemes, thereby utilizing more semantic concepts in the lexicon."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-195",
"text": "In particular, this explains why our approach obtains higher recall in the evaluation of semantic parsing."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-196",
"text": "Even though we have demonstrated our approach on the specific task of following navigation instructions, it is straightforward to apply it to other language-grounding tasks where NL sentences potentially refer to some subset of states, events, or actions in the world, as long as this overall context can be represented as a semantic graph or logical form."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-197",
"text": "Since the semantic lexicon is an input to our system, other approaches to lexicon learning are also easily incorporated."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-198",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-199",
"text": "**RELATED WORK**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-200",
"text": "Most work on learning semantic parsers that map natural-language sentences to formal representations of their meaning has relied upon fully supervised training data consisting of NL/MR pairs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Kate and Mooney, 2006; Wong and Mooney, 2007; Zettlemoyer and Collins, 2007; Lu et al., 2008; Zettlemoyer and Collins, 2009)."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-201",
"text": "Several recent approaches have investigated grounded learning from ambiguous supervision extracted from perceptual context."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-202",
"text": "A number of approaches (Kate and Mooney, 2007; Chen and Mooney, 2008; Chen et al., 2010; B\u00f6rschinger et al., 2011) assume training data consisting of a set of sentences each associated with a small set of MRs, one of which is usually the correct meaning of the sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-203",
"text": "Many of these approaches (Kate and Mooney, 2007; Chen and Mooney, 2008; Chen et al., 2010) disambiguate the data and match NL sentences to their correct MR by iteratively retraining a supervised semantic parser."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-204",
"text": "A generative semantic-parsing model has also been proposed that first chooses which MRs to describe and then generates a hybrid tree structure (Lu et al., 2008) containing both the MR and the NL sentence."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-205",
"text": "They train this model on ambiguous data using EM."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-206",
"text": "As previously discussed, B\u00f6rschinger et al. (2011) use a PCFG generative model and also train it on ambiguous data using EM."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-207",
"text": "Liang et al. (2009) assume each sentence maps to one or more semantic records (i.e. MRs), train a hierarchical semi-Markov generative model using EM, and then find a Viterbi alignment between NL words and records and their constituents."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-208",
"text": "Several recent projects (Branavan et al., 2009; Vogel and Jurafsky, 2010) use NL instructions to guide reinforcement learning from independent exploration with delayed rewards."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-209",
"text": "These systems do not even need the ambiguous supervision obtained from observing humans follow instructions; however, they do not learn semantic parsers that map sentences to complex, structural representations of their meaning."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-210",
"text": "Interpreting and executing NL navigation instructions is our primary task, and several other recent projects have studied related problems."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-211",
"text": "Shimizu and Haas (2009) present a system that parses natural language instructions into actions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-212",
"text": "However, they limit the number of possible actions to only 15 and treat the problem as a sequence labeling problem that is solved using a CRF with supervised training."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-213",
"text": "Matuszek et al. (2010) developed a system that learns to map NL instructions to executable commands for a robot navigating in an environment constructed by a laser range finder."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-214",
"text": "However, their approach is limited in that it ignores any objects or other landmarks in the environment to which the instructions might refer."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-215",
"text": "There are several recent projects (Vogel and Jurafsky, 2010; Kollar et al., 2010; Tellex et al., 2011) which learn to follow instructions in more linguistically complex environments."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-216",
"text": "However, they assume predefined spatial words, direct matching between NL words and the names of objects and other landmarks in the MR, and/or an existing syntactic parser."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-217",
"text": "By contrast, our work does not assume any prior linguistic knowledge, syntactic, lexical, or semantic, and must learn the mapping between NL words and phrases and the MR terms describing landmarks."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-218",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-219",
"text": "**FUTURE WORK**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-220",
"text": "In the future, we would like to develop a better lexicon learner since our PCFG approach critically relies on the quality of the learned lexicon."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-221",
"text": "Particularly, we would like to investigate how syntactic information (such as part-of-speech tags induced using unsupervised learning) could be used to improve semantic-lexicon learning."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-222",
"text": "For example, some of the current lexicon entries violate the general constraint that nouns usually refer to objects and verbs to actions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-223",
"text": "Ideally, the lexicon learner would be able to induce and then utilize this sort of relationship between syntax and semantics."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-224",
"text": "In addition, we want to investigate the use of discriminative reranking (Collins, 2000) , which has proven effective in various other NLP tasks."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-225",
"text": "We would expect the final MR output to improve if a discriminative model, which uses additional global features, is used to rerank the top-k parses produced by our generative PCFG model."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-226",
"text": "----------------------------------"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-227",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-228",
"text": "We have presented a novel method for learning a semantic parser given only highly ambiguous supervision."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-229",
"text": "Our model enhances B\u00f6rschinger et al.'s (2011) approach of reducing the problem of grounded learning of semantic parsers to PCFG induction."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-230",
"text": "We use a learned semantic lexicon to aid the construction of a smaller and more focused set of PCFG productions."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-231",
"text": "This allows the approach to scale to complex MR languages that define a large (potentially infinite) space of representations for capturing the meaning of sentences."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-232",
"text": "By contrast, the previous PCFG approach requires a finite MR language and its grammar grows intractably large for even moderately complex MR languages."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-233",
"text": "In addition, our algorithm for composing MRs from the final parse tree provides the flexibility to produce a wide range of novel MRs that were not seen during training."
},
{
"sent_id": "f8fc3634684ff37ab3d29cee910443-C001-234",
"text": "Evaluations on a previous corpus of navigational instructions for virtual environments have demonstrated the effectiveness of our method compared to a recent competing system."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"f8fc3634684ff37ab3d29cee910443-C001-12"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-202"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-206"
]
],
"cite_sentences": [
"f8fc3634684ff37ab3d29cee910443-C001-12",
"f8fc3634684ff37ab3d29cee910443-C001-202",
"f8fc3634684ff37ab3d29cee910443-C001-206"
]
},
"@EXT@": {
"gold_contexts": [
[
"f8fc3634684ff37ab3d29cee910443-C001-47"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-113"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-186"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-229"
]
],
"cite_sentences": [
"f8fc3634684ff37ab3d29cee910443-C001-47",
"f8fc3634684ff37ab3d29cee910443-C001-113",
"f8fc3634684ff37ab3d29cee910443-C001-186",
"f8fc3634684ff37ab3d29cee910443-C001-229"
]
},
"@SIM@": {
"gold_contexts": [
[
"f8fc3634684ff37ab3d29cee910443-C001-81"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-144",
"f8fc3634684ff37ab3d29cee910443-C001-145"
]
],
"cite_sentences": [
"f8fc3634684ff37ab3d29cee910443-C001-81",
"f8fc3634684ff37ab3d29cee910443-C001-145"
]
},
"@USE@": {
"gold_contexts": [
[
"f8fc3634684ff37ab3d29cee910443-C001-81"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-115"
],
[
"f8fc3634684ff37ab3d29cee910443-C001-144",
"f8fc3634684ff37ab3d29cee910443-C001-145"
]
],
"cite_sentences": [
"f8fc3634684ff37ab3d29cee910443-C001-81",
"f8fc3634684ff37ab3d29cee910443-C001-115",
"f8fc3634684ff37ab3d29cee910443-C001-145"
]
},
"@DIF@": {
"gold_contexts": [
[
"f8fc3634684ff37ab3d29cee910443-C001-113"
]
],
"cite_sentences": [
"f8fc3634684ff37ab3d29cee910443-C001-113"
]
}
}
},
"ABC_1cd671c60486a137377096cae435ec_10": {
"x": [
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-89",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-90",
"text": "**GATING SEARCH RESULTS**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-2",
"text": "We propose a novel neural attention architecture to tackle machine comprehension tasks, such as answering Cloze-style queries with respect to a document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-3",
"text": "Unlike previous models, we do not collapse the query into a single vector; instead, we deploy an iterative alternating attention mechanism that allows a fine-grained exploration of both the query and the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-4",
"text": "Our model outperforms state-of-the-art baselines in standard machine comprehension benchmarks such as CNN news articles and the Children's Book Test (CBT) dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-7",
"text": "Recently, the idea of training machine comprehension models that can read, understand, and answer questions about a text has come closer to reality principally through two factors."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-8",
"text": "The first is the advent of deep learning techniques (Goodfellow et al., 2016) , which allow manipulation of natural language beyond its surface forms and generalize beyond relatively small amounts of labeled data."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-9",
"text": "The second factor is the formulation of standard machine comprehension benchmarks based on Cloze-style queries (Hill et al., 2015; Hermann et al., 2015) , which permit fast integration loops between model conception and experimental evaluation."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-10",
"text": "Cloze-style queries (Taylor, 1953) are created by deleting a particular word in a natural-language statement."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-11",
"text": "The task is to guess which word was deleted."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-12",
"text": "In a pragmatic approach, recent work (Hill et al., 2015) formed such questions by extracting a sentence from a larger document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-13",
"text": "In contrast to considering a stand-alone statement, the system is now required to handle a larger amount of information that may possibly influence the prediction of the missing word."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-14",
"text": "Such contextual dependencies may also be injected by removing a word from a short human-crafted summary of a larger body of text."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-15",
"text": "The abstractive nature of the summary is likely to demand a higher level of comprehension of the original text (Hermann et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-16",
"text": "In both cases, the machine comprehension system is presented with an ablated query and the document to which the original query refers."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-17",
"text": "The missing word is assumed to appear in the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-18",
"text": "Encouraged by the recent success of deep learning attention architectures (Bahdanau et al., 2015; Sukhbaatar et al., 2015) , we propose a novel neural attention-based inference model designed to perform machine reading comprehension tasks."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-19",
"text": "The model first reads the document and the query using a recurrent neural network (Goodfellow et al., 2016) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-20",
"text": "Then, it deploys an iterative inference process to uncover the inferential links that exist between the missing query word, the query, and the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-21",
"text": "This phase involves a novel alternating attention mechanism; it first attends to some parts of the query, then finds their corresponding matches by attending to the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-22",
"text": "The result of this alternating search is fed back into the iterative inference process to seed the next search step."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-23",
"text": "This permits our model to reason about different parts of the query in a sequential way, based on the information that has been gathered previously from the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-24",
"text": "After a fixed number of iterations, the model uses a summary of its inference process to predict the answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-25",
"text": "This paper makes the following contributions."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-26",
"text": "We present a novel iterative, alternating attention mechanism that, unlike existing models (Hill et al., 2015; Kadlec et al., 2016) , does not compress the query to a single representation, but instead alternates its attention between the query and the document to obtain a fine-grained query representation within a fixed computation time."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-27",
"text": "Our architecture tightly integrates previous ideas related to bidirectional readers (Kadlec et al., 2016) and iterative attention processes (Hill et al., 2015; Sukhbaatar et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-28",
"text": "It obtains state-of-the-art results on two machine comprehension datasets and shows promise for application to a broad range of natural language processing tasks."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-29",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-30",
"text": "**TASK DESCRIPTION**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-31",
"text": "One of the advantages of using Cloze-style questions to evaluate machine comprehension systems is that a sufficient amount of training and test data can be obtained without human intervention."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-32",
"text": "The CBT (Hill et al., 2015) and CNN (Hermann et al., 2015) corpora are two such datasets."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-33",
"text": "The CBT corpus was generated from well-known children's books available through Project Gutenberg."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-34",
"text": "Documents consist of 20-sentence excerpts from these books."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-35",
"text": "The related query is formed from an excerpt's 21st sentence by replacing a single word with an anonymous placeholder token."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-36",
"text": "The dataset is divided into four subsets depending on the type of the word replaced."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-37",
"text": "The subsets are named entity, common noun, verb, and preposition."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-38",
"text": "We will focus our evaluation solely on the first two subsets, i.e. CBT-NE (named entity) and CBT-CN (common nouns), since the latter two are relatively simple as demonstrated by (Hill et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-39",
"text": "The CNN corpus was generated from news articles available through the CNN website."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-40",
"text": "The documents are given by the full articles themselves, which are accompanied by short, bullet-point summary statements."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-41",
"text": "Instead of extracting a query from the articles themselves, the authors replace a named entity within each article summary with an anonymous placeholder token."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-42",
"text": "For both datasets, the training and evaluation data consist of tuples (Q, D, A, a), where Q is the query (represented as a sequence of words), D is the document, A is the set of possible answers, and a \u2208 A is the correct answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-43",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-44",
"text": "**ALTERNATING ITERATIVE ATTENTION**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-45",
"text": "Our model is represented in Fig. 1 ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-46",
"text": "Its workflow has three steps."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-47",
"text": "First is the encoding phase, in which we compute a set of vector representations, acting as a memory of the content of the input document and query."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-48",
"text": "Next, the inference phase aims to untangle the complex semantic relationships linking the document and the query in order to provide sufficiently strong evidence for the answer prediction to be successful."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-49",
"text": "To accomplish this, we use an iterative process that, at each iteration, alternates attentive memory accesses to the query and the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-50",
"text": "Finally, the prediction phase uses the information gathered from the repeated attentions through the query and the document to maximize the probability of the correct answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-51",
"text": "We describe each of the phases in the following sections."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-52",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-53",
"text": "**BIDIRECTIONAL ENCODING**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-54",
"text": "The input to the encoding phase is a sequence of words X = (x_1, ..., x_{|X|}), such as a document or a query, drawn from a vocabulary V."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-55",
"text": "Each word is represented by a continuous word embedding x \u2208 R^d stored in a word embedding matrix X \u2208 R^{|V| \u00d7 d}."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-56",
"text": "The sequence X is processed using a recurrent neural network encoder (Goodfellow et al., 2016) with gated recurrent units (GRU) (Cho et al., 2014) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-57",
"text": "For each position i in the input sequence, the GRU takes as input the word embedding x_i and updates a hidden state. [Figure 1 caption:] Our model first encodes the query and the document by means of bidirectional GRU networks."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-58",
"text": "Then, it deploys an iterative inference mechanism that alternates between attending query encodings (1) and document encodings (2) given the query attended state."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-59",
"text": "The results of the alternating attention is gated and fed back into the inference GRU."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-60",
"text": "Even if the encodings are computed only once, the query representation is dynamic and changes throughout the inference process."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-61",
"text": "After a fixed number of steps T , the weights of the document attention are used to estimate the probability of the answer P (a|Q, D)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-62",
"text": "by:"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-63",
"text": "where h_i, r_i and u_i \u2208 R^h are the recurrent state, the reset gate and the update gate respectively, I_{r,u,h} \u2208 R^{h \u00d7 d} and H_{r,u,h} \u2208 R^{h \u00d7 h} are the parameters of the GRU, \u03c3 is the sigmoid function and \u00b7 is the element-wise multiplication."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-64",
"text": "The hidden state h_i acts as a representation of the word x_i in the context of the preceding sequence inputs; to also capture the context that follows x_i, we choose to process the sequence in reverse with an additional GRU (Kadlec et al., 2016)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-66",
"text": "Therefore, the encoding phase maps each token x_i to a contextual representation given by the concatenation of the forward and backward GRU hidden states."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-67",
"text": "We denote by q_i \u2208 R^{2h} and d_i \u2208 R^{2h} the contextual encodings for word i in the query Q and the document D respectively."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-69",
"text": "**ITERATIVE ALTERNATING ATTENTION**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-70",
"text": "This phase can be considered a means to uncover a possible inference chain that starts at the query and the document and leads to the answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-71",
"text": "The inference is modelled by an additional recurrent GRU network."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-72",
"text": "The recurrent network iteratively performs an alternating search step to gather information that may be useful to predict the answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-73",
"text": "In particular, at each time step: (1) it performs an attentive read on the query encodings, resulting in a query glimpse q_t, and (2) given the current query glimpse, it extracts a conditional document glimpse d_t, representing the parts of the document that are relevant to the current query glimpse."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-74",
"text": "In turn, both attentive reads are conditioned on the previous hidden state of the inference GRU, s_{t\u22121}, which summarizes the information that has been gathered from the query and the document up to time t. The inference GRU uses both glimpses to update its recurrent state and thus decides which information needs to be gathered to complete the inference process."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-75",
"text": "Query Attentive Read. Given the query encodings {q_i}, we formulate a query glimpse q_t at time step t by:"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-76",
"text": "where q_{i,t} are the query attention weights, A_q \u2208 R^{2h \u00d7 s}, where s is the dimensionality of the inference GRU state, and a_q \u2208 R^{2h}."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-77",
"text": "The attention we use here is similar to the formulation used in (Hill et al., 2015; Sukhbaatar et al., 2015) , but with two differences."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-78",
"text": "First, we use a bilinear term instead of a simple dot product in order to compute the importance of each query term in the current time step."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-79",
"text": "This simple bilinear attention has been successfully used in (Luong et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-80",
"text": "Second, we add a term a_q that biases the attention mechanism towards words which tend to be important across questions, independently of the search key s_{t\u22121}."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-81",
"text": "This is similar to what is achieved by the original attention mechanism proposed in (Bahdanau et al., 2015) without the burden of the additional tanh layer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-83",
"text": "**DOCUMENT ATTENTIVE READ**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-84",
"text": "The alternating attention continues by probing the document given the current query glimpse q t ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-85",
"text": "In particular, the document attention weights are computed based on both the previous search state and the currently selected query glimpse q t :"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-86",
"text": "where d_{i,t} are the attention weights for each word in the document, A_d \u2208 R^{2h \u00d7 (s+2h)} and a_d \u2208 R^{2h}."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-87",
"text": "Note that the document attention is also conditioned on s_{t\u22121}."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-88",
"text": "This allows the model to perform transitive reasoning on the document side, i.e. to use previously obtained document information to bias future attended locations, which is particularly important for natural language inference tasks (Sukhbaatar et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-91",
"text": "In order to update its recurrent state, the inference GRU may evolve on the basis of the information gathered from the current inference step, i.e."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-92",
"text": ", where f is defined in Eq. 1."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-93",
"text": "However, the current query glimpse may be too general or the document may not contain the information specified in the query glimpse, i.e. the query or the document attention weights may be nearly uniform."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-94",
"text": "We include a gating mechanism that is designed to reset the current query and document glimpses in the case that the current search is not fruitful."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-95",
"text": "Formally, we implement a gating mechanism"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-96",
"text": ", where \u00b7 is the element-wise multiplication and g : R^{s+6h} \u2192 R^{2h}."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-97",
"text": "The gate g takes the form of a 2-layer feed-forward network with sigmoid output unit activation."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-98",
"text": "The fourth argument of the gate takes into account multiplicative interactions between query and document glimpses, making it easier to determine the degree of matching between them."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-99",
"text": "Given a query gate g_q, producing r_q, and a document gate g_d, producing r_d, the inputs of the inference GRU are given by the reset versions of the query and document glimpses, i.e.,"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-100",
"text": ". Intuitively, the model reviews the query glimpse with respect to the contents of the document glimpse and vice versa."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-102",
"text": "**ANSWER PREDICTION**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-103",
"text": "After a fixed number of time steps T, the document attention weights obtained in the last search step, d_{i,T}, are used to predict the probability of the answer given the document and the query, P(a|Q, D)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-104",
"text": "Formally, we follow (Kadlec et al., 2016) and apply the \"pointer-sum\" loss:"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-105",
"text": "where I(a, D) is a set of positions where a occurs in the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-106",
"text": "The model is trained to maximize log P (a|Q, D) over the training corpus."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-107",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-108",
"text": "**TRAINING DETAILS**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-109",
"text": "To train our model, we used stochastic gradient descent with the ADAM optimizer (Kingma and Ba, 2014) , with an initial learning rate of 0.001."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-110",
"text": "We set the batch size to 32 and we decay the learning rate by 0.8 if the accuracy on the validation set does not increase after a half-epoch, i.e. 2000 batches for CBT and 5000 batches for CNN."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-111",
"text": "We initialize all weights of our model by sampling from the normal distribution N (0, 0.05)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-112",
"text": "Following (Saxe et al., 2013) , the GRU recurrent weights are initialized to be orthogonal and biases are initialized to zero."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-113",
"text": "In order to stabilize the learning, we clip the gradients if their norm is greater than 5 (Pascanu et al., 2013) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-114",
"text": "We performed a hyperparameter search with embedding regularization in {0.001, 0.0001}, inference steps T \u2208 {3, 5, 8}, embedding size d \u2208 {256, 384}, encoder size h \u2208 {128, 256} and inference GRU size s \u2208 {256, 512}. We regularize our model by applying dropout (Srivastava et al., 2014) to [Table 2 caption: Results on the CBT-NE (named entity) and CBT-CN (common noun) datasets.]"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-115",
"text": "Results marked with 1 are from (Hill et al., 2015) and those marked with 2 are from (Kadlec et al., 2016) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-116",
"text": "the inputs to both the query and the document attention mechanisms."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-117",
"text": "We found that setting embedding regularization to 0.0001, T = 8, d = 384, h = 128, s = 512 worked robustly across the datasets."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-118",
"text": "Our model is implemented in Theano (Bastien et al., 2012) , using the Keras (Chollet, 2015) library."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-119",
"text": "Computational Complexity Similar to previous state-of-the-art models (Kadlec et al., 2016; Chen et al., 2016) which use a bidirectional encoder, the major bottleneck of our method is computing the document and query encodings."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-120",
"text": "The alternating attention mechanism runs only for a fixed number of steps (T = 8 in our tests), which is orders of magnitude smaller than a typical document or query in our datasets (see Table 1 )."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-121",
"text": "The repeated attentions each require a softmax over \u223c1000 locations which is typically fast on recent GPU architectures."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-122",
"text": "Thus, our computation cost is comparable to that of (Kadlec et al., 2016; Chen et al., 2016), but we outperform both models on the datasets tested."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-124",
"text": "**RESULTS**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-125",
"text": "We report the results of our model on the CBT-CN, CBT-NE and CNN datasets, previously described in Section 2."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-126",
"text": "Table 2 reports our results on the CBT-CN and CBT-NE dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-127",
"text": "The Humans, LSTMs and Memory Networks (MemNNs) results are taken from (Hill et al., 2015) and the Attention-Sum Reader (AS Reader) is a state-of-the-art result recently obtained by (Kadlec et al., 2016) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-128",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-129",
"text": "**CBT**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-130",
"text": "Main result. Our model (line 7) sets a new state-of-the-art on the common noun category by gaining 3.6 and 5.6 points in validation and test over the best baseline, the AS Reader (line 5)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-131",
"text": "This performance gap is only partially reflected on the CBT-NE dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-132",
"text": "We observe that the 1.4 accuracy points on the validation set do not reflect better performance on the test set, which sits on par with the best baseline."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-133",
"text": "In CBT-NE, the missing word is a named entity appearing in the story which is likely to be less frequent than a common noun."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-134",
"text": "We found that approximately 27.5% of validation examples and 29.6% of test examples contain an answer that has never been predicted in the training set."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-135",
"text": "These numbers are considerably lower for the CBT-CN, for which only 2.5% and 4.6% of validation and test examples respectively contain an answer that has not been previously seen."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-136",
"text": "Ensembles Fusing multiple models generally achieves better generalization."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-137",
"text": "In order to investigate whether this could help achieving better held-out performance on CBT-NE, we adopt a simple strategy and average the predictions of 5 models trained with different random seeds (line 9)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-138",
"text": "In this case, our ensemble outperforms the AS Reader ensemble on both CBT-CN and CBT-NE, setting a new state-of-the-art for this task."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-139",
"text": "On CBT-NE, it achieves a validation and test performance of 76.9 and 72.0 accuracy points respectively (line 9). [Table footnote: results marked 2 are from (Hill et al., 2015), 3 from (Kadlec et al., 2016) and 4 from (Chen et al., 2016).]"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-140",
"text": "On CBT-CN, the ensemble shows improvements over the single model and sits at 74.1 on validation and 71.0 on test."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-141",
"text": "Fixed query attention. In order to measure the impact of the query attention step in our model, we constrain the query attention weights q_{i,t} to be uniform, i.e. q_{i,t} = 1/|Q|, for all t = 1, . . . , T (line 6)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-142",
"text": "This corresponds to fixing the query representation to the average pooling over the bidirectional query encodings and is similar in spirit to previous work (Kadlec et al., 2016; Chen et al., 2016) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-143",
"text": "By comparing line 6 and line 7, we see that the query attention mechanism allows improvements up to 2.3 points in validation and 4.9 points in test with respect to fixing the query representation throughout the search process."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-144",
"text": "A similar scenario was observed on the CNN dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-145",
"text": "Table 3 reports our results on the CNN dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-146",
"text": "We compare our model with a simple word distance model, the three neural approaches from (Hermann et al., 2015) (Deep LSTM Reader, Attentive Reader and Impatient Reader), and with the AS reader (Kadlec et al., 2016) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-147",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-148",
"text": "**CNN**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-149",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-150",
"text": "**MAIN RESULT**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-151",
"text": "The results show that our model (line 8) improves state-of-the-art accuracy by 4 percent absolute on validation and 3.4 on test with respect to the most recent published result (AS Reader) (line 7)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-152",
"text": "We also report the very recent results of the Stanford AR system that came to our attention during the writeup of this article (Chen et al., 2016 ) (line 9)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-153",
"text": "Our model slightly improves over this strong baseline by 0.2 percent on validation and 0.9 percent on test."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-154",
"text": "We note that the latter comparison may be influenced by different training and initialization strategies."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-155",
"text": "First, Stanford AR uses GloVe embeddings (Pennington et al., 2014), pre-trained on a large external corpus."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-156",
"text": "Second, the system normalizes the output probabilities only over the candidate answers in the document."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-157",
"text": "Ensembles We also report the results using ensembled models."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-158",
"text": "Similarly to the single-model case, our ensembles achieve state-of-the-art performance of 75.2 and 76.1 on validation and test respectively, outperforming previously published results."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-159",
"text": "Category analysis: (Chen et al., 2016) classified a sample of 100 CNN stories based on the type of inference required to guess the answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-160",
"text": "Categories that only require local context matching around the placeholder and the answer in the text are Exact Match, Paraphrasing, and Partial Clue, while those which require higher reasoning skills are Multiple Sentences and Ambiguous."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-161",
"text": "For example, in Exact Match examples, the question placeholder and the answer in the document share several neighboring exact words."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-162",
"text": "Category-specific results for the Stanford AR system and our system are reported in Table 4."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-163",
"text": "The first three categories require local context matching, the next two require global context matching, and Coreference Errors are unanswerable questions (Chen et al., 2016)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-164",
"text": "The local context matching categories are well tackled by the neural models, which perform similarly."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-165",
"text": "It seems that the iterative alternating attention inference is better able to solve more difficult examples such as Ambiguous/Hard."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-166",
"text": "One hypothesis is that, in contrast to Stanford AR, which uses only one fixed-query attention step, our iterative attention may better explore the documents and queries."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-167",
"text": "Finally, Coreference Errors (\u223c25% of the corpus) includes examples with critical coreference resolution errors which may make the questions \"unanswerable\"."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-168",
"text": "This is a barrier to achieving accuracies considerably above 75% (Chen et al., 2016) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-169",
"text": "If this estimate is accurate, our ensemble model (76.1%) may be approaching near-optimal performance on this dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-170",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-171",
"text": "**DISCUSSION**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-172",
"text": "We inspect the query and document attention weights for an example article from the CNN dataset."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-173",
"text": "The title of the article is \"Dante turns in his grave as Italian language declines\", and it discusses the decline of Italian language in schools."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-174",
"text": "The plot is shown in Figure 5.2, where locations attended to in the query and document are in the left and right column respectively."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-175",
"text": "Each row corresponds to an inference timestep 1 \u2264 t \u2264 8."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-176",
"text": "At the first step, the query attention focuses on the placeholder token, as its local context is generally important to discriminate the answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-177",
"text": "The model first focuses on @entity148, which corresponds to \"Greek\" in this article."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-178",
"text": "At this point, the model is still uncertain about other possible locations in the document: we can observe small weights across document locations for the query \"The approach to teaching @entity6 in @placeholder schools needs a makeover, she says\"."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-179",
"text": "At t = 2, the query attention moves towards \"schools\" and the model hesitates between \"Italian\" and \"European Union\" (@entity28, see step 3), both of which may satisfy the query."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-180",
"text": "At step 3, the most likely candidates are \"European Union\" and \"Rome\" (@entity159)."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-181",
"text": "As the timesteps unfold, the model learns that \"needs\" may be important to infer the correct entity, i.e. \"Italian\"."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-182",
"text": "The query sits on the same attended location, while the document attention evolves to become more confident about the answer."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-183",
"text": "We find that, across CBT and CNN examples, the query attention wanders near or focuses on the placeholder location, attempting to discriminate its identity using only local context."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-184",
"text": "For these particular datasets, the majority of questions can be answered after attending only to the words directly neighboring the placeholder."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-185",
"text": "This aligns with the findings of (Chen et al., 2016) concerning CNN, which state that the required reasoning and inference levels for this dataset are quite simple."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-186",
"text": "It would be worthwhile to formulate a dataset in which the placeholder is harder to infer using only local neighboring words, and thereby necessitates deeper query exploration."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-187",
"text": "Finally, across this work we fixed the number of inference steps T."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-188",
"text": "We found that using 8 timesteps works well consistently across the tested datasets."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-189",
"text": "However, we hypothesize that more (fewer) timesteps would benefit harder (easier) examples."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-190",
"text": "A straightforward extension of the model would be to dynamically select the number of inference steps conditioned on each example."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-191",
"text": "----------------------------------"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-192",
"text": "**RELATED WORK**"
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-193",
"text": "Neural attention models have been applied recently to a sm\u00f6rg\u00e5sbord of machine learning and natural language processing problems."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-194",
"text": "These include, but are not limited to, handwriting recognition (Graves, 2013) , digit classification (Mnih et al., 2014) , machine translation (Bahdanau et al., 2015) , question answering (Sukhbaatar et al., 2015; Hermann et al., 2015) and caption generation (Xu et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-195",
"text": "In general, attention models keep a memory of states that can be accessed at will by learned attention policies."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-196",
"text": "In our case, the memory is represented by the set of document and query contextual encodings."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-197",
"text": "Our model is closely related to (Sukhbaatar et al., 2015; Kumar et al., 2015; Hermann et al., 2015; Kadlec et al., 2016; Hill et al., 2015) , which were also applied to question answering."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-198",
"text": "The pointer-style attention mechanism that we use to perform the final answer prediction has been proposed by (Kadlec et al., 2016) , which in turn was based on the earlier Pointer Networks of (Vinyals et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-199",
"text": "However, differently from our work, (Kadlec et al., 2016) perform only one attention step and embed the query into a single vector representation, corresponding to the concatenation of the last state of the forward and backward GRU networks."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-200",
"text": "To our knowledge, embedding the query into a single vector representation is a choice that is shared by most machine reading comprehension models."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-201",
"text": "In our model, the repeated, tight integration between query attention and document attention allows the model to explore dynamically which parts of the query are most important to predict the answer, and then to focus on the parts of the document that are most salient to the currently-attended query components."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-202",
"text": "A similar attempt in attending different components of the query may be found in (Hermann et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-203",
"text": "In that model, the document is processed once for each query word."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-204",
"text": "This can be computationally intractable for large documents, since it involves unrolling a bidirectional recurrent neural network over the entire document multiple times."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-205",
"text": "In contrast, our model only estimates query and document encodings once and can learn how to attend different parts of those encodings in a fixed number of steps."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-206",
"text": "The inference network is responsible for making sense of the current attention step with respect to what has been gathered before."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-207",
"text": "In addition to achieving state-of-the-art performance, this technique may also prove to be more scalable than alternative query attention models."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-208",
"text": "Finally, our iterative inference process shares similarities to the iterative hops in Memory Networks (Sukhbaatar et al., 2015; Hill et al., 2015) ."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-209",
"text": "In that model, the query representation is updated iteratively from hop to hop, although its different components are not attended to separately."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-210",
"text": "Moreover, we substitute the simple linear update with a GRU network."
},
{
"sent_id": "1cd671c60486a137377096cae435ec-C001-211",
"text": "The gating mechanism of the GRU network made it possible to use multiple steps of attention and to propagate the learning signal effectively back through to the first timestep."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"1cd671c60486a137377096cae435ec-C001-9"
],
[
"1cd671c60486a137377096cae435ec-C001-12"
],
[
"1cd671c60486a137377096cae435ec-C001-31",
"1cd671c60486a137377096cae435ec-C001-32"
]
],
"cite_sentences": [
"1cd671c60486a137377096cae435ec-C001-9",
"1cd671c60486a137377096cae435ec-C001-12",
"1cd671c60486a137377096cae435ec-C001-32"
]
},
"@DIF@": {
"gold_contexts": [
[
"1cd671c60486a137377096cae435ec-C001-26"
],
[
"1cd671c60486a137377096cae435ec-C001-77",
"1cd671c60486a137377096cae435ec-C001-78",
"1cd671c60486a137377096cae435ec-C001-79",
"1cd671c60486a137377096cae435ec-C001-80"
]
],
"cite_sentences": [
"1cd671c60486a137377096cae435ec-C001-26",
"1cd671c60486a137377096cae435ec-C001-77"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"1cd671c60486a137377096cae435ec-C001-27"
],
[
"1cd671c60486a137377096cae435ec-C001-115"
]
],
"cite_sentences": [
"1cd671c60486a137377096cae435ec-C001-27",
"1cd671c60486a137377096cae435ec-C001-115"
]
},
"@MOT@": {
"gold_contexts": [
[
"1cd671c60486a137377096cae435ec-C001-38"
]
],
"cite_sentences": [
"1cd671c60486a137377096cae435ec-C001-38"
]
},
"@SIM@": {
"gold_contexts": [
[
"1cd671c60486a137377096cae435ec-C001-77",
"1cd671c60486a137377096cae435ec-C001-78",
"1cd671c60486a137377096cae435ec-C001-79",
"1cd671c60486a137377096cae435ec-C001-80"
],
[
"1cd671c60486a137377096cae435ec-C001-197"
],
[
"1cd671c60486a137377096cae435ec-C001-208"
]
],
"cite_sentences": [
"1cd671c60486a137377096cae435ec-C001-77",
"1cd671c60486a137377096cae435ec-C001-197",
"1cd671c60486a137377096cae435ec-C001-208"
]
},
"@USE@": {
"gold_contexts": [
[
"1cd671c60486a137377096cae435ec-C001-127"
]
],
"cite_sentences": [
"1cd671c60486a137377096cae435ec-C001-127"
]
}
}
},
"ABC_02521fd9721c264ee05315dec9b31d_10": {
"x": [
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-2",
"text": "We present pre-training approaches for selfsupervised representation learning of speech data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-3",
"text": "A BERT-style masked language model loss on discrete features is compared with an InfoNCE-based contrastive loss on continuous speech features."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-4",
"text": "The pre-trained models are then fine-tuned with a Connectionist Temporal Classification (CTC) loss to predict target character sequences."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-5",
"text": "To study the impact of stacking multiple feature-learning modules trained using different self-supervised loss functions, we test the discrete and continuous BERT pre-training approaches on spectral features and on learned acoustic representations, showing synergistic behaviour between the acoustically motivated and masked language model loss functions."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-6",
"text": "In low-resource conditions using only 10 hours of labeled data, we achieve Word Error Rates (WER) of 10.2% and 23.5% on the standard test \"clean\" and \"other\" benchmarks of the Librispeech dataset, which is almost on par with previously published work that uses 10 times more labeled data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-7",
"text": "Moreover, compared to previous work that uses two models in tandem (Baevski et al., 2019b), by using one model for both BERT pre-training and fine-tuning, our model provides an average relative WER reduction of 9%."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-8",
"text": "1"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-11",
"text": "Representation learning has been an active research area for more than 30 years (Hinton et al., 1986), with the goal of learning high-level representations which separate different explanatory factors of the phenomena represented by the input data (LeCun et al., 2015; Bengio et al., 2013)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-12",
"text": "Disentangled representations give models an exponentially greater ability to generalize to new conditions from small amounts of labeled data, by combining multiple sources of variation."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-13",
"text": "We will open-source the code for our models."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-14",
"text": "Building Automatic Speech Recognition (ASR) systems, for example, requires a large volume of training data to represent different factors contributing to the creation of speech signals, e.g. background noise, recording channel, speaker identity, accent, emotional state, topic under discussion, and the language used in communication."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-15",
"text": "The practical need for building ASR systems for new conditions with limited resources spurred a lot of work focused on unsupervised speech recognition and representation learning (Park and Glass, 2008; Glass; et."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-16",
"text": "al., a, f; van den Oord et al., 2018), in addition to semi- and weakly-supervised learning techniques aiming at reducing the supervised data needed in real-world scenarios (Vesely et al.; Li et al., b; Krishnan Parthasarathi and Strom; Chrupa\u0142a et al.; Kamper et al., 2017)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-17",
"text": "Recently impressive results have been reported for representation learning, that generalizes to different downstream tasks, through self-supervised learning for text and speech (Devlin et al., 2018; Baevski et al., 2019a; van den Oord et al., 2018; Baevski et al., 2019b) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-18",
"text": "Self-supervised representation learning is done through tasks to predict masked parts of the input, reconstruct inputs through low bit-rate channels, or contrast similar data points against different ones."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-19",
"text": "Different from (Baevski et al., 2019b), where a BERT-like model is trained with the masked language model loss, frozen, and then used as a feature extractor in tandem with a final fully supervised convolutional ASR model (Collobert et al., 2016), in this work our \"Discrete BERT\" approach achieves an average relative Word Error Rate (WER) reduction of 9% by pre-training and fine-tuning the same BERT model using a Connectionist Temporal Classification (Graves et al.) loss."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-20",
"text": "In addition, we present a new approach for pre-training bi-directional transformer models on continuous speech data using the InfoNCE loss (van den Oord et al., 2018), dubbed \"continuous BERT\"."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-21",
"text": "To understand the nature of their learned representations, we train models using the continuous and the discrete BERT approaches on spectral features, e.g. Mel-frequency cepstral coefficients (MFCC), as well as on pre-trained wav2vec features."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-22",
"text": "These comparisons provide insights into how complementary the acoustically motivated contrastive loss function is to the masked language model loss."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-23",
"text": "Unsupervised and semi-supervised ASR approaches are in need of test suites like the unified downstream tasks available for language representation models (Devlin et al., 2018)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-24",
"text": "(L\u00fcscher et al., 2019) evaluated semi-supervised self-labeling WER performance on the standard test \"clean\" and test \"other\" benchmarks of the Librispeech dataset (Panayotov et al., 2015) when using only a 100-hour subset as labeled data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-25",
"text": "(Baevski et al., 2019b; van den Oord et al., 2018) use the same 960h Librispeech data as unlabeled pre-training data; however, they use Phone Error Rates (PER) on the 3h TIMIT dataset (Garofolo et al., 1993) as their performance metric."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-26",
"text": "The zero-resource ASR literature (et."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-27",
"text": "al., a; Chaabouni et al.) uses the ABX task to evaluate the quality of learned features."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-28",
"text": "To combine the best of these evaluation approaches, we pre-train our models on the unlabeled 960h Librispeech data, with a close-to-zero supervised set of only 1 hour and 10 hours, sampled equally from the \"clean\" and \"other\" conditions of Librispeech."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-29",
"text": "Then, we report final WER performance on its standard dev and test sets."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-30",
"text": "Using our proposed approaches, we achieve a best WER of 10.2% and 23.5% on the clean and other subsets respectively, which is competitive with previous work that uses 100h of labeled data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-31",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-32",
"text": "**PRELIMINARIES**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-33",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-34",
"text": "**BERT**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-35",
"text": "Using self-supervision, BERT (Devlin et al., 2018) , a deep bidirectional transformer model, builds its internal language representation that generalizes to other downstream NLP tasks."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-36",
"text": "Self-attention over the whole input word sequence enables BERT to jointly condition on both the left and right context of data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-37",
"text": "For training, it uses both a masked language model loss, by randomly removing some input words for the model to predict, and a contrastive loss to distinguish the next sentence in the document from a randomly selected one."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-38",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-39",
"text": "**WAV2VEC**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-40",
"text": "Wav2vec learns representations of audio data by solving a self-supervised context-prediction task with the same loss function as word2vec (Mikolov et al., 2013; van den Oord et al., 2018) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-41",
"text": "The model is based on two convolutional neural networks: the encoder f : X \u2192 Z produces a representation z_i for each time step i at a rate of 100 Hz, and the aggregator g : Z \u2192 C combines multiple encoder time steps into a new representation c_i for each time step i. Given c_i, the model is trained to distinguish a sample z_{i+k} that is k steps in the future from distractor samples z\u0303 drawn from a distribution p_n, by minimizing the contrastive loss for steps k = 1, ..., K:"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-42",
"text": "where T is the sequence length, \u03c3(x) = 1/(1 + exp(\u2212x)), and \u03c3(z_{i+k}^\u22a4 h_k(c_i)) is the probability of z_{i+k} being the true sample."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-43",
"text": "A step-specific affine transformation h_k(c_i) = W_k c_i + b_k is applied to c_i (van den Oord et al., 2018)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-44",
"text": "The loss L = \u2211_{k=1}^{K} L_k is optimized by summing (1) over different step sizes."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-45",
"text": "The learned high level features produced by the context network c i are shown to be better acoustic representations for speech recognition compared to standard spectral features."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-46",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-47",
"text": "**VQ-WAV2VEC**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-48",
"text": "vq-wav2vec (Baevski et al., 2019b) learns vector quantized (VQ) representations of audio data using a future time-step prediction task."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-49",
"text": "Similar to wav2vec, there are convolutional encoder and aggregator networks f : X \u2192 Z and g : \u1e90 \u2192 C for feature extraction and aggregation."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-50",
"text": "However, in between them there is a quantization module q : Z \u2192\u1e90 to build discrete representations which are input to the aggregator."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-51",
"text": "First, 30ms segments of raw speech are mapped to a dense feature representation z at a stride of 10ms using the encoder f."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-52",
"text": "Next, the quantizer q turns these dense representations into discrete indices, which are mapped to a reconstruction \u1e91 of the original representation z. The reconstruction \u1e91 is fed into the aggregator g, and the model is optimized via the same context prediction task as wav2vec (cf."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-53",
"text": "\u00a72.2)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-54",
"text": "The quantization module replaces the original representation z by \u1e91 = e_i from a fixed-size codebook e \u2208 R^{V \u00d7 d} which contains V representations of size d."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-55",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-56",
"text": "**APPROACH**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-57",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-58",
"text": "**DISCRETE BERT**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-59",
"text": "Our work builds on the recently proposed approach of (Baevski et al., 2019b), where audio is quantized using a contrastive loss and features are then learned on top by a BERT model (Devlin et al., 2018)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-60",
"text": "For the vq-wav2vec quantization, we use the gumbel-softmax vq-wav2vec model with the same setup as described in (Baevski et al., 2019b)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-61",
"text": "This model quantizes the Librispeech dataset into 13.5k unique codes."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-62",
"text": "To understand the impact of the acoustic representations baked into the wav2vec features, we explore, as alternatives, quantizing the standard mel-frequency cepstral coefficients (MFCC) and log-mel filterbank coefficients (FBANK), choosing a subset small enough to fit into GPU memory and running k-means with 13.5k centroids (to match the vq-wav2vec setup) to convergence."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-63",
"text": "We then assign the index of the closest centroid to represent each time-step."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-64",
"text": "We train a standard BERT model (Devlin et al., 2018) with only the masked language modeling task on each set of inputs, in the same way as described in (Baevski et al., 2019b): namely, by choosing tokens for masking with probability 0.05, expanding each chosen token to a span of 10 masked tokens (spans may overlap), and then computing a cross-entropy loss which attempts to maximize the likelihood of predicting the true token for each one that was masked (Figure 1a)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-66",
"text": "**CONTINUOUS BERT**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-67",
"text": "A masked language modeling task cannot be performed with continuous inputs and outputs, as there are no targets to predict in place of the masked tokens."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-68",
"text": "Instead of reconstructing the input as in (van den Oord et al., 2017), we classify the masked positive example among a set of negatives."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-69",
"text": "The inputs to the model are dense wav2vec"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-70",
"text": "features, MFCC or FBANK features, representing 10ms of audio data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-71",
"text": "Some of these inputs are replaced with a mask embedding and are then fed into a transformer encoder."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-72",
"text": "We then compute the dot product between the outputs corresponding to each masked input, the true input that was masked, and a set of negatives sampled from other masked inputs within the same batch."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-73",
"text": "The model is optimized with the InfoNCE loss (van den Oord et al., 2018) where, given one positive sample z_i and N negative samples z\u0303, we minimize:"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-74",
"text": "where each sample z_i is computed as a dot product of the output of the model at timestep i and the true unmasked value of the positive example at timestep i or a randomly sampled negative example."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-75",
"text": "To stabilize training, we add the squared sum of logits produced by the dot product to the loss, and then apply a soft clamp \u015d_i = \u03bb tanh(s_i/\u03bb) to each logit s_i to prevent the model's tendency to continually increase the magnitude of logits during training (Bachman et al., 2019)."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-77",
"text": "**SUPERVISED FINE-TUNING**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-78",
"text": "The pre-trained models are fine-tuned to perform the ASR task by adding a randomly initialized linear projection on top of the features computed by the transformer models into V classes representing the vocabulary of the task."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-79",
"text": "The vocabulary is 29 tokens for character targets plus a word boundary token."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-80",
"text": "The models are optimized by minimizing the CTC loss."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-81",
"text": "Fine-tuning requires only a few epochs on a single GPU."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-83",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-84",
"text": "All of our experiments are implemented by extending the fairseq toolkit."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-86",
"text": "**DATA**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-87",
"text": "All of our experiments are performed by pretraining on 960 hours of Librispeech (Panayotov et al., 2015) training set, fine-tuning on labeled 10 hours and 1 hour sets sampled equally from the two conditions of the training set, and evaluating on the standard dev and test splits."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-89",
"text": "**MODELS**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-90",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-91",
"text": "**QUANTIZED INPUTS TRAINING**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-92",
"text": "We first train the vq-wav2vec quantization model following the gumbel-softmax recipe described in (Baevski et al., 2019b) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-93",
"text": "After training this model, we use it to quantize the audio data. For quantizing MFCC and log-mel filterbanks, we first compute dense features using scripts from the Kaldi (Povey) toolkit."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-94",
"text": "We then compute 13.5k K-Means centroids, to match the number of unique tokens produced by the vq-wav2vec model, using 8 32GB Volta GPUs."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-95",
"text": "To fit into GPU memory, we subsample 50% of MFCC features and 25% of FBANK features from the training set before running the K-Means algorithm."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-96",
"text": "The model we use for the masked language modeling task is a standard BERT model with 12 layers, model dimension 768, inner dimension (FFN) 3072 and 12 attention heads (Devlin et al., 2018) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-97",
"text": "The learning rate is warmed up over the first 10,000 updates to a peak value of 1 \u00d7 10 \u22125 , and then linearly decayed over a total of 250k updates."
},
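The warmup-then-linear-decay schedule just described can be written as a small step-to-learning-rate function. `lr_at` is an illustrative name for this sketch, not a fairseq API:

```python
def lr_at(step, peak=1e-5, warmup=10_000, total=250_000):
    """Linear warmup to `peak` over the first `warmup` updates,
    then linear decay to zero by update `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))
```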
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-98",
"text": "We train on 128 GPUs with a batch size of 3072 tokens per GPU giving a total batch size of 393k tokens (Ott et al., 2018) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-99",
"text": "Each token represents 10ms of audio data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-100",
"text": "To mask the input sequence, we follow (Baevski et al., 2019b) and randomly sample p = 0.05 of all tokens to be a starting index, without replacement, and mask M = 10 consecutive tokens from every sampled index; spans may overlap."
},
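One natural reading of the masking procedure above can be sketched as follows; note this toy version draws a fixed number of starts, round(p \u00b7 T), without replacement, rather than per-token Bernoulli sampling, and the helper name is illustrative:

```python
import numpy as np

def sample_span_mask(seq_len, p=0.05, M=10, rng=None):
    """Pick round(p * seq_len) starting indices without replacement and
    mask M consecutive tokens from each start; overlapping spans merge."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_starts = int(round(p * seq_len))
    starts = rng.choice(seq_len, size=n_starts, replace=False)
    mask = np.zeros(seq_len, dtype=bool)
    for s in starts:
        mask[s:s + M] = True  # spans near the end are simply truncated
    return mask

mask = sample_span_mask(200)
```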
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-101",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-102",
"text": "**CONTINUOUS INPUTS TRAINING**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-103",
"text": "For training on dense features, we use a model similar to a standard BERT model with the same parameterization as the one used for quantized input training, but we use the wav2vec, MFCC or FBANK inputs directly."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-104",
"text": "We add 128 relative positional embeddings at every multi-head attention block as formulated in (Dai et al., 2019) instead of fixed positional embeddings to ease handling longer examples."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-105",
"text": "We train this model on only 8 GPUs, with a batch size of 9600 inputs per GPU resulting in a total batch size of 76,800."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-106",
"text": "We find that increasing the number of GPUs (which increases the effective batch size) does not lead to better results with this particular setup."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-107",
"text": "Wav2vec features are 512-dimensional, while MFCC features have 39 dimensions and Logmel features have 80."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-108",
"text": "We introduce a simple linear projection from the feature dimension to BERT dimension (768) for all models."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-109",
"text": "Similarly to the approach in 4.2.1, we choose time-steps to mask by randomly sampling, with-out replacement, p = 0.05 of all time-steps to be a starting index, and mask M = 10 consecutive timesteps from every sampled index; spans may overlap."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-110",
"text": "We sample 10 negative examples from other masked time-steps from the same example, and an additional 10 negative examples from masked time-steps occurring anywhere in the batch."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-111",
"text": "We compute a dot product between the original features and the output corresponding to the same time-step after they are processed by the BERT model."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-112",
"text": "We add the squared sum of logits from these computations multiplied by \u03bb = 0.04 to the loss, and then apply a smooth clamp by recomputing each logit s i = 20 tanh(s i /20)."
},
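The squared-logit penalty and the smooth clamp s_i = 20 tanh(s_i / 20) described above can be sketched in NumPy. The helper name is illustrative and this is not the authors' fairseq implementation:

```python
import numpy as np

def penalized_clamped_logits(logits, lam=0.04):
    """Return (clamped logits, auxiliary penalty): the penalty is
    lam * sum(logits**2), and each logit is softly clamped into
    (-20, 20) via s_i = 20 * tanh(s_i / 20)."""
    penalty = lam * np.sum(logits ** 2)
    clamped = 20.0 * np.tanh(logits / 20.0)
    return clamped, penalty

logits = np.array([0.0, 5.0, 100.0])
clamped, penalty = penalized_clamped_logits(logits)
```

The clamp keeps large contrastive scores bounded (a logit of 100 maps to just under 20) while leaving small logits nearly unchanged, and the penalty discourages the logits from growing large in the first place.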
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-113",
"text": "The learning rate is warmed up over the first 10,000 updates to a peak value of 1 \u00d7 10 \u22125 , and then linearly decayed over a total of 250k updates."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-114",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-115",
"text": "**METHODOLOGY**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-116",
"text": "For quantized inputs, we compute token indices using the gumbel-softmax based vq-wav2vec model."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-117",
"text": "For MFCC and FBANK features we take the index of the closest centroid (as measured by finding the minimum Euclidean distance) to each corresponding feature in the Librispeech dataset."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-118",
"text": "We then train a BERT model as descirbed in \u00a74.2.1."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-119",
"text": "For wav2vec continuous inputs, we use features extracted by the publicly available wav2vec model which contains 6 convolutional blocks in the feature extractor and 11 convolutional blocks in the aggregator module."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-120",
"text": "We use the outputs of the aggregator as features."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-121",
"text": "For MFCC and FBANK, we use those features directly after applying a single linear projection to upsample them to the model dimensionality."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-122",
"text": "We fine-tune our pre-trained models on 1 or 10 hours of labelled data sampled from the Librispeech training set."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-123",
"text": "We use the standard CTC loss and train for up to 20k updates."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-124",
"text": "We find that the pre-trained models converge after only around 4k updates, while the models trained from scratch tend to converge much later, around 18k updates."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-125",
"text": "We fine-tune all models with learning rate of 0.0001 that is linearly warmed up over the first 2k updates and then annealed following a cosine learning rate schedule over the last 18k updates."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-126",
"text": "We set the dropout of the pre-trained BERT models to 0.1 and sweep on dropout of the BERT model outputs before the final projection layer over values between 0.0 and 0.4 in increments of 0.1."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-127",
"text": "For each model, we choose a single best checkpoint that has the best loss on the validation set, which is a combination of dev-clean and dev-other standard Librispeech splits."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-128",
"text": "We use the publicly available wav2letter++ (Pratap et al., 2019) decoder integrated into the Fairseq framework with the official Librispeech 4gram language model."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-129",
"text": "We run a sweep on weights for language model score, word score and silence token weights for each model, where parameters are chosen randomly and evaluated on the devother Librispeech set."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-130",
"text": "We use the weights found by these sweeps to evaluate and report results for all other splits."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-131",
"text": "The sweeps are run with beam size of 250, while the final decoding is done with beam size of 1500."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-132",
"text": "The quantized BERT models have a limit of 2048 source tokens due to their use of fixed positional embeddings."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-133",
"text": "During training we discard longer examples and during evaluation we discard randomly chosen tokens from each example until they are at most 2048 tokens long."
},
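The evaluation-time length handling just described, randomly discarding tokens until an example fits the 2048-token positional-embedding limit, can be sketched as below; `truncate_random` is a hypothetical helper name, and the sketch assumes surviving tokens keep their original order:

```python
import numpy as np

def truncate_random(tokens, max_len=2048, rng=None):
    """Randomly discard tokens until the example fits the fixed
    positional-embedding limit, keeping survivors in original order."""
    rng = np.random.default_rng(0) if rng is None else rng
    tokens = np.asarray(tokens)
    if len(tokens) <= max_len:
        return tokens
    keep = np.sort(rng.choice(len(tokens), size=max_len, replace=False))
    return tokens[keep]

t = truncate_random(np.arange(3000))
```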
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-134",
"text": "We expect that increasing the size of the fixed positional embeddings, or switching to relative positional embeddings will improve performance on longer examples, but in this work we wanted to stay consistent with the setup in Baevski et al. (2019b) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-135",
"text": "The tandem model which uses the features extracted from the pre-trained BERT models is a character-based Wav2Letter setup of (Zeghidour et al., 2018) which uses seven consecutive blocks of convolutions (kernel size 5 with 1.000 \u00d7 10 3 channels), followed by a PReLU nonlinearity and a dropout rate of 1 \u00d7 10 \u22121 ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-136",
"text": "The final representation is projected to a 28-dimensional probability over the vocabulary and decoded using the standard 4gram language model following the same protocol as for the fine-tuned models Table 1 presents WERs of different input features and pre-training methods on the standard Librispeech clean and other subsets using 10 hours and 1 hour of labeled data for fine-tuning."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-137",
"text": "Compared to the two-model tandem system proposed in (Baevski et al., 2019b) , which uses a the discrete BERT features to train another ASR system from scratch, our discrete BERT model provides an average of 13% and 6% of WER reduction on clean and other subsets respectively, by pre-training and fine-tuning the same BERT model on the 10h labeled set."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-138",
"text": "The wav2vec inputs represent one level of unsupervised feature discovery, which provides a better space for quantization compared to raw spectral features."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-139",
"text": "The discrete BERT training augments the wav2vec features with a higher level of representation that captures the sequential structure of the full utterance through the masked language modeling loss."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-140",
"text": "On the other hand, the continuous BERT training, given its contrastive InforNCE loss, can be viewed as another level of acoustic representations that captures longer range regularities."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-141",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-142",
"text": "**RESULTS**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-143",
"text": "Using the MFCC and FBANK as inputs to the continuous and discrete BERT models provide insights on the synergies of different levels of acoustic and language model representations."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-144",
"text": "Similar to the observations in (Mohamed et al., 2012) , the FBANK features are more friendly to unsupervised local acoustic representation learning methods like continuous BERT, leading to consistent gains compared to MFCC features for both 10h and 1h sets."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-145",
"text": "model plays the role of a language model and input wav2vec features learns high level acoustic representations, in the very low-resource condition of 1h fine-tuning, the average relative improvement between quantized FBANK and Wav2vec inputs is larger in the \"clean\" subsets -55%, which require better local acoustic representations, compared to 45% WER reduction for the noisy \"other\" subsets that rely more on the global language modeling capabilities."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-146",
"text": "With wav2vec features providing good acoustic representations, the discrete BERT model provides an average of about 28% relative improvement over the continuous BERT model for the 10h fine-tuning condition."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-147",
"text": "We believe the reason is due to the complementary nature of the discrete BERT language modelling loss and the wav2vec acoustically motivated pre-training, as opposed to the relatively redundant acoustic pre-training losses of the continious BERT and wav2vec."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-148",
"text": "In the 1h fine-tuning case, however, better local acoustic features provide more gains in the \"clean\" subsets compared to the \"other\" ones, following the same trend of the quantized FBANK and wav2vec features under the same conditions."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-149",
"text": "Table 2 shows the competitive performance of the discrete BERT approach compared to previously published work which is fine-tuned on more than 10 times the labeled data."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-151",
"text": "**ABLATIONS**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-152",
"text": "To understand the value of self-supervision in our setup, Table 3 shows WERs for both continuous and discrete input features fine-tuned from random weights, without BERT pre-training, using (Table 4 )."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-153",
"text": "Adding a second layer of representation more than halved the WER, with more gains observed in the \"clean\" subset as also observed in 4.4."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-155",
"text": "**DISCUSSION AND RELATED WORK**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-156",
"text": "The the success of BERT (Devlin et al., 2018) and Word2Vec (Mikolov et al., 2013) for NLP tasks motivated more research on self-supervised approaches for acoustic word embedding and unsupervised acoustic feature representation (Bengio and Heigold; Levin et al.; Chung et al., b; He et al.; van den Oord et al., 2018; Baevski et al., 2019b) , either by predicting masked discrete or continuous input, or by contrastive prediction of neighboring or similarly sounding segments using distant supervision or proximity in the audio signal as an indication of similarity."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-157",
"text": "In (Kamper et al.) a dynamic time warping alignment is used to discover similar segment pairs."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-158",
"text": "Our work is inspired by the research efforts in reducing the dependence on labeled data for building ASR systems through unsupervised unit discovery and acoustic representation leaning (Park and Glass, 2008; Glass; et."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-159",
"text": "al., a,f) , and through multiand cross-lingual transfer learning in low-resource conditions (et."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-160",
"text": "al., c,d,b; Ghoshal et al.; Huang et al.; et."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-161",
"text": "al., e) , and semi-supervised learning (Vesely et al.; Li et al., b; Krishnan Parthasarathi and Strom; Li et al., a) ."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-162",
"text": "----------------------------------"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-163",
"text": "**CONCLUSION AND FUTURE WORK**"
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-164",
"text": "We presetned two variations, continuous and discrete, of BERT models that are pre-trained on the librispeech 960h data and fine-tuned for speech recognition rather than used as feature extractor in tandem with another ASR system."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-165",
"text": "Along with the discrete-input BERT model, we used a contrastive loss for training a continuous variant of BERT."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-166",
"text": "The acoustic and language modeling roles in the system are played by the vq-wav2vec and the BERT components respectively."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-167",
"text": "Our ablation experiments showed the contribution and importance of each component for final ASR performance."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-168",
"text": "Our system is able to reach final WER of 10.2% and 23.5% on the standard Librispeech test clean and other sets, respectively, using only 10h of labeled data, almost matching the 100h supervised baselines."
},
{
"sent_id": "02521fd9721c264ee05315dec9b31d-C001-169",
"text": "Our future directions include testing our model on 1000x larger volume of unlabeled data that is more acoustically challenging, along with multi-and crosslingual transfer learning extensions."
}
],
"y": {
"@DIF@": {
"gold_contexts": [
[
"02521fd9721c264ee05315dec9b31d-C001-7"
],
[
"02521fd9721c264ee05315dec9b31d-C001-19"
],
[
"02521fd9721c264ee05315dec9b31d-C001-137"
]
],
"cite_sentences": [
"02521fd9721c264ee05315dec9b31d-C001-7",
"02521fd9721c264ee05315dec9b31d-C001-19",
"02521fd9721c264ee05315dec9b31d-C001-137"
]
},
"@BACK@": {
"gold_contexts": [
[
"02521fd9721c264ee05315dec9b31d-C001-17"
],
[
"02521fd9721c264ee05315dec9b31d-C001-48"
],
[
"02521fd9721c264ee05315dec9b31d-C001-156"
]
],
"cite_sentences": [
"02521fd9721c264ee05315dec9b31d-C001-17",
"02521fd9721c264ee05315dec9b31d-C001-48",
"02521fd9721c264ee05315dec9b31d-C001-156"
]
},
"@EXT@": {
"gold_contexts": [
[
"02521fd9721c264ee05315dec9b31d-C001-59"
]
],
"cite_sentences": [
"02521fd9721c264ee05315dec9b31d-C001-59"
]
},
"@SIM@": {
"gold_contexts": [
[
"02521fd9721c264ee05315dec9b31d-C001-60"
],
[
"02521fd9721c264ee05315dec9b31d-C001-64"
]
],
"cite_sentences": [
"02521fd9721c264ee05315dec9b31d-C001-60",
"02521fd9721c264ee05315dec9b31d-C001-64"
]
},
"@USE@": {
"gold_contexts": [
[
"02521fd9721c264ee05315dec9b31d-C001-60"
],
[
"02521fd9721c264ee05315dec9b31d-C001-64"
],
[
"02521fd9721c264ee05315dec9b31d-C001-92"
],
[
"02521fd9721c264ee05315dec9b31d-C001-100"
],
[
"02521fd9721c264ee05315dec9b31d-C001-134"
]
],
"cite_sentences": [
"02521fd9721c264ee05315dec9b31d-C001-60",
"02521fd9721c264ee05315dec9b31d-C001-64",
"02521fd9721c264ee05315dec9b31d-C001-92",
"02521fd9721c264ee05315dec9b31d-C001-100",
"02521fd9721c264ee05315dec9b31d-C001-134"
]
}
}
},
"ABC_473cf4603dea14ff89ca12d6e0cb50_10": {
"x": [
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-2",
"text": "Deep neural networks reach state-of-the-art performance for wide range of natural language processing, computer vision and speech applications."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-3",
"text": "Yet, one of the biggest challenges is running these complex networks on devices such as mobile phones or smart watches with tiny memory footprint and low computational capacity."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-4",
"text": "We propose on-device Self-Governing Neural Networks (SGNNs), which learn compact projection vectors with local sensitive hashing."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-5",
"text": "The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-6",
"text": "We conduct extensive evaluation on dialog act classification and show significant improvement over state-of-the-art results."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-7",
"text": "Our findings show that SGNNs are effective at capturing low-dimensional semantic text representations, while maintaining high accuracy."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-8",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-9",
"text": "**INTRODUCTION**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-10",
"text": "Deep neural networks are one of the most successful machine learning methods outperforming many state-of-the-art machine learning methods in natural language processing (Sutskever et al., 2014) , speech and visual recognition tasks (Krizhevsky et al., 2012) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-11",
"text": "The availability of high performance computing has enabled research in deep learning to focus largely on the development of deeper and more complex network architectures for improved accuracy."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-12",
"text": "However, the increased complexity of the deep neural networks has become one of the biggest obstacles to deploy deep neural networks ondevice such as mobile phones, smart watches and IoT (Iandola et al., 2016) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-13",
"text": "The main challenges with developing and deploying deep neural network models on-device are (1) the tiny memory footprint, (2) inference latency and (3) significantly low computational capacity compared to high performance computing systems such as CPUs, GPUs and TPUs on the cloud."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-14",
"text": "There are multiple strategies to build lightweight text classification models for ondevice."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-15",
"text": "One can create a small dictionary of common input \u2192 category mapping on the device and use a naive look-up at inference time."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-16",
"text": "However, such an approach does not scale to complex natural language tasks involving rich vocabularies and wide language variability."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-17",
"text": "Another strategy is to employ fast sampling techniques (Ahmed et al., 2012; Ravi, 2013) or incorporate deep learning models with graph learning like (Bui et al., 2017 (Bui et al., , 2018 , which result in large models but have proven to be extremely powerful for complex language understanding tasks like response completion (Pang and Ravi, 2012) and Smart Reply (Kannan et al., 2016) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-18",
"text": "In this paper, we propose Self-Governing Neural Networks (SGNNs) inspired by projection networks (Ravi, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-19",
"text": "SGNNs are on-device deep learning models learned via embedding-free projection operations."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-20",
"text": "We employ a modified version of the locality sensitive hashing (LSH) to reduce input dimension from millions of unique words/features to a short, fixed-length sequence of bits."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-21",
"text": "This allows us to compute a projection for an incoming text very fast, on-the-fly, with a small memory footprint on the device since we do not need to store the incoming text and word embeddings."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-22",
"text": "We evaluate the performance of our SGNNs on Dialogue Act classification, because (1) it is an important step towards dialog interpretation and conversational analysis aiming to understand the intent of the speaker at every utterance of the conversation and (2) deep learning methods reached state-of-the-art (Lee and Dernoncourt, 2016; Khanpour et al., 2016; Tran et al., 2017; Ortega and Vu, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-23",
"text": "The main contributions of the paper are:"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-24",
"text": "\u2022 Novel Self-Governing Neural Networks (SGNNs) for on-device deep learning for short text classification."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-25",
"text": "\u2022 Compression technique that effectively captures low-dimensional semantic text representation and produces compact models that save on storage and computational cost."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-26",
"text": "\u2022 On the fly computation of projection vectors that eliminate the need for large pre-trained word embeddings or vocabulary pruning."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-27",
"text": "\u2022 Exhaustive experimental evaluation on dialog act datasets, outperforming state-of-theart deep CNN (Lee and Dernoncourt, 2016) and RNN variants (Khanpour et al., 2016; Ortega and Vu, 2017 )."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-29",
"text": "**SELF-GOVERNING NEURAL NETWORKS**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-30",
"text": "We model the Self-Governing network using a projection model architecture (Ravi, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-31",
"text": "The projection model is a simple network with dynamically-computed layers that encodes a set of efficient-to-compute operations which can be performed directly on device for inference."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-32",
"text": "The model defines a set of efficient \"projection\" functions P( x i ) that project each input instance x i to a different space \u2126 P and then performs learning in this space to map it to corresponding outputs y p i ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-33",
"text": "A very simple projection model comprises just few operations where the inputs x i are transformed using a series of T projection functions P 1 , ..., P T followed by a single layer of activations."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-34",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-35",
"text": "**MODEL ARCHITECTURE**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-36",
"text": "In this work, we design a Self-Governing Neural Network (SGNN) using multi-layered localitysensitive projection model."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-37",
"text": "Figure 1 shows the model architecture of the on-device SGNN network."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-38",
"text": "The self-governing property of this network stems from its ability to learn a model (e.g., a classifier) without having to initialize, load or store any feature or vocabulary weight matrices."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-39",
"text": "In this sense, our method is a truly embedding-free approach unlike majority of the widely-used stateof-the-art deep learning techniques in NLP whose performance depends on embeddings pre-trained on large corpora."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-40",
"text": "Instead, we use the projection functions to dynamically transform each input to a low-dimensional representation."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-41",
"text": "Furthermore, we stack this with additional layers and non-linear activations to achieve deep, non-linear combinations of projections that permit the network to learn complex mappings from inputs x i to outputs y i ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-42",
"text": "An SGNN network is shown below:"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-43",
"text": "where, i p refers to the output of projection operation applied to input x i , h p is applied to projection output, h t is applied at intermediate layers of the network with depth k followed by a final softmax activation layer at the top."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-44",
"text": "In a k-layer SGNN, h t , where t = p, p + 1, ..., p + k \u2212 1 refers to the k subsequent layers after the projection layer."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-45",
"text": "W p , W t , W o and b p , b t , b o represent trainable weights and biases respectively."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-46",
"text": "The projection transformations use precomputed parameterized functions, i.e., they are not trained during the learning process, and their outputs are concatenated to form the hidden units for subsequent operations."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-47",
"text": "Each input text x i is converted to an intermediate feature vector (via raw text features such as skip-grams) followed by projections."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-48",
"text": "On-the-fly Computation."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-49",
"text": "The transformation step F dynamically extracts features from the raw input."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-50",
"text": "Text features (e.g., skip-grams) are converted into feature-ids f j (via hashing) to generate a sparse feature representation x i of feature-id, weight pairs (f j , w j ) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-51",
"text": "This intermediate feature representation is passed through projection functions P to construct projection layer i p in SGNN."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-52",
"text": "For this last step, a projection vector P k is first constructed on-the-fly using a hash function with feature ids f j in x i and fixed seed as input, then dot product of the two vectors < x i , P k > is computed and transformed into binary representation P k ( x i ) using sgn(.) of the dot product."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-53",
"text": "As shown in Figure 1 , both F and P steps are computed on-the-fly, i.e., no word-embedding or vocabulary/feature matrices need to be stored and looked up during training or inference."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-54",
"text": "Instead feature-ids and projection vectors are dynamically computed via hash functions."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-55",
"text": "For intermediate feature weights w j , we use observed counts in each input text and do not use pre-computed statistics like idf."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-56",
"text": "Hence the method is embedding-free."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-57",
"text": "Model Optimization."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-58",
"text": "The SGNN network is trained from scratch on the task data using a supervised loss defined wrt ground truth\u0177 i :"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-59",
"text": "During training, the network learns to choose and apply specific projection operations P j (via activations) that are more predictive for a given task."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-60",
"text": "The choice of the type of projection matrix P as well as representation of the projected space \u2126 P has a direct effect on computation cost and model size."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-61",
"text": "We leverage an efficient randomized projection method and use a binary representation {0, 1} d for \u2126 P ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-62",
"text": "This yields a drastically lower memory footprint both in terms of number and size of parameters."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-63",
"text": "Computing Projections."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-64",
"text": "We employ an efficient randomized projection method for the projection step."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-65",
"text": "We use locality sensitive hashing (LSH) (Charikar, 2002) to model the underlying projection operations in SGNN."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-66",
"text": "LSH is typically used as a dimensionality reduction technique for clustering (Manning et al., 2008) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-67",
"text": "LSH allows us to project similar inputs x i or intermediate network layers into hidden unit vectors that are nearby in metric space."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-68",
"text": "We use repeated binary hashing for P and apply the projection vectors to transform the input x i to a binary hash representation denoted by P k ( x i ) \u2208 {0, 1}, where"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-69",
"text": "This results in a dbit vector representation, one bit corresponding to each projection row P k=1...d ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-70",
"text": "The same projection matrix P is used for training and inference."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-71",
"text": "We never need to explicitly store the random projection vector P k since we can compute them on the fly using hash functions over feature indices with a fixed row seed rather than invoking a random number generator."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-72",
"text": "This also permits us to perform projection operations that are linear in the observed feature size rather than the overall feature or vocabulary size which can be prohibitively large for high-dimensional data, thereby saving both memory and computation cost."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-73",
"text": "Thus, SGNN can efficiently model highdimensional sparse inputs and large vocabulary sizes common for text applications instead of relying on feature pruning or other pre-processing heuristics employed to restrict input sizes in standard neural networks for feasible training."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-74",
"text": "The binary representation is significant since this results in a significantly compact representation for the projection network parameters that in turn considerably reduces the model size."
},
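The on-the-fly projection described in the passage above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the hash function (md5), the per-row seed scheme, and the sign trick are assumptions standing in for the unspecified LSH details.

```python
import hashlib

def projection_bits(features, T=80, d=14):
    """Sketch of on-the-fly LSH projection: T concatenated d-bit signatures.

    `features` maps a feature (e.g. a skip-gram) to its weight. No row of the
    projection matrix P is ever stored: the dot product of row P_k with the
    sparse input is simulated by hashing (row seed, feature) pairs, so the
    cost is linear in the number of *observed* features, not vocabulary size.
    """
    bits = []
    for j in range(T):
        for k in range(d):
            seed = f"{j}-{k}"  # fixed row seed (assumption) instead of an RNG
            acc = 0.0
            for feat, weight in features.items():
                h = hashlib.md5(f"{seed}|{feat}".encode()).digest()
                sign = 1.0 if h[0] % 2 == 0 else -1.0  # pseudo-random +/-1 entry
                acc += sign * weight
            bits.append(1 if acc >= 0 else 0)  # keep only the sign bit
    return bits  # length T * d (1120 bits for T=80, d=14)

sig = projection_bits({"the_cat": 1.0, "cat_sat": 1.0}, T=4, d=3)
```

Because the seeds are fixed, the same P is implicitly reused at training and inference time, matching the description above.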
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-75",
"text": "SGNN Parameters."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-76",
"text": "In practice, we employ T different projection functions P j=1...T , each resulting in d-bit vector that is concatenated to form the projected vector i p in Equation 5."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-77",
"text": "T and d vary depending on the projection network parameter configuration specified for P and can be tuned to trade-off between prediction quality and model size."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-78",
"text": "Note that the choice of whether to use a single projection matrix of size T \u00b7 d or T separate matrices of d columns depends on the type of projection employed (dense or sparse)."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-79",
"text": "For the intermediate feature step F in Equation 5, we use skip-gram features (3-grams with skip-size=2) extracted from raw text."
},
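For illustration, skip-gram features of the kind described (3-grams with skip-size 2) could be extracted as below; the exact windowing and tokenization the paper uses are not specified, so this is one plausible reading.

```python
from itertools import combinations

def skipgram_features(tokens, n=3, skip=2):
    """Extract n-gram features allowing up to `skip` skipped tokens.

    For each anchor token, candidates are drawn from the next n - 1 + skip
    positions, and every choice of n - 1 of them (in order) forms a feature,
    so gaps of at most `skip` positions are allowed inside each n-gram.
    """
    feats = set()
    for i in range(len(tokens)):
        window = tokens[i + 1 : i + n + skip]  # candidates after the anchor
        for combo in combinations(range(len(window)), n - 1):
            feats.add((tokens[i],) + tuple(window[c] for c in combo))
    return feats

feats = skipgram_features(["a", "b", "c", "d"], n=3, skip=2)
# contains ("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")
```

These sparse features would then be fed to the projection step F in Equation 5.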
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-80",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-81",
"text": "**TRAINING AND INFERENCE**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-82",
"text": "We use the compact bit units to represent the projection in SGNN."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-83",
"text": "During training, the network learns to move the gradients for points that are nearby to each other in the projected bit space \u2126 P in the same direction."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-84",
"text": "SGNN network is trained end-to-end using backpropagation."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-85",
"text": "Training can progress efficiently with stochastic gradient descent with distributed computing on highperformance CPUs or GPUs."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-86",
"text": "Complexity."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-87",
"text": "The overall complexity for SGNN inference, governed by the projection layer, is O(n \u00b7 T \u00b7 d), where n is the observed feature size (*not* overall vocabulary size) which is linear in input size, d is the number of LSH bits specified for each projection vector P k , and T is the number of projection functions used in P. The model size (in terms of number of parameters) and memory storage required for the projection inference step is O(T \u00b7 d \u00b7 C), where C is the number of hidden units in h p in the multi-layer projection network and typically smaller than T \u00b7 d."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-88",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-89",
"text": "**DATASETS AND EXPERIMENTAL SETUP 3.1 DATA DESCRIPTION**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-90",
"text": "We conduct our experimental evaluation on two dialog act benchmark datasets."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-91",
"text": "\u2022 SWDA: Switchboard Dialog Act Corpus (Godfrey et al., 1992; Jurafsky et al., 1997) is a popular open domain dialogs corpus between two speakers with 42 dialogs acts."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-92",
"text": "\u2022 MRDA: ICSI Meeting Recorder Dialog Act Corpus (Adam et al., 2003; Shriberg et al., 2004 ) is a dialog corpus of multiparty meetings with 5 tags of dialog acts."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-93",
"text": "Table 1 summarizes dataset statistics."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-94",
"text": "We use the train, validation and test splits as defined in (Lee and Dernoncourt, 2016; Ortega and Vu, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-95",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-96",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-97",
"text": "We setup our experimental evaluation, as follows: given a classification task and a dataset, we generate an on-device model."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-98",
"text": "The size of the model can be configured (by adjusting the projection matrix P) to fit in the memory footprint of the device, i.e. a phone has more memory compared to a smart watch."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-99",
"text": "For each classification task, we report Accuracy on the test set."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-100",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-101",
"text": "**HYPERPARAMETER AND TRAINING**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-102",
"text": "For both datasets we used the following: 2-layer SGNN (P T =80,d=14 \u00d7 FullyConnected 256 \u00d7 FullyConnected 256 ), mini-batch size of 100, dropout rate of 0.25, learning rate was initialized to 0.025 with cosine annealing decay (Loshchilov and Hutter, 2016) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-103",
"text": "Unlike prior approaches (Lee and Dernoncourt, 2016; Ortega and Vu, 2017 ) that rely on pre-trained word embeddings, we learn the projection weights on the fly during training, i.e word embeddings (or vocabularies) do not need to be stored."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-104",
"text": "Instead, features are computed on the fly and are dynamically compressed via the projection matrices into projection vectors."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-105",
"text": "These values were chosen via a grid search on development sets, we do not perform any other dataset-specific tuning."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-106",
"text": "Training is performed through stochastic gradient descent over shuffled mini-batches with Nesterov momentum optimizer (Sutskever et al., 2013) , run for 1M steps."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-107",
"text": "Tables 2 and 3 show results on the SwDA and MRDA dialog act datasets."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-108",
"text": "Overall, our SGNN model consistently outperforms the baselines and prior state-of-the-art deep learning models."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-109",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-110",
"text": "**RESULTS**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-112",
"text": "**BASELINES**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-113",
"text": "We compare our model against a majority class baseline and Naive Bayes classifier (Lee and Dernoncourt, 2016) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-114",
"text": "Our model significantly outperforms both baselines by 12 to 35% absolute."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-115",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-116",
"text": "**COMPARISON AGAINST STATE-OF-ART METHODS**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-117",
"text": "We also compare our performance against prior work using HMMs (Stolcke et al., 2000) and recent deep learning methods like CNN (Lee and Dernoncourt, 2016) , RNN (Khanpour et al., 2016) and RNN with gated attention (Tran et al., 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-118",
"text": "To the best of our knowledge, (Lee and Dernoncourt, 2016; Ortega and Vu, 2017; Tran et al., 2017) are the latest approaches in dialog act classification, which also reported on the same data splits."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-119",
"text": "Therefore, we compare our research against these works."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-120",
"text": "According to (Ortega and Vu, 2017) , prior work by (Ji and Bilmes, 2006) achieved promising results on the MRDA dataset, but since the evaluation was conducted on a different data split, it is hard to compare them directly."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-121",
"text": "For both SwDA and MRDA datasets, our SGNNs obtains the best result of 83.1 and 86.7 accuracy outperforming prior state-of-the-art work."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-122",
"text": "This is very impressive given that we work with very small memory footprint and we do not rely on pre-trained word embeddings."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-123",
"text": "Our study also shows that the proposed method is very effective for such natural language tasks compared to more complex neural network architectures such as deep CNN (Lee and Dernoncourt, 2016) and RNN variants (Khanpour et al., 2016; Ortega and Vu, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-124",
"text": "We believe that the compression techniques like locality sensitive projections jointly coupled with non-linear functions are effective at capturing lowdimensional semantic text representations that are useful for text classification applications."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-125",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-126",
"text": "**DISCUSSION ON MODEL SIZE AND INFERENCE**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-127",
"text": "LSTMs have millions of parameters, while our on-device architecture has just 300K parameters (order of magnitude lower)."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-128",
"text": "Most deep learning methods also use large vocabulary size of 10K or higher."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-129",
"text": "Each word embedding is represented as 100-dimensional vector leading to a storage requirement of 10, 000 \u00d7 100 parameter weights just in the first layer of the deep network."
},
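The arithmetic in this comparison can be checked directly (a sketch; T=80 and d=14 are the hyperparameter values reported in the training setup, and 10K vocabulary with 100-dimensional embeddings are the figures quoted above):

```python
# Storage comparison: embedding-table first layer vs. fixed projection input.
vocab, emb_dim = 10_000, 100
embedding_params = vocab * emb_dim   # 1,000,000 weights, grows with vocabulary
T, d = 80, 14
projection_dim = T * d               # 1120 bits, independent of vocabulary size
print(embedding_params, projection_dim)
```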
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-130",
"text": "In contrast, SGNNs in all our experiments use a fixed 1120-dimensional vector regardless of the vocabulary or feature size, dynamic computation results"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-131",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-132",
"text": "**METHOD**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-133",
"text": "Acc."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-134",
"text": "Majority Class (baseline) (Ortega and Vu, 2017) 33.7 Naive Bayes (baseline) (Khanpour et al., 2016) 47.3 HMM (Stolcke et al., 2000) 71.0 DRLM-conditional training (Ji and Bilmes, 2006) 77.0 DRLM-joint training (Ji and Bilmes, 2006) 74.0 LSTM (Lee and Dernoncourt, 2016) 69.9 CNN (Lee and Dernoncourt, 2016) 73.1 Gated-Attention&HMM (Tran et al., 2017) 74.2 RNN+Attention (Ortega and Vu, 2017) 73.8 RNN (Khanpour et al., 2016) 80.1 SGNN: Self-Governing Neural Network (ours) 83.1 (Ortega and Vu, 2017) 59.1 Naive Bayes (baseline) (Khanpour et al., 2016) 74.6 Graphical Model (Ji and Bilmes, 2006) 81.3 CNN (Lee and Dernoncourt, 2016) 84.6 RNN+Attention (Ortega and Vu, 2017) 84.3 RNN (Khanpour et al., 2016) 86.8 SGNN: Self-Governing Neural Network (ours) 86.7 Table 3 : MRDA Dataset Results in further speed up for high-dimensional feature spaces."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-135",
"text": "This amounts to a huge savings in storage and computation cost wrt FLOPs (floating point operations per second)."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-136",
"text": "----------------------------------"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-137",
"text": "**CONCLUSION**"
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-138",
"text": "We proposed Self-Governing Neural Networks for on-device short text classification."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-139",
"text": "Experiments on multiple dialog act datasets showed that our model outperforms state-of-the-art deep leaning methods (Lee and Dernoncourt, 2016; Khanpour et al., 2016; Ortega and Vu, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-140",
"text": "We introduced a compression technique that effectively captures low-dimensional semantic representation and produces compact models that significantly save on storage and computational cost."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-141",
"text": "Our approach does not rely on pre-trained embeddings and efficiently computes the projection vectors on the fly."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-142",
"text": "In the future, we are interested in extending this approach to more natural language tasks."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-143",
"text": "For instance, we built a multilingual SGNN model for customer feedback classification (Liu et al., 2017) and obtained 73% on Japanese, close to best performing system on the challenge (Plank, 2017) ."
},
{
"sent_id": "473cf4603dea14ff89ca12d6e0cb50-C001-144",
"text": "Unlike their method, we did not use any pre-processing, tagging, parsing, pre-trained embeddings or other resources."
}
],
"y": {
"@MOT@": {
"gold_contexts": [
[
"473cf4603dea14ff89ca12d6e0cb50-C001-22"
]
],
"cite_sentences": [
"473cf4603dea14ff89ca12d6e0cb50-C001-22"
]
},
"@USE@": {
"gold_contexts": [
[
"473cf4603dea14ff89ca12d6e0cb50-C001-94"
]
],
"cite_sentences": [
"473cf4603dea14ff89ca12d6e0cb50-C001-94"
]
},
"@BACK@": {
"gold_contexts": [
[
"473cf4603dea14ff89ca12d6e0cb50-C001-120"
]
],
"cite_sentences": [
"473cf4603dea14ff89ca12d6e0cb50-C001-120"
]
},
"@DIF@": {
"gold_contexts": [
[
"473cf4603dea14ff89ca12d6e0cb50-C001-123"
],
[
"473cf4603dea14ff89ca12d6e0cb50-C001-139"
]
],
"cite_sentences": [
"473cf4603dea14ff89ca12d6e0cb50-C001-123",
"473cf4603dea14ff89ca12d6e0cb50-C001-139"
]
}
}
},
"ABC_d9567072d2df6c0010b32e1d1eb676_10": {
"x": [
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-2",
"text": "Generative Adversarial Networks (GANs) are a promising approach for text generation that, unlike traditional language models (LM), does not suffer from the problem of \"exposure bias\"."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-3",
"text": "However, A major hurdle for understanding the potential of GANs for text generation is the lack of a clear evaluation metric."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-4",
"text": "In this work, we propose to approximate the distribution of text generated by a GAN, which permits evaluating them with traditional probability-based LM metrics."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-5",
"text": "We apply our approximation procedure on several GAN-based models and show that they currently perform substantially worse than stateof-the-art LMs."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-6",
"text": "Our evaluation procedure promotes better understanding of the relation between GANs and LMs, and can accelerate progress in GAN-based text generation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-7",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-8",
"text": "**INTRODUCTION**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-9",
"text": "Neural networks have revolutionized the field of text generation, in machine translation (Sutskever et al., 2014; Neubig, 2017; Luong et al., 2015; Chen et al., 2018) , summarization (See et al., 2017) , image captioning (You et al., 2016) and many other applications (Goldberg, 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-10",
"text": "Traditionally, text generation models are trained by going over a gold sequence of symbols (characters or words) from left-to-right, and maximizing the probability of the next symbol given the history, namely, a language modeling (LM) objective."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-11",
"text": "A commonly discussed drawback of such LM-based text generation is exposure bias (Ranzato et al., 2015) : during training, the model predicts the next token conditioned on the ground truth history, while at test time prediction is based on predicted tokens, causing a train-test mismatch."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-12",
"text": "Models trained in this manner often struggle to overcome previous prediction errors."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-13",
"text": "Generative Adversarial Networks (Goodfellow et al., 2014) offer a solution for exposure bias."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-14",
"text": "* The authors contributed equally Originally introduced for images, GANs leverage a discriminator, which is trained to discriminate between real images and generated images via an adversarial loss."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-15",
"text": "In such a framework, the generator is not directly exposed to the ground truth data, but instead learns to imitate it using global feedback from the discriminator."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-16",
"text": "This has led to several attempts to use GANs for text generation, with a generator using either a recurrent neural network (RNN) Guo et al., 2017; Press et al., 2017; Rajeswar et al., 2017) , or a Convolutional Neural Network (CNN) (Gulrajani et al., 2017; Rajeswar et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-17",
"text": "However, evaluating GANs is more difficult than evaluating LMs."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-18",
"text": "While in language modeling, evaluation is based on the log-probability of a model on held-out text, this cannot be straightforwardly extended to GAN-based text generation, because the generator outputs discrete tokens, rather than a probability distribution."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-19",
"text": "Currently, there is no single evaluation metric for GAN-based text generation, and existing metrics that are based on n-gram overlap are known to lack robustness and have low correlation with semantic coherence (Semeniuta et al., 2018) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-20",
"text": "In this paper, we propose a method for evaluating GANs with standard probability-based evaluation metrics."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-21",
"text": "We show that the expected prediction of a GAN generator can be viewed as a LM, and suggest a simple Monte-Carlo method for approximating it."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-22",
"text": "The approximated probability distribution can then be evaluated with standard LM metrics such as perplexity or Bits Per Character (BPC)."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-23",
"text": "To empirically establish our claim, we implement our evaluation on several RNN-based GANs: (Press et al., 2017; Guo et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-24",
"text": "We find that all models have substantially lower BPC compared to state-of-the-art LMs."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-25",
"text": "By directly comparing to LMs, we put in perspective the current performance of RNN-based GANs for text generation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-26",
"text": "Our results are also in line with recent concurrent work by Caccia et al. (2018) , who reached a similar conclusion by comparing the performance of textual GANs to that of LMs using metrics suggested for GAN evaluation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-27",
"text": "Our code is available at: http: //github.com/GuyTevet/SeqGAN-eval and http://github.com/GuyTevet/ rnn-gan-eval."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-29",
"text": "**BACKGROUND**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-30",
"text": "Following the success of GANs in image generation, several works applied the same idea to texts using convolutional neural networks (Gulrajani et al., 2017; Rajeswar et al., 2017) , and later using RNNs (Press et al., 2017; ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-31",
"text": "RNNs enable generating variable-length sequences, conditioning each token on the tokens generated in previous time steps."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-32",
"text": "We leverage this characteristic in our approximation model ( \u00a74.1)."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-33",
"text": "A main challenge in applying GANs for text is that generating discrete symbols is a nondifferentiable operation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-34",
"text": "One solution is to perform a continuous relaxation of the GAN output, which leads to generators that emit a nearly discrete continuous distribution (Press et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-35",
"text": "This keeps the model differentiable and enables end-to-end training through the discriminator."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-36",
"text": "Alternatively, SeqGAN and Leak-GAN (Guo et al., 2017) used policy gradient methods to overcome the differentiablity requirement."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-37",
"text": "We apply our approximation to both model types."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-38",
"text": "3 Evaluating GANs and LMs LM Evaluation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-39",
"text": "Text generation from LMs is commonly evaluated using probabilistic metrics."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-40",
"text": "Specifically, given a test sequence of symbols (t 1 , . . . , t n ), and a LM q, the average crossentropy over the entire test set is computed:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-41",
"text": "For word-based models, the standard metric is perplexity: P P = 2 ACE , while for character-based models it is BP C = ACE directly."
},
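A minimal sketch of these metrics, assuming base-2 logs as implied by PP = 2^ACE:

```python
import math

def ace(probs):
    """Average cross-entropy (in bits) over the gold-token probabilities
    q(t_i | t_1..t_{i-1}) that a model assigns along a test sequence."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):   # word-level metric: PP = 2^ACE
    return 2 ** ace(probs)

def bpc(probs):          # character-level metric: BPC = ACE directly
    return ace(probs)

# A model assigning probability 0.25 to every gold token:
# ACE = 2 bits, so PP = 4.
```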
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-42",
"text": "Intrinsic improvement in perplexity does not guarantee an improvement in an extrinsic downstream task that uses a language model."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-43",
"text": "However, perplexity often correlates with extrinsic measures (Jurafsky and Martin, 2018) , and is the de-facto metric for evaluating the quality of language models today."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-44",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-45",
"text": "**GAN-BASED TEXT GENERATION EVALUATION.**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-46",
"text": "By definition, a text GAN outputs a discrete sequence of symbols rather than a probability distribution."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-47",
"text": "As a result, LM metrics cannot be applied to evaluate the generated text."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-48",
"text": "Consequently, other metrics have been proposed:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-49",
"text": "\u2022 N-gram overlap: Press et al., 2017) : Inspired by BLEU (Papineni et al., 2002) , this measures whether n-grams generated by the model appear in a held-out corpus."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-50",
"text": "A major drawback is that this metric favors conservative models that always generate very common text (e.g., \"it is\")."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-51",
"text": "To mitigate this, self-BLEU has been proposed (Lu et al., 2018) as an additional metric, where overlap is measured between two independently sampled texts from the model."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-52",
"text": "\u2022 LM score: The probability of generated text according to a pre-trained LM."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-53",
"text": "This has the same problem of favoring conservative models."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-54",
"text": "\u2022 Zhao et al. (2017) suggested an indirect score by training a LM on GAN-generated text, and evaluating it using perplexity."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-55",
"text": "The drawback in this setting is the coupling of the performance of the GAN with that of the proxy LM."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-56",
"text": "\u2022 Heusel et al. (2017) used Frechet InferSent Distance (FID) to compute the distance between distributions of features extracted from real and generated samples."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-57",
"text": "However, this approach relies on a problematic assumption that features are normally distributed."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-58",
"text": "\u2022 Rajeswar et al. (2017) used a context-free grammar (CFG) to generate a reference corpus, and evaluated the model by the likelihood the CFG assigns to generated samples."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-59",
"text": "However, simple CFGs do not fully capture the complexity of natural language."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-60",
"text": "\u2022 To overcome the drawbacks of each individual method, Semeniuta et al. (2018) proposed a unified measure based on multiple evaluation metrics (N-grams, BLEU variations, FID, LM score variations and human evaluation)."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-61",
"text": "Specifically, they argue that the different measures capture different desired properties of LMs, e.g., quality vs. diversity."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-62",
"text": "\u2022 Following Semeniuta et al. (2018) , and in parallel to this work, Caccia et al. (2018) proposed a temperature sweep method that trades-off quality for diversity using a single parameter."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-63",
"text": "Similar to our findings, they concluded that GANs perform worse than LMs on this metric."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-64",
"text": "Figure 1: Generator recurrent connections. {ht} is the internal state sequence and {ot} is the generator prediction sequence (one-hot)."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-65",
"text": "During inference, the outputs {ot} are fed back as the input for the next time step (dashed lines)."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-66",
"text": "During LM approximation, the input {xt} is a sequence of one-hot vectors from the test set."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-67",
"text": "Overall, current evaluation methods cannot fully capture the performance of GAN-based text generation models."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-68",
"text": "While reporting various scores as proposed by Semeniuta et al. (2018) is possible, it is preferable to have a single measure of progress when comparing different text generation models."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-69",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-70",
"text": "**PROPOSED METHOD**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-71",
"text": "We propose a method for approximating a distribution over tokens from a GAN, and then evaluate the model with standard LM metrics."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-72",
"text": "We will describe our approach given an RNN-based LM, which is the most commonly-used architecture, but the approximation can be applied to other auto-regressive models (Vaswani et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-74",
"text": "**LANGUAGE MODEL APPROXIMATION**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-75",
"text": "The inputs to an RNN at time step t, are the state vector h t and the current input token x t ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-76",
"text": "The output token (one-hot) is denoted by o t ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-77",
"text": "In RNNbased GANs, the previous output token is used at inference time as the input x t Guo et al., 2017; Press et al., 2017; Rajeswar et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-78",
"text": "In contrast, when evaluating with BPC or perplexity, the gold token x t is given as input."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-79",
"text": "Hence, LM-based evaluation neutralizes the problem of exposure bias addressed by GANs."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-80",
"text": "Nevertheless, this allows us to compare the quality of text produced by GANs and LMs on an equal footing."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-81",
"text": "Figure 1 illustrates the difference between inference time and during LM approximation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-82",
"text": "We can therefore define the generator function at time step t as a function of the initial state h 0 and the past generated tokens (x 0 . . ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-83",
"text": "x t ), which we denote as o t = G t (h 0 , x 0 ...x t ) (x 0 is a start token)."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-84",
"text": "Given a past sequence (x 0 . . ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-85",
"text": "x t ), G t is a stochastic function: the stochasticity of G t can"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-86",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-87",
"text": "**ALGORITHM 1 LM EVALUATION OF RNN-BASED GANS**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-88",
"text": "Input: Gt(\u00b7): the generator function at time step t (x0, ..., xt): previous gold tokens xt+1: the gold next token (as ground truth) f (\u00b7, \u00b7): a LM evaluation metric N : number of samples 1: for n \u2190 1 to N do 2:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-89",
"text": "gt,n \u2190\u2212 sample from Gt(x0...xt) 3:Gt,N = 1 N \u03a3 N n=1 gt,n 4: return f (Gt,N , xt+1) be gained either by using a noise vector as the initial state h 0 (Press et al., 2017) , or by sampling from the GAN's internal distribution over possible output tokens Guo et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-90",
"text": "Since h 0 is constant or a noise vector that makes G t stochastic, we can omit it to get G t (x 0 . . ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-91",
"text": "x t )."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-92",
"text": "In such a setup, the expected value E[G t (x 0 . . ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-93",
"text": "x t )] is a distribution q over the next vocabulary token a t :"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-94",
"text": "To empirically approximate q, we can sample from it N i.i.d samples, and compute an approximationG t,N = 1 N \u03a3 N n=1 g t,n , where g t,n is one sample from G t (x 0 ...x t )."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-95",
"text": "Then, according to the strong law of large numbers:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-96",
"text": "Given this approximate LM distribution, we can evaluate a GAN using perplexity or BPC."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-97",
"text": "We summarize the evaluation procedure in Algorithm 1. 1"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-99",
"text": "**APPROXIMATION BOUND**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-100",
"text": "We provide a theoretical bound for choosing a number of samples N that results in a good approximation ofG t,N to E[G t ]."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-101",
"text": "Perplexity and BPC rely on the log-probability of the ground truth token."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-102",
"text": "Since the ground truth token is unknown, we conservatively define the bad event B in which there exists v \u2208 V such that |{E[G t ]} v \u2212 {G t,N } v | > \u03b3, where V is the vocabulary."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-103",
"text": "We can then bound the probability of B by some ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-104",
"text": "We define the following notations:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-105",
"text": "1. The probability of a token a t to be v is p v"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-106",
"text": "2. \u03c7 v,n \u2206 = {g t,n } v is a random variable representing the binary value of the v'th index of g t,n which is a single sample of G t ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-107",
"text": "Note that the average of \u03c7 v,n over N samples is"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-108",
"text": "Using the above notation, we can re-define the probability of the bad event B with respect to the individual coordinates in the vectors:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-109",
"text": "We note that \u03c7 v,n \u223c Bernoulli(p v ), and given that {\u03c7 v,n } N n=1 are i.i.d., we can apply the Chernoff-Hoeffding theorem (Chernoff et al., 1952; Hoeffding, 1963) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-110",
"text": "According to the theorem, for every v \u2208 V , P r(|X v \u2212 p v | > \u03b3) < 2e \u22122N \u03b3 2 ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-111",
"text": "Taking the union bound over V implies:"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-112",
"text": "Hence, we get a lower bound on N :"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-113",
"text": "As a numerical example, choosing \u03b3 = 10 \u22123 and = 10 \u22122 , for a character-based LM over the text8 dataset, with |V | = 27, we get the bound: N > 4.3 \u00b7 10 6 ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-114",
"text": "With the same \u03b3 and , a typical word-based LM with vocabulary size |V | = 50, 000 would require N > 8.1 \u00b7 10 6 ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-115",
"text": "In practice, probability vectors of LMs tend to be sparse (Kim et al., 2016) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-116",
"text": "Thus, we argue that we can use a much smaller N for a good approximationG t,N ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-117",
"text": "Since the sparsity of LMs is difficult to bound, as it differs between models, we suggest an empirical method for choosing N ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-118",
"text": "The approximationG t,N is a converging sequence, particularly over \u00b7 \u221e (see Equation 1 )."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-119",
"text": "Hence, we can empirically choose an N which satisfies G t,N \u2212\u03b1 \u2212G t,N \u221e < \u03b3 , \u03b1 \u2208 N. In Section 5 we empirically measure G t,N \u2212\u03b1 \u2212G t,N \u221e as a function of N to choose N ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-120",
"text": "We choose a global N for a model, rather than for every t, by averaging over a subset of the evaluation set."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-121",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-122",
"text": "**EVALUATION**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-123",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-124",
"text": "**MODELS**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-125",
"text": "We focus on character-based GANs as a test-case for our method."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-126",
"text": "We evaluate two RNN-based GANs with different characteristics."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-127",
"text": "As opposed to the original GAN model (Goodfellow et al., 2014) , in which the generator is initialized with random noise, the GANs we evaluated both leverage gold standard text to initialize the generator, as detailed below."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-128",
"text": "Recurrent GAN (Press et al., 2017 ) is a continuous RNN-based generator which minimizes the improved WGAN loss (Gulrajani et al., 2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-129",
"text": "To guide the generator, during training it is initialized with the first i \u2212 1 characters from the ground truth, starting the prediction in the ith character."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-130",
"text": "Stochasticity is obtained by feeding the generator with a noise vector z as a hidden state."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-131",
"text": "At each time step, the input to the RNN generator is the output distribution of the previous step."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-132",
"text": "SeqGAN ) is a discrete RNNbased generator."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-133",
"text": "To guide the generator, it is pretrained as a LM on ground truth text."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-134",
"text": "Stochasticity is obtained by sampling tokens from an internal distribution function over the vocabulary."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-135",
"text": "To overcome differentiation problem, it is trained using a policy gradient objective (Sutton et al., 2000) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-136",
"text": "We also evaluated LeakGAN (Guo et al., 2017) , another discrete RNN-based generator, but since it is similar to SeqGAN and performed worse, we omit it for brevity."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-138",
"text": "**EVALUATION SETTINGS**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-139",
"text": "To compare to prior work in LM, we follow the common setup and train on the text8 dataset."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-140",
"text": "2 The dataset is derived from Wikipedia, and includes 26 English characters plus spaces."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-141",
"text": "We use the standard 90/5/5 split to train/validation/test."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-142",
"text": "Finally, we measure performance with BPC."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-143",
"text": "We tuned hyper-parameters on the validation set, including sequence length to generate at test time (7 for Press et al. (2017) , 1000 for )."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-144",
"text": "We chose the number of samples N empirically for each model, as described in Section 4.2."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-145",
"text": "We set \u03b1 to 10, and the boundary to \u03b3 = 10 \u22123 as a good trade-off between accuracy and run-time."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-146",
"text": "Figure 2 plots the approximate error G t,N \u2212\u03b1 \u2212G t,N \u221e as a function of N ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-147",
"text": "For both models, N > 1600 satisfies this condition (red line in Figure 2 )."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-148",
"text": "To be safe, we used N = 2000."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-149",
"text": "(Krause et al., 2016) 1.27 Large RHN (Zilly et al., 2016) 1.27 LayerNorm HM-LSTM (Chung et al., 2016) 1.29 BN LSTM (Cooijmans et al., 2016) 1.36 Unregularised mLSTM (Krause et al., 2016) 1.40"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-150",
"text": "----------------------------------"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-151",
"text": "**RESULTS**"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-152",
"text": "SeqGAN -pre-trained LM 1.85 1.95"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-153",
"text": "GANs (LM Approximation) SeqGAN -full adversarial training 1.99 2.08 Recurrent GAN without pre-training (Press et al., 2017) 3.31"
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-154",
"text": "Uniform Distribution 4.75 1. four zero five two memaire in afulie war formally dream the living of the centuries to quickly can f 2."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-155",
"text": "part of the pract the name in one nine seven were mustring of the airports tex works to eroses exten 3."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-156",
"text": "eight four th jania lpa ore nine zero zero zero sport for tail concents englished a possible for po Recurrent GAN 1."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-157",
"text": "nteractice computer may became were the generally treat he were computer may became were the general 2. lnannnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnne and and and and and and and and and and and and and and and a 3. perors as as seases as as as as as as as as as selected see see see see see see see see see see see Table 2 : Random samples of 100 characters generated by each model."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-158",
"text": "Because SeqGAN models output a distribution over tokens at every time step, we can measure the true BPC and assess the quality of our approximation."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-159",
"text": "Indeed, we observe that approximate BPC is only slightly higher than the true BPC."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-160",
"text": "GAN-based models perform worse than stateof-the-art LMs by a large margin."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-161",
"text": "Moreover, in SeqGAN, the pre-trained LM performs better than the fully trained model with approximate BPC scores of 1.95 and 2.06, respectively, and the BPC deteriorates as adversarial training continues."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-162",
"text": "Finally, we note that generating sequences larger than 7 characters hurts the BPC of Press et al. (2017) ."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-163",
"text": "It is difficult to assess the quality of generation with such short sequences."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-164",
"text": "In Table 2 we present a few randomly generated samples from each model."
},
{
"sent_id": "d9567072d2df6c0010b32e1d1eb676-C001-165",
"text": "We indeed observe that adversarial training slightly reduces the quality of generated text for SeqGAN, and find that the quality of 100-character long sequences generated from Press et al. (2017) is low."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"d9567072d2df6c0010b32e1d1eb676-C001-16"
],
[
"d9567072d2df6c0010b32e1d1eb676-C001-34"
],
[
"d9567072d2df6c0010b32e1d1eb676-C001-49"
],
[
"d9567072d2df6c0010b32e1d1eb676-C001-77"
]
],
"cite_sentences": [
"d9567072d2df6c0010b32e1d1eb676-C001-16",
"d9567072d2df6c0010b32e1d1eb676-C001-34",
"d9567072d2df6c0010b32e1d1eb676-C001-49",
"d9567072d2df6c0010b32e1d1eb676-C001-77"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"d9567072d2df6c0010b32e1d1eb676-C001-143"
],
[
"d9567072d2df6c0010b32e1d1eb676-C001-162"
],
[
"d9567072d2df6c0010b32e1d1eb676-C001-165"
]
],
"cite_sentences": [
"d9567072d2df6c0010b32e1d1eb676-C001-143",
"d9567072d2df6c0010b32e1d1eb676-C001-162",
"d9567072d2df6c0010b32e1d1eb676-C001-165"
]
}
}
},
"ABC_c182062efc486f83eb27f9a3859a9a_10": {
"x": [
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-78",
"text": "For the sake of consistency, we again opt for the Scikit-learn implementation."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-79",
"text": "7 We compare the results of our setup to the results of the original experiment."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-80",
"text": "In addition, we also compare evaluations of a system trained on various other features (which we will describe in Section 3.) extracted from the tweets and their metadata."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-2",
"text": "As research on hate speech becomes more and more relevant every day, most of it is still focused on hate speech detection."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-3",
"text": "By attempting to replicate a hate speech detection experiment performed on an existing Twitter corpus annotated for hate speech, we highlight some issues that arise from doing research in the field of hate speech, which is essentially still in its infancy."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-4",
"text": "We take a critical look at the training corpus in order to understand its biases, while also using it to venture beyond hate speech detection and investigate whether it can be used to shed light on other facets of research, such as popularity of hate tweets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-5",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-81",
"text": "The results are presented in Table 2 ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-6",
"text": "**INTRODUCTION**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-7",
"text": "The Internet, likely one of humanity's greatest inventions, facilitates the sharing of ideas and knowledge, as well as online discussion and user interaction."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-8",
"text": "All these are positive features but, as with any tool, whether they are used in a positive or negative manner depends largely on the people that use them."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-9",
"text": "Consequently, and especially when user anonymity is added to the mix, online discussion environments can become abusive, hateful and toxic."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-10",
"text": "To help identify, study, and ultimately curb this problem, such negative environments and the language used within are being studied under the name hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-11",
"text": "Research on hate speech has become quite prominent in recent years, with dedicated workshops and conferences, 1 and even being featured on LREC2018's list of hot topics."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-12",
"text": "However, hate speech research is still in its infancy."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-13",
"text": "In part, this is due to the following challenges:"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-14",
"text": "1. The term hate speech is difficult to define."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-15",
"text": "Silva et al. (2016) say that \"hate speech lies in a complex nexus with freedom of expression, group rights, as well as concepts of dignity, liberty, and equality."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-16",
"text": "For this reason, any objective definition (i.e., that can be easily implemented in a computer program) can be contested."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-17",
"text": "\" Generally, the current consensus among researchers seems to be that hate speech can be seen as a phenomenon encompassing issues such as: personal attacks, attacks on a specific group or minority, and abusive language targeting specific group characteristics (e.g., ethnicity, religion, gender, sexual orientation)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-18",
"text": "2. Creating resources for studying hate speech is far from trivial."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-19",
"text": "Hate speech comprises a very small fraction of online content, and on most social platforms it is heavily moderated."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-20",
"text": "For example, Nobata et al. (2016) report that in their corpus of comments on Yahoo! articles collected between April 2014 and April 2015, the percentage of abusive comments is around 3.4% on Finance articles and 10.7% on News."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-21",
"text": "Since the phenomenon is elusive, researchers often use lists of offensive terms to collect datasets with the aim to increase the likelihood of catching instances of hate speech Waseem and Hovy, 2016) ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-22",
"text": "This filtering process, however, has the risk of producing corpora with a variety of biases, which may go undetected."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-23",
"text": "3. Finally, hate speech is present in user-generated content that is not under the control of the researcher."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-24",
"text": "Social media data is typically collected by public APIs that may lead to inconsistent results."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-25",
"text": "For example, Gonz\u00e1lez-Bail\u00f3n et al. (2014) find that the Twitter Search API yields a smaller dataset than the Stream API when using the same filtering parameters."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-26",
"text": "Furthermore, users might delete their profiles or moderate their own questionable content themselves."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-82",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-27",
"text": "Thus, datasets on which research experiments are performed are ephemeral, which makes replication of results very difficult."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-28",
"text": "In this paper, we focus on the latter two points."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-29",
"text": "We consider a particular hate speech corpus -a Twitter corpus collected by Waseem and Hovy (2016) , which has been gaining traction as a resource for training hate speech detection models (Waseem and Hovy, 2016; Gamb\u00e4ck and Utpal, 2017; Park and Fung, 2017) -and analyse it critically to better understand its usefulness as a hate speech resource."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-30",
"text": "In particular, we make the following contributions:"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-31",
"text": "\u2022 We report the outcome of a reproduction experiment, where we attempt to replicate the results by Waseem and Hovy (2016) on hate speech detection using their Twitter corpus."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-32",
"text": "\u2022 We use the corpus to study a novel aspect related to hate speech: the popularity of tweets containing hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-33",
"text": "To this end, we develop models for the task of predicting whether a hate tweet will be interacted with and perform detailed feature analyses."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-34",
"text": "\u2022 We perform a quantitative and qualitative analysis of the corpus to analyse its possible biases and assess the generality of the results obtained for the hate speech detection and popularity tasks."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-36",
"text": "**REPLICATION: HATE SPEECH DETECTION RESULTS**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-37",
"text": "We aim to replicate the results on hate speech detection by Waseem and Hovy (2016) using the hate speech Twitter corpus created by the authors."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-38",
"text": "2 The dataset is a useful resource as it is one of few freely available corpora for hate speech research; it is manually annotated and distinguishes between two types of hate speech -sexism and racismwhich allows for more nuanced insight and analysis."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-39",
"text": "Additionally, as a Twitter corpus, it provides opportunity for any type of analysis and feature examination typical for Twitter corpora, such as user and tweet metadata, user interaction, etc."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-40",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-41",
"text": "**CORPUS IN NUMBERS**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-42",
"text": "Here we provide just a brief quantitative overview of the corpus, whereas a more detailed qualitative analysis is presented in Section 4.. The original dataset contains 16907 annotated tweets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-43",
"text": "However, as is common practice with Twitter corpora, the corpus was only made available as a set of annotated tweet IDs, rather than the tweets themselves."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-44",
"text": "To obtain the actual tweets and corresponding metadata, we used the Tweepy Twitter API wrapper."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-45",
"text": "3 Given that the corpus was initially collected and annotated in 2016, there have been some changes in the availability of tweets by the time we extracted in in May 2017."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-46",
"text": "Table 1 presents the distribution of annotations in the corpus in its original version and the version that was used for this paper."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-47",
"text": "A tweet in the corpus can have three labels (None, Racism, Sexism)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-48",
"text": "It is possible that a tweet has multiple labels, in the case that it contains both racism and sexism (this only happens in 8 tweets in the original dataset, so it is not a widespread phenomenon in this corpus.)"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-49",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-50",
"text": "**TAG**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-51",
"text": "Original The dataset is quite unbalanced, but this is reflective of the unbalanced distribution of hate speech 'in the wild', and speaks to why it is so difficult to do research on hate speech in the first place: it is an elusive phenomenon."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-52",
"text": "This, combined with the fact that users might delete their profiles or moderate their own questionable content themselves, makes available data scarce, and makes every Twitter corpus smaller over time, and consequently, less valuable and more prone to mistakes when attempting a replicative study."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-54",
"text": "**EXPERIMENTAL SETUP**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-55",
"text": "As with any replication study, our aim here is to mimic the original experimental setup as closely as possible, in hopes of obtaining same or comparable results."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-56",
"text": "Unfortunately, this effort is already potentially hindered by the fact that the Twitter corpus has shrunk over time."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-57",
"text": "However, the difference is not too large, and we expect it not to have a significant impact on the results."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-58",
"text": "A much more prominent obstacle is the lack of certain implementation details in the original paper that make reproduction difficult."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-59",
"text": "At several points in the pipeline, we were left to our own devices and resort to making educated guesses as to what may have been done, due to the lack of comprehensive documentation."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-60",
"text": "More specifically, there are two important aspects of the pipeline that present us with this problem: the algorithm and the features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-61",
"text": "The algorithm."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-62",
"text": "Waseem and Hovy (2016) state that they use a logistic regression classifier for their hate speech prediction task. What is not mentioned is which implementation of the algorithm is used, how the model was fit to the data, whether the features were scaled, and whether any other additional parameters had been used."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-63",
"text": "Due to its popularity and accessibility, we opt for the Scikitlearn (Pedregosa et al., 2011) Python implementation of the logistic regression algorithm."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-64",
"text": "4 In addition, after fitting the model, we do not do additional scaling of the features when working with just n-grams (as these are already scaled when extracted), but we do scale our other features using the scaling function."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-65",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-66",
"text": "**5**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-67",
"text": "The features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-68",
"text": "Waseem and Hovy (2016) explore several feature types: they employ n-gram features -specifically, they find that character n-grams of lengths up to 4 perform best -and in addition, they combine them with gender information, geographic location information and tweet length, finding that combining n-gram features with gender features yields slightly better results than just n-gram features do, while mixing in any of the other features results in slightly lower scores."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-69",
"text": "As a rule of thumb, we would attempt to replicate the best performing setup (character n-grams in combination with gender)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-70",
"text": "However, this proved to be difficult, as user gender information is not provided by Twitter (hence it cannot be scraped from the Twitter API) and has not been made available by the authors along with their dataset."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-71",
"text": "However, they do describe how they went about procuring the gender information for themselves (by performing semi-automatic, heuristics-based annotation), but only managed to annotate about 52% of the users."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-72",
"text": "This, in combination with the fact that in the original experiment the F1 score improvement when gender is considered is minor (0.04 points) and not statistically significant, led us to focus our efforts on replicating only the experiments involving n-gram features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-73",
"text": "However, extracting the n-gram features is also shown to be a nontrivial task, as the original paper does not state how the features are encoded: whether it is using a bag-of-ngrams approach, a frequency count approach, or a TF-IDF measure for each n-gram."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-74",
"text": "We opt for TF-IDF because it is most informative, and just as easy to implement as the more basic approaches."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-75",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-76",
"text": "**EVALUATION AND RESULTS**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-77",
"text": "The original paper states the use of 10-fold cross-validation for model evaluation purposes, without specifying a particular implementation."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-83",
"text": "**FEATURES**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-84",
"text": "Original Table 2 : Average evaluation scores on the hate speech detection task."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-85",
"text": "The original study only provided an F1 score metric for the logistic regression classifier trained on character n-grams (second column)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-86",
"text": "We replicate this experiment (third column), and also train a logistic regression classifier on the same task (fourth column), but on a different set of features (detailed in Section 3.)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-87",
"text": "Examining the table reveals that our best attempt at replicating the original experiment, with logistic regression trained only on character n-grams, yields an F1-score that is 0.03 points lower than the original."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-88",
"text": "Such a drop is to be expected, considering that our version of the dataset was smaller and that we had to fill in some gaps in the procedure ourselves, likely resulting in slight procedural mismatches."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-89",
"text": "However, the drop is not large, and might indicate a stable, consistent result."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-90",
"text": "When looking at the performance of classifiers trained on features extracted from tweets and their metadata, they significantly underperform, with a 6 point drop compared to our replicated experiment, and a 9 point drop compared to the original results."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-91",
"text": "This adds a strong confirmation of an observation made in the original study, namely that n-gram features are the most predictive compared to any other types of features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-92",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-93",
"text": "**NEW EXPERIMENT: POPULARITY PREDICTION**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-94",
"text": "To date, most research on hate speech within the NLP community has focused on automatic detection using a variety of techniques, from lists of prominent keywords (Warner and Hirschberg, 2012) to regression classifiers as seen in the previous section (Nobata et al., 2016; Waseem and Hovy, 2016), naive Bayes, decision trees, random forests, and linear SVMs, as well as deep learning models with convolutional neural networks (Gamb\u00e4ck and Utpal, 2017; Park and Fung, 2017)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-95",
"text": "Our intent in this section is to explore hate speech beyond just detection, using the Twitter corpus by Waseem and Hovy (2016) ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-96",
"text": "Given that Twitter is a platform that enables sharing ideas, and given that extreme ideas have a tendency to intensely spread through social networks (Brady et al., 2017) , our question is: how does the fact that a tweet is a hate tweet affect its popularity?"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-97",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-98",
"text": "**RELATED WORK**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-99",
"text": "To our knowledge, there has not been any work relating tweet popularity to hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-100",
"text": "However, there is a significant body of work dealing with tweet popularity modeling and prediction."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-101",
"text": "Many papers explore features that lead to retweeting."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-102",
"text": "Suh et al. (2010) perform an extensive analysis of features that affect retweetability, singling out two groups of features: content and contextual features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-103",
"text": "Similarly, Zhang et al. (2012) train a model to predict the number of retweets using two types of features: user features and tweet features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-104",
"text": "They also compute information gain scores for their features and build a feature-weighted model."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-105",
"text": "They compare the performance of two algorithms, logistic regression and SVM, and find that SVM works better, yielding an F-score of 0.71."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-106",
"text": "In addition, some of the related work also relies on temporal features: Zaman et al. (2013) predict the total number of retweets a given amount of time after posting, using a Bayesian model based on features of early retweet times and follower graphs."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-107",
"text": "Similarly, Hong et al. (2011) predict the number of retweets, using binary and multi-class classifiers."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-108",
"text": "They use a more varied set of features, and aside from temporal features, they use content, topical and graph features, as well as user metadata."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-109",
"text": "We do not have temporal data at our disposal, nor are we at this stage interested in predicting the exact number of retweets at any given point."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-110",
"text": "We are more concerned with investigating how hate speech comes into play regarding tweet popularity, if at all."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-111",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-112",
"text": "**POPULARITY ANALYSIS**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-113",
"text": "As surveyed above, most of the related work on tweet popularity focuses solely on retweets as indicators of popularity."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-114",
"text": "However, while this is probably the clearest indicator, users can interact with tweets in a number of other ways."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-115",
"text": "For this reason, in the present work we also consider other potential measures of popularity; namely, number of tweet replies and number of 'likes' (formerly called 'favorites')."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-116",
"text": "The numbers of likes and retweets in the corpus vary, but their distributions are highly skewed, with most of the tweets being liked/retweeted 0 times."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-117",
"text": "The distributions are displayed in Tables 3 and 4 ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-118",
"text": "Given these distributions, we opt for framing the problem as a binary classification task: we wish to determine whether a tweet receives a reaction (retweet, like, response) at least once, or not at all. But before we go into prediction, we wish to investigate whether there is a significant difference between hate speech and non-hate speech tweets regarding the number of times a tweet was liked/retweeted/replied to."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-119",
"text": "Thus, to determine whether these differences are statistically significant, we employ the chi-squared (\u03c7\u00b2) statistical significance test."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-120",
"text": "When examining likes and replies, the test yields p-values of <0.0001, meaning that tweets containing hate speech in the corpus are both liked and replied to significantly less than non-hate speech tweets are."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-121",
"text": "In other words, if a tweet contains hate speech, it is less likely to be liked and replied to."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-122",
"text": "However, when examining the difference in the number of retweets, the p-value comes out as 0.5967."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-123",
"text": "This means that we cannot reject the null hypothesis; in other words, whether or not a tweet contains hate speech does not appear to impact its retweetability either way."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-124",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-125",
"text": "**POPULARITY PREDICTION**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-126",
"text": "Features."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-127",
"text": "We use a large set of features inspired by related work (Waseem and Hovy, 2016; Sutton et al., 2015; Suh et al., 2010; Zaman et al., 2013; Hong et al., 2011; Zhang et al., 2012; Cheng et al., 2014; Ma et al., 2013; Zhao et al., 2015)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-128",
"text": "We divide our features into three groups: tweet features (metadata about the tweet itself), user features (metadata about the author of a tweet), and content features (features derived from the content of the tweet), with the largest number of features falling into the last group."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-129",
"text": "The features are listed in Table 5 ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-130",
"text": "Models and results."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-131",
"text": "We train a logistic regression classifier, as well as a linear SVM classifier to compare their performances."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-132",
"text": "We also train separate models for likes and for retweets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-133",
"text": "One pair of models is trained on the whole corpus, and two additional pairs of classifiers are trained on just the hate speech portion and the non-hate speech portion of the corpus, respectively."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-134",
"text": "We tested all models using 10-fold cross validation, holding"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-135",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-136",
"text": "**TWEET FEATURES**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-137",
"text": "Table 5: Features used in the popularity prediction task. Tweet features: tweet age, tweet hour, is quote status, is reply, is reply to hate tweet, num replies. User features: account age, len handle, len name, num followers, num followees, num times user was listed, num posted tweets, num favorited tweets. Content features: is hate tweet, has mentions, num mentions, has hashtags, num hashtags, has urls, num urls, char count, token count, has digits, has questionmark, has exclamationpoint, has fullstop, has uppercase token, uppercase token ratio, lowercase token ratio, mixedcase token ratio, blacklist total, blacklist ratio, total negative tokens, negative token ratio, total positive tokens, positive token ratio, total subjective tokens, subjective token ratio."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-138",
"text": "out 10% of the sample for evaluation to help prevent overfitting."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-139",
"text": "All modeling and evaluation was performed using Scikit-learn (Pedregosa et al., 2011) ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-140",
"text": "The evaluation results are presented in Table 6 ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-141",
"text": "We also make our feature dataset, and our training and evaluation scripts available to the community for transparency and reproduction purposes."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-142",
"text": "Interestingly, our classifiers are consistently better at predicting retweets than likes."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-143",
"text": "Given that they are trained on the same features, this indicates that the nature of these two activities is different, in spite of the fact that they intuitively seem very similar."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-144",
"text": "Furthermore, the logistic regression model seems to perform slightly better overall than the SVM model on both prediction tasks (likes and retweets)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-145",
"text": "Analysis."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-146",
"text": "In order to investigate which features are most informative for the task, we perform feature ablation according to our feature groups."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-147",
"text": "Some notable results show that removing author metadata from the feature set reduces the performance of the model."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-148",
"text": "However, the biggest takeaway for now is the impact of the is reply feature on the model."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-149",
"text": "Our SVM model's average accuracy drops by 0.04 points if the is reply feature is omitted from the feature set, whereas omitting many of the other features decreases performance scores by 0.02 points at most, if at all."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-150",
"text": "Inspired by Zhang et al. (2012), we also compute information gain (IG) scores for our features. Features that are very informative for retweeting, but not for liking, are whether the tweet contains uppercase tokens and, most notably, whether the tweet is a reply."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-151",
"text": "This is in line with our findings in the feature ablation study, confirming that there is a strong link between the possibility of retweeting and whether or not the tweet in question is a reply."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-152",
"text": "Our interpretation of this discrepancy is that original, stand-alone ideas (tweets) might be more likely to be picked up and passed on (retweeted) than a turn in a Twitter conversation thread would be."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-153",
"text": "In addition, these overall IG measurements also indicate that there is an inherent qualitative difference between the acts of liking and retweeting."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-155",
"text": "**CORPUS ANALYSIS**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-156",
"text": "As the field of hate speech research is yet to mature, with disagreement about what exactly the phenomenon entails (Waseem et al., 2017) and without a unified annotation framework (Fi\u0161er et al., 2017) , it is warranted to look at the data and examples in more detail, with considerations for potential shortcomings."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-157",
"text": "In Section 2., we pointed out the ephemeral nature of the corpus by Waseem and Hovy (2016) , common to all Twitter datasets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-158",
"text": "In this section, we analyse other characteristics of the corpus related to the challenges of data collection for hate speech analysis we mentioned in the Introduction (point 2), which can result in undesirable biases."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-159",
"text": "Tweet collection."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-160",
"text": "Given the small fraction of online content comprised of hate speech, collecting a significant amount of examples is an extremely difficult task."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-161",
"text": "At present, it is not feasible to collect a large sample of tweets and then manually label them as hate or non-hate, as the fraction of instances labeled with the positive class would be negligible."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-162",
"text": "The only way to model the phenomenon is to target tweets already likely to contain hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-163",
"text": "Driven by this rationale, the authors of the corpus obtained their dataset by performing an initial manual search of common slurs and terms pertaining to religious, sexual, gender, and ethnic minorities."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-164",
"text": "The full list of terms they queried for is not very long: MKR, asian drive, femi-nazi, immigrant, nigger, sjw, WomenAgainstFeminism, blameonenotall, islam terrorism, notallmen, victimcard, victim card, arab terror, gamergate, jsil, racecard, race card."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-165",
"text": "In the results obtained from these queries, they identified frequently occurring terms in tweets that contain hate speech and references to specific entities (such as MKR, addressed further below)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-166",
"text": "In addition to this, they identified a small number of prolific users from these searches."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-167",
"text": "This manner of tweet collection allowed the authors to obtain quite a considerable amount of data."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-168",
"text": "However, this approach to data collection inevitably introduces many biases into the dataset, as will be demonstrated further in this section."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-169",
"text": "Qualitative observations on tweet content."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-170",
"text": "According to the annotation guidelines devised by Waseem and Hovy (2016) for the purpose of annotating this corpus, a tweet is tagged as offensive if it: (1) uses a sexist or racial slur, (2) attacks a minority, (3) seeks to silence a minority, (4) criticizes a minority (without a well-founded argument), (5) promotes, but does not directly use, hate speech or violent crime, (6) criticizes a minority and uses a straw man argument, (7) blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims, (8) shows support of problematic hashtags (e.g. #BanIslam, #whoriental, #whitegenocide), (9) negatively stereotypes a minority, (10) defends xenophobia or sexism, or (11) is ambiguous (at best), contains a screen name that is offensive as per the previous criteria, and is on a topic that satisfies any of the above criteria."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-171",
"text": "Though at first glance specific and detailed, these criteria are quite broad and open to interpretation."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-172",
"text": "This was likely done to cover as many hate speech examples as possible -a thankless task, as hate speech data is scarce to begin with."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-173",
"text": "However, due to this same breadth, the corpus contains some potential false positives."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-174",
"text": "The most jarring example of this is that, if a user quotes a tweet containing hate speech (by prepending the quoted text with \"RT\"), the quoter's tweet is tagged as hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-175",
"text": "Certainly, the user could have quoted the original tweet in support of its message, and even if not, one could argue that they do perpetuate the original hateful message by quoting it."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-176",
"text": "On the other hand, it is just as likely that the user is quoting the tweet not as an endorsement, but as a neutral response."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-177",
"text": "It is even more likely that the user's response is an instance of counterspeech -interaction used to challenge hate speech (Wright et al., 2017) ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-178",
"text": "Manual inspection shows that there are instances of both such phenomena in the corpus, yet all those tweets are tagged as hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-179",
"text": "In fact, \u223c30% of hate speech tweets in the corpus contain the token 'RT', indicating they are actually retweets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-180",
"text": "This could pose a problem further down the line when extrapolating information about hate speech users."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-181",
"text": "Addressing this issue would at the very least require going through tweets with quotes and relabeling them, if not altogether rethinking the annotation guidelines, or rather, being more mindful of the semantics at play during annotation."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-182",
"text": "Topic domain."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-183",
"text": "In spite of the broad guidelines, however, it seems that the actual hate speech examples end up falling on quite a narrow spectrum."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-184",
"text": "Even though the tweets were semi-automatically picked based on a wide variety of keywords likely to identify hate speech, the tag 'racism' is in fact used as an umbrella term to label not only hate based on race/ethnicity, but also religion, specifically Islam."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-185",
"text": "Indeed, the majority of the tweets tagged as racist are, in fact, islamophobic, and primarily written by a user with an anti-Islam handle (as per guideline 11)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-186",
"text": "Though it is stated in the original paper which seed words were used to collect the data (which included both racist and islamophobic terms), it is undeniable that the most frequent words in the racist portion of the corpus refer to islamophobia (which is also explicitly stated by the authors themselves)."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-187",
"text": "This is not wrong, of course, but it raises the question of why the authors did not choose a more specific descriptor for the category, especially given that the term 'racism' typically carries different connotations, ones that, in this case, do not accurately reflect the content of the actual data."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-188",
"text": "When it comes to sexist tweets, they are somewhat more varied than those annotated as racist."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-189",
"text": "However, they do contain a similar type of bias: \u223c13.6% of the tweets tagged as sexist contain the hashtag and/or handle MKR/MyKitchenRules."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-190",
"text": "My Kitchen Rules is an Australian competitive cooking game show which is viewed less for the cooking and skill side of the show than for the gossip and conflict which certain contestants are encouraged to cause."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-191",
"text": "It seems to be a popular discussion topic among fans of the show on Twitter, and apparently prompts users to make sexist remarks regarding the female contestants."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-192",
"text": "There is nothing inherently problematic about this being included in a corpus of hate speech, but it cannot be disregarded that more than a tenth of the data on sexism is constrained to an extremely specific topic domain, which might not make for the most representative example of sexism on Twitter."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-193",
"text": "Distribution of users vs. tweet content. Another interesting dimension of the corpus that we observe is the distribution of users in relation to the hate speech annotations, an aspect that could be important for our analysis of popularity presented in Section 3."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-194",
"text": "There are 1858 unique user IDs in the corpus."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-195",
"text": "Thus, many of the 16907 tweets were written by the same people."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-196",
"text": "As a simplistic approximation, we can (very tentatively) label every user that is the author of at least one tweet containing hate speech as a hate user; and users that, in the given dataset, have not produced any tweets containing hate speech we label as non-hate users."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-197",
"text": "Of course, this does have certain drawbacks, as we cannot know that a user does not produce hate speech outside the sample we are working with, but it does provide at least an approximation of a user's production of hate tweets in the sample."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-198",
"text": "Using this heuristic, the distribution of users in the corpus with regard to whether they produce hate speech or not is presented in Table 10."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-199",
"text": "A really striking discrepancy immediately jumps out when looking at these numbers, which amount to an average of 6 sexist tweets per sexist user."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-200",
"text": "The actual distribution, however, is extremely skewed -the bulk of all the hate speech data is distributed between three users: one user who produced 1927 tweets tagged as racist, and two users who respectively produced 1320 and 964 tweets tagged as sexist."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-201",
"text": "This is illustrated in Figure 1 ."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-202",
"text": "Figure 1: Graph illustrating the distribution of tweets containing hate speech among users producing them."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-203",
"text": "We represent the number of tweets on a logarithmic scale."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-204",
"text": "Such a distribution renders any attempt at generalization or modeling of racist tweets moot, as the sample cannot be called representative of racism as such, but only of the Twitter production of these 5 users."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-205",
"text": "Similarly, the fact that most of the tweets tagged as sexist belong to the same two users considerably skews this subset of the data."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-206",
"text": "Corollary."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-207",
"text": "All of these points deserve due consideration."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-228",
"text": "In addition, reframing the task as not just a binary prediction task, but rather fitting a regression model to predict the exact number of likes, retweets and replies, would certainly be preferable and more informative, and could lead to a better understanding of how hate speech behaves on Twitter. What is clear is that hate speech is a very nuanced phenomenon and we are far from knowing everything there is to know about it."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-229",
"text": "Resources are scarce and far from perfect, and much more work and careful consideration are needed, as well as much cleaning, fine-tuning, discussion and agreement on what hate speech even is, if we are to build better resources and successfully model and predict hate speech, or any of its aspects."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-208",
"text": "The imbalances with respect to distribution of users were certainly considered while we worked with the data."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-209",
"text": "In an attempt to reduce them, we did not distinguish between racist and sexist tweets in our analysis in both Sections 2. and 3. (even though we were tempted to do so), but rather treated them all as simply hate speech tweets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-210",
"text": "Additionally, it is possible that the insights and biases presented in this section might even call into question the relevance of the findings from Section 3., as the popularity modeled there is likely reflecting the popularity of the particular Twitter users in the corpus rather than of hate speech tweets as such."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-211",
"text": "However, the data might still be useful when looked at in bulk with sexism, as it might reinforce the similarities the two share, stemming from the fact that both are types of hate speech."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-212",
"text": "----------------------------------"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-213",
"text": "**CONCLUSION**"
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-214",
"text": "This paper has provided an overview of several research directions involving hate speech: 1."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-215",
"text": "A critical look at a publicly available hate speech dataset."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-216",
"text": "2."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-217",
"text": "An attempt at replicating and confirming already established hate speech detection findings."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-218",
"text": "3. Pushing the research space in a new direction: popularity prediction."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-219",
"text": "Overall, we analyzed a currently popular hate speech dataset, pointed out considerations that have to be made while working with such data, and observed that it is biased on several levels."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-220",
"text": "This does not render it useless, but it is important to keep these biases in mind while using this resource and while drawing any sort of conclusions from the data."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-221",
"text": "As far as replicability goes, the resource does allow one to model hate speech (as biased as it may be), but not without a certain degree of difficulty."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-222",
"text": "We achieve an F1 score of 0.71, slightly lower than the original result of 0.74 on the same setup."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-223",
"text": "The differences and gaps in implementation showcase a common trend in scientific publishing -the general problem of reproducing results due to publications not providing sufficient information to make the experiments they describe replicable without involving guessing games. And only when attempting to reproduce a study can one truly realize how much detail is so easily omitted or overlooked, simply due to lack of awareness."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-224",
"text": "When it comes to popularity prediction, we determine that hate speech negatively impacts the likelihood of likes and replies, but does not affect the likelihood of retweets."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-225",
"text": "However, training only on the hate speech portion of the data does seem to boost our model's performance in retweet prediction."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-226",
"text": "These findings, as well as the evaluation scores and feature analyses, are only the first stepping stone in a long line of future work that can be done to better understand the impact of hate speech on social media and how it spreads."
},
{
"sent_id": "c182062efc486f83eb27f9a3859a9a-C001-227",
"text": "Possibilities include employing social graph mining and network analysis, perhaps using user centrality measures as features in both hate speech and popularity prediction tasks."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"c182062efc486f83eb27f9a3859a9a-C001-21"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-29"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-62"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-68"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-94"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-170"
]
],
"cite_sentences": [
"c182062efc486f83eb27f9a3859a9a-C001-21",
"c182062efc486f83eb27f9a3859a9a-C001-29",
"c182062efc486f83eb27f9a3859a9a-C001-62",
"c182062efc486f83eb27f9a3859a9a-C001-68",
"c182062efc486f83eb27f9a3859a9a-C001-94",
"c182062efc486f83eb27f9a3859a9a-C001-170"
]
},
"@USE@": {
"gold_contexts": [
[
"c182062efc486f83eb27f9a3859a9a-C001-29"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-31"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-37"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-95"
]
],
"cite_sentences": [
"c182062efc486f83eb27f9a3859a9a-C001-29",
"c182062efc486f83eb27f9a3859a9a-C001-31",
"c182062efc486f83eb27f9a3859a9a-C001-37",
"c182062efc486f83eb27f9a3859a9a-C001-95"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"c182062efc486f83eb27f9a3859a9a-C001-127"
],
[
"c182062efc486f83eb27f9a3859a9a-C001-157"
]
],
"cite_sentences": [
"c182062efc486f83eb27f9a3859a9a-C001-127",
"c182062efc486f83eb27f9a3859a9a-C001-157"
]
}
}
},
"ABC_fa00b8bac394b48bf950f154c65216_10": {
"x": [
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-2",
"text": "Abstractive summarization aims to rewrite a long document to its shorter form, which is usually modeled as a sequence-to-sequence (SEQ2SEQ) learning problem."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-3",
"text": "Transformers are powerful models for this problem."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-4",
"text": "Unfortunately, training large SEQ2SEQ Transformers on limited supervised summarization data is challenging."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-5",
"text": "We therefore propose STEP (as shorthand for Sequence-to-Sequence TransformEr Pretraining), which can be trained on large scale unlabeled documents."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-6",
"text": "Specifically, STEP is pre-trained using three different tasks, namely sentence reordering, next sentence generation, and masked document generation."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-7",
"text": "Experiments on two summarization datasets show that all three tasks can improve performance upon a heavily tuned large SEQ2SEQ Transformer which already includes a strong pretrained encoder by a large margin."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-8",
"text": "By using our best task to pre-train STEP, we outperform the best published abstractive model on CNN/DailyMail by 0.8 ROUGE-2 and New York Times by 2.4 ROUGE-2."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-9",
"text": "1 * Contribution during internship at Microsoft Research."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-10",
"text": "1 We plan to make our code and models public available."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-11",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-12",
"text": "**INTRODUCTION**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-13",
"text": "Large pre-trained language models (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; improved the state-of-the-art of various natural language understanding (NLU) tasks such as question answering (e.g., SQuAD; Rajpurkar et al. 2016) , natural language inference (e.g., MNLI; Bowman et al. 2015) as well as text classification (Zhang et al., 2015) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-14",
"text": "These models (i.e., large LSTMs; Hochreiter and Schmidhuber 1997 or Transformers; Vaswani et al. 2017 ) are pre-trained on large scale unlabeled text with language modeling (Peters et al., 2018; Radford et al., 2018) , masked lan-guage modeling (Devlin et al., 2019; and permutation language modeling objectives."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-15",
"text": "In NLU tasks, pre-trained language models are mostly used as text encoders."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-16",
"text": "Abstractive document summarization aims to rewrite a long document to its shorter form while still retaining its important information."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-17",
"text": "Different from extractive document summarization that extacts important sentences, abstractive document summarization may paraphrase original sentences or delete contents from them."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-18",
"text": "For more details on differences between abstractive and extractive document summary, we refer the interested readers to Nenkova and McKeown (2011) and Section 2."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-19",
"text": "This task is usually framed as a sequence-to-sequence learning problem (Nallapati et al., 2016; See et al., 2017) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-20",
"text": "In this paper, we adopt the sequence-to-sequence (SEQ2SEQ) Transformer (Vaswani et al., 2017) , which has been demonstrated to be the state-ofthe-art for SEQ2SEQ modeling (Vaswani et al., 2017; ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-21",
"text": "Unfortunately, training large SEQ2SEQ Transformers on limited supervised summarization data is challenging ) (refer to Section 5)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-22",
"text": "The SEQ2SEQ Transformer has an encoder and a decoder Transformer."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-23",
"text": "Abstractive summarization requires both encoding of an input document and generation of a summary usually containing multiple sentences."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-24",
"text": "As mentioned earlier, we can take advantage of recent pre-trained Transformer encoders for the document encoding part as in Liu and Lapata (2019) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-25",
"text": "However, Liu and Lapata (2019) leave the decoder randomly initialized."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-26",
"text": "In this paper, we aim to pretrain both the encoder (i.e., the encoding part) and decoder (i.e., the generation part) of a SEQ2SEQ Transformer , which is able to improve abstractive summarization performance."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-27",
"text": "Based on the above observations, we propose STEP (as shorthand for Sequence-to-Sequence TransformEr Pre-training), which can be pretrained on large scale unlabeled documents."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-28",
"text": "Specifically, we design three tasks for SEQ2SEQ model pre-training, namely Sentence Reordering (SR), Next Sentence Generation (NSG), and Masked Document Generation (MDG)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-29",
"text": "SR learns to recover a document with randomly shuffled sentences."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-30",
"text": "NSG generates the next segment of a document based on its preceding segment."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-31",
"text": "MDG recovers a masked document to its original form."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-32",
"text": "After pre-trianing STEP using the three tasks on unlabeled documents, we fine-tune it on supervised summarization datasets."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-33",
"text": "We evaluate our methods on two summarization datasets (i.e., the CNN/DailyMail and the New York Times datasets)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-34",
"text": "Experiments show that all three tasks we propose can improve upon a heavily tuned large SEQ2SEQ Transformer which already includes a strong pre-trained encoder by a large margin."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-35",
"text": "Compared to the best published abstractive models, STEP improves the ROUGE-2 by 0.8 on the CNN/DailyMail dataset and by 2.4 on the New York Times dataset using our best performing task for pre-training."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-36",
"text": "Human experiments also show that STEP can produce significantly better summaries in comparison with recent strong abstractive models."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-37",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-38",
"text": "**RELATED WORK**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-39",
"text": "This section introduces extractive and abstractive document summarization as well as pre-training methods for natural language processing tasks."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-40",
"text": "Extractive Summarization Extractive summarization systems learn to find the informative sentences in a document as its summary."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-41",
"text": "This task is usually viewed as a sentence ranking problem (Kupiec et al., 1995; Conroy and O'leary, 2001) using scores from a binary (sequence) classification model, which predicts whether a sentence is in the summary or not."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-42",
"text": "Extractive neural models employ hierarchical LSTMs/CNNs as the feature learning part of the binary (sequence) classifier (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2018; Zhang et al., 2018) , which largely outperforms discrete feature based models (Radev et al., 2004; Filatova and Hatzivassiloglou, 2004; Nenkova et al., 2006) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-43",
"text": "Very recently, the feature learning part was replaced again with pretrained transformers (Zhang et al., 2019; Liu and Lapata, 2019 ) that lead to another huge improvement of summarization performance."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-44",
"text": "However, extractive models have their own limitations."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-45",
"text": "For example, the extracted sentences might be too long and redundant."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-46",
"text": "Besides, human written summaries in their nature are abstractive."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-47",
"text": "Therefore, we focus on abstractive summarization in this paper."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-48",
"text": "Abstractive Summarization The goal of abstractive summarization is to generate summaries by rewriting a document, which is a sequence-tosequence learning problem."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-49",
"text": "SEQ2SEQ attentive LSTMs (Hochreiter and Schmidhuber, 1997; Bahdanau et al., 2015) are employed in Nallapati et al. (2016) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-50",
"text": "Even these models are extended with copy mechanism (Gu et al., 2016) , coverage model (See et al., 2017) and reinforcement learning (Paulus et al., 2018) , their results are still very close to that of Lead3 which selects the leading three sentences of a document as its summary."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-51",
"text": "One possible reason is that LSTMs without pre-training are not powerful enough."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-52",
"text": "Liu and Lapata (2019) used a SEQ2SEQ Transformer model with its encoder initialized with a pre-trained Transformer (i.e., BERT; Devlin et al. 2019 ) and achieved the state-of-the-art performance."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-53",
"text": "Our work goes one step further, we propose a method to pre-train the decoder together with the encoder and then initialize both the encoder and decoder of a summarization model with the pre-trained Transformers."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-54",
"text": "There is also a line of work that bridges extractive and abstractive models with reinforcement learning (Chen and Bansal, 2018) , attention fusion (Hsu et al., 2018) and bottom-up attention (Gehrmann et al., 2018) , while our model is conceptually simpler."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-55",
"text": "Pre-training Pre-training methods draw a lot of attention recently."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-56",
"text": "Peters et al. (2018) and Radford et al. (2019) pre-trained LSTM and Transformer encoders using language modeling objectives."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-57",
"text": "To leverage the context in both directions, (Devlin et al., 2019) proposed BERT, which is trained with the mask language modeling objective."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-58",
"text": "XLNet is trained with permutation language modeling objective, which removes the independence assumption of masked tokens in BERT."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-59",
"text": "RoBERTa extends BERT with more training data and better training strategies."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-60",
"text": "All the methods above focus on pre-training an encoder, while we propose methods to pre-train both the encoder and decoder of a SEQ2SEQ model."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-61",
"text": "Dong et al. (2019) proposed a Transformer language model that can be used for both natural language understanding and generation tasks, which is pre-trained using masked, unidirectional and SEQ2SEQ language modeling objectives."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-62",
"text": "Their method tries to pre-train a SEQ2SEQ Transformer with its encoder and decoder parameters shared."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-63",
"text": "Differently, we pre-train a SEQ2SEQ Transformer with separate parameters for the encoder and decoder."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-64",
"text": "Song et al. (2019) proposed a method to pre-train a SEQ2SEQ Transformer by masking a span of text and then predicting the original text with masked tokens at other positions."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-65",
"text": "Their pretraining task is similar to our Masked Document Generation task, but we apply a different masking strategy and predict the original text without masked tokens."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-66",
"text": "Besides, we propose another two tasks for SEQ2SEQ model pre-training."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-67",
"text": "Song et al. (2019) tested their model on sentence-level tasks (e.g., machine translation and sentence compression), while we aim to solve document-level tasks (e.g., abstractive document summarization)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-68",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-69",
"text": "**SEQUENCE-TO-SEQUENCE TRANSFORMER**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-70",
"text": "Pre-training"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-71",
"text": "This section first introduces the backbone architecture of our abstractive summarization model STEP."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-72",
"text": "We then describe methods to pre-train STEP and finally move on to the fine-tuning on summarization datasets."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-73",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-74",
"text": "**ARCHITECTURE**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-75",
"text": "In this work, the task of abstractive document summarization is modeled as a sequence-to-sequence learning problem, where a document is viewed as a sequence of tokens and its corresponding summary as another sequence of tokens."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-76",
"text": "We adopt the SEQ2SEQ Transformer architecture (Vaswani et al., 2017) , which includes an encoder Transformer and a decoder Transformer."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-77",
"text": "Both the encoder and decoder Transformers have multiple layers and each layer contains a multi-head attentive sub-layer 2 followed by a fully connected sublayer with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-78",
"text": "Let us use X = (x 1 , x 2 , . . . , x |X| ) to denote a document and use Y = (y 1 , y 2 , . . . , y |Y | ) to denote its summary."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-79",
"text": "The encoder takes the docu-ment X as input and transforms it to its contextual representations."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-80",
"text": "The decoder learns to generate the summary Y one token at a time based on the contextual representations and all preceding tokens that have been generated so far:"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-81",
"text": "where y Liu and Lapata, 2019) , we use the non-anonymized version of CNNDM."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-169",
"text": "Specifically, we preprocess the dataset with the publicly available scripts 5 provided by See et al. (2017) and obtain 287,226 document-summary pairs for training, 13,368 for validation and 11,490 for test."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-170",
"text": "NYT The NYT dataset is a collection of articles along with multi-sentence summaries written by library scientists."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-171",
"text": "We closely follow the preprocessing procedures described in Durrett et al. (2016) and Liu and Lapata (2019) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-172",
"text": "The test set is constructed by including all articles published on January 1, 2017 or later, which contains 9,076 articles."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-173",
"text": "The remaining 100,834 articles are split into a training set of 96,834 examples and a validation set of 4,000 examples."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-174",
"text": "As in (Durrett et al., 2016) , we also remove articles whose summaries contain less than 50 words from the test set, and the resulting test set contains 3,452 examples."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-175",
"text": "GIGA-CM To pre-train our model with the tasks introduced in Section 3.2, following the procedures in (Zhang et al., 2019) , we created the GIGA-CM dataset, which contains only unlabeled documents."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-176",
"text": "The training set of GIGA-CM is composed of 6,521,658 documents sampled from the English Gigaword dataset 6 and the training documents in CNNDM."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-177",
"text": "We used the 13,368 documents in the validation split of CNNDM as its validation set."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-178",
"text": "Note that the Gigaword dataset overlaps with the NYT dataset and we therefore exclude the test set of NYT from the training set of GIGA-CM."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-179",
"text": "For CNNDM, NYT and GIGA-CM datasets, we segment and tokenize documents and/or summaries (GIGA-CM only contains documents) using the Stanford CoreNLP toolkit (Manning et al., 2014) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-180",
"text": "To reduce the vocabulary size, we further apply the UTF8 based BPE (Sennrich et al.) introduced in GPT-2 (Radford et al., 2019) to all datasets."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-181",
"text": "As a common wisdom in abstractive summarization, documents and summaries in CN-NDM and NYT are usually truncated to 512 and 256 tokens, respectively."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-182",
"text": "We leverage unlabeled documents differently for different pre-training tasks (see Section 3.2)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-183",
"text": "We first split each document into 512 token segments if it contains more than 512 tokens (segments or documents with less than 512 tokens are removed)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-184",
"text": "In Sentence Reordering (SR) and Masked Document Generation (MDG), we use the segment after transformation to predict the original segment."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-185",
"text": "We set the minimum masked length a = 100 and the maximum masked length b = 256 in MDG."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-186",
"text": "In Next Sentence Generation (NSG), each segment is used to predict its next 256 tokens."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-187",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-188",
"text": "**IMPLEMENTATION DETAILS**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-189",
"text": "As mentioned in Section 3, our model is a SEQ2SEQ Transformer model (Vaswani et al., 2017) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-190",
"text": "The encoder is initialized with the RoBERTa LARGE model 7 , and therefore they share the same architecture."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-191",
"text": "Specifically, the encoder is a 24-layer Transformer."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-192",
"text": "Each layer has 16 attention heads and its hidden size and feed-forward filter size are 1,024 and 4,096, respectively."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-193",
"text": "The decoder is shallower with 6 layers."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-194",
"text": "The hidden size and number of attention head of the decoder are identical to these of the encoder, but the feed-forward filter size is 2,048."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-195",
"text": "We use a smaller filter size in the decoder to reduce the computational and memory cost."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-196",
"text": "The dropout rates of all layers in the encoder are set to 0.1 and all dropout rates in the decoder are set to 0.3."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-197",
"text": "Our models are optimized using Adam (Kingma and Ba, 2015) with \u03b2 1 = 0.9, \u03b2 2 = 0.98."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-198",
"text": "The other optimization hyper-parameters for pretraining and fine-tuning are different."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-199",
"text": "In the pretraining stage, the encoder is initialized with a pretrained model while the decoder is randomly initialized."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-200",
"text": "Therefore, we used two separate optimizers for the encoder and decoder with a smaller learning rate for the encoder optimizer."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-201",
"text": "Learning rates of the encoder and decoder are set to 2e \u2212 5 and 1e\u22124 with 10,000 warmup steps, respectively."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-202",
"text": "We also adopted the same learning rate schedule strategies as Vaswani et al. (2017) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-203",
"text": "We used smaller batch sizes for datasets with less examples (i.e., 1,024 for GIGA-CM, 256 for CNNDM and 128 for NYT) to ensure each epoch has sufficient number of model updates."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-204",
"text": "We trained our models until their convergence of validation perplexities (around 30 epochs on GIGA-CM, 60 epochs on CNNDM and 40 epochs on NYT)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-205",
"text": "One epoch on GIGA-CM takes around 24 hours with 8 Nvidia Tesla V100 GPUs."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-206",
"text": "The time costs for different pre-training tasks are close."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-207",
"text": "Most of the hyper-parameters in the fine-tuning stage are the same as these in the pre-training stage."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-208",
"text": "The differences are as follows."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-209",
"text": "The learning rates for both the encoder and decoder are set to 2e \u2212 5 with 4,000 warmup steps, since both the encoder and decoder are already pre-trained."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-210",
"text": "We trained our models for 50 epochs (saved per epoch) and selected the best model w.r.t."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-211",
"text": "ROUGE score on the validation set ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-212",
"text": "During decoding, we applied beam search with beam size of 5."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-213",
"text": "Following (Paulus et al., 2018) , we also blocked repeating trigrams during beam search and tuned the minimum summary length on the validation set."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-214",
"text": "Similar to the pre-training process, the datasets with less instances were fine-tuned with smaller batch size (i.e., 768 for CNNDM and 64 for NYT)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-215",
"text": "(Paulus et al., 2018) 39.87 15.82 36.90 BottomUp (Gehrmann et al., 2018) 41.22 18.68 38.34 DCA (Celikyilmaz et al., 2018) 41.69 19.47 37.92 BERTAbs (Liu and Lapata, 2019) 42.13 19.60 39.18 UniLM (Dong et al., 2019) 43"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-216",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-217",
"text": "**EVALUATIONS**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-218",
"text": "We used ROUGE (Lin, 2004) to measure the quality of different summarization model outputs."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-219",
"text": "We reported full-length F1 based ROUGE-1, ROUGE-2 and ROUGE-L scores on CN-NDM, while we used the limited-length recall based ROUGE-1, ROUGE-2 and ROUGE-L on NYT following (Durrett et al., 2016) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-220",
"text": "The ROUGE scores are computed using the ROUGE-1.5.5.pl script."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-221",
"text": "Since summaries generated by abstractive models may produce disfluent or ungrammatical outputs, we also evaluated abstractive systems by eliciting human judgements."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-222",
"text": "Following previous work (Cheng and Lapata, 2016; Narayan et al., 2018) , 20 documents are randomly sampled from the test split of CNNDM."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-223",
"text": "Participants are presented with a document and a list of outputs generated by different abstractive summarization systems."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-224",
"text": "Then they are asked to rank the outputs according to informativeness (does the summary capture the informative part of the document?), fluency (is the summary grammatical?), and succinctness (does the summary express the document clearly in a few words?) (See et al., 2017) 43.71 * 26.40 * -DRM (Paulus et al., 2018) 42.94 * 26.02 * -BERTAbs (Liu and Lapata, 2019)"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-225",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-226",
"text": "**RESULTS**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-227",
"text": "Automatic Evaluation The results on the CN-NDM are summarized in Table 1 ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-228",
"text": "The first and second blocks show results of previous extractive and abstractive models, respectively."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-229",
"text": "Results of STEP are all listed in the third block."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-230",
"text": "Lead3 is a baseline which simply takes the first three sentences of a document as its summary."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-231",
"text": "BERTExt (Liu and Lapata, 2019 ) is an extractive model fine-tuning on BERT (Devlin et al., 2019) that outperforms other extractive systems."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-232",
"text": "PTGen (See et al., 2017) , DRM (Paulus et al., 2018) , and DCA (Celikyilmaz et al., 2018) are sequence-to-sequence learning based models extended with copy and coverage mechanism, reinforcement learning, and deep communicating agents individually."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-233",
"text": "BottomUp (Gehrmann et al., 2018) assisted summary generation with a word prediction model."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-234",
"text": "BERTAbs (Liu and Lapata, 2019) and UniLM (Dong et al., 2019) are both pre-training based SEQ2SEQ summarization models."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-235",
"text": "We also implemented three abstractive models as our baselines."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-236",
"text": "Transformer-S2S is 6-layer SEQ2SEQ Transformer (Vaswani et al., 2017) with random initialization."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-237",
"text": "When we replaced the encoder of Transformer-S2S with RoBERTa BASE , RoBERTa BASE -S2S outperforms Transformer-S2S by nearly 2 ROUGE, which demonstrates the effectiveness of pre-trained models."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-238",
"text": "With even larger pre-trained model RoBERTa LARGE , RoBERTa-S2S is compa-rable with the best published abstractive model UniLM (Dong et al., 2019) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-239",
"text": "Based on RoBERTa-S2S (the sizes of STEP and RoBERTa-S2S are identical), we study the effect of different pre-training tasks (see Section 3.2)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-240",
"text": "We first pre-train STEP on unlabeled documents of CNNDM training split to get quick feedback 8 , denoted as STEP (in-domain)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-241",
"text": "From the top part of the third block in Table 1 , we can see that Sentence Reordering (SR), Next Sentence Generation (NSG) and Masked Document Generation (MDG) can all improve RoBERTa-S2S significantly measured by the ROUGE script."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-242",
"text": "Note that according to the ROUGE script, \u00b10.22 ROUGE almost always means a significant difference with p < 0.05."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-243",
"text": "Interesting, even STEP is pre-trained on 230 million words, it outperforms UniLM that is pretrained on 3,000 million words (Dong et al., 2019) ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-244",
"text": "When we pre-train STEP on even larger dataset (i.e., GIGA-CM), the results are further improved and STEP outperforms all models in comparison, as listed in the bottom part of Table 1 ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-245",
"text": "Table 2 presents results on NYT dataset."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-246",
"text": "Following the same evaluation protocol as Durrett et al. (2016) , we adopted the limited-length recall based ROUGE, where we truncated the predicted summaries to the length of the gold ones."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-247",
"text": "Again, the first and second blocks show results of previous extractive and abstractive models, respectively."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-248",
"text": "Results of STEP are listed in the third block."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-249",
"text": "Similar to the trends in CNNDM, STEP leads significant performance gains (with p < 0.05) compared to all other models in Table 2 ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-250",
"text": "Among all three pre-training tasks, SR works slightly better than the other two tasks (i.e., NSG and MDG)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-251",
"text": "We also tried to randomly use all the three tasks during training with 1/3 probability each (indicated as ALL)."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-252",
"text": "Interesting, we observed that, in general, All outperforms all three tasks when employing unlabeled documents of training splits of CNNDM or NYT, which might be due to limited number of unlabeled documents of the training splits."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-253",
"text": "After adding more data (i.e., GIAG-CM) to pre-training, SR consistently achieves highest ROUGE-2 on both CNNDM and NYT."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-254",
"text": "We conclude that SR is the most effective task for pre-training since sentence reordering task requires comprehensively understanding a document in a wide coverage, going beyond individual words and sentences, which is highly close to the 8 One epoch takes 3 hours on CNNDM and 0.5 on NYT."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-255",
"text": "essense of abstractive document summarization."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-256",
"text": "Human Evaluation We also conducted human evaluation with 20 documents randomly sampled from the test split of CNNDM."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-257",
"text": "We compared the best preforming STEP model (i.e., pre-training on the GIGA-CM dataset using SR task) with human references (denoted as Gold), RoBERTa-S2S, and two pre-training based models, BERTAbs (Liu and Lapata, 2019) and UniLM (Dong et al., 2019) 9 ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-258",
"text": "Participants were asked to rank the outputs of these systems from best to worst."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-259",
"text": "We report the proportions of system rankings and mean rank (lower is better) in Table 3 ."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-260",
"text": "The output of STEP is selected as the best for the 25% of cases and we obtained lower mean rank than all systems except for Gold, which shows the participants' preference for our model."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-261",
"text": "Then we converted ranking numbers into ratings (i.e., rank i is converted into 6 \u2212 i) and applied the student t-test on the ratings."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-262",
"text": "STEP is significantly better than all other systems in comparison with p < 0.05. But it still lags behind human."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-263",
"text": "One possible reason is that STEP (as well as other systems) only takes the first 512 tokens of a long document as input and thus may lose information residing in the following tokens."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-264",
"text": "----------------------------------"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-265",
"text": "**CONCLUSION**"
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-266",
"text": "We proposed STEP, a SEQ2SEQ transformer pretraining approach, for abstractive document summarization."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-267",
"text": "Specifically, three pre-training tasks are designed, sentence reordering, next sentence generation, and masked document generation."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-268",
"text": "When we only employ the unlabeled documents in the training splits of summarization datasets to pre-training STEP with our proposed tasks, the summarization model based on the pre-trained STEP outperforms the best published abstractive system."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-269",
"text": "Involving large scale data to pretraining leads to larger performance gains."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-270",
"text": "By using the best performing pre-training task, STEP achieves 0.8 absolute ROUGE-2 improvements on CNN/DailyMail and 2.4 absolute ROUGE-2 improvements on New York Times."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-271",
"text": "In the future, we would like to investigate other tasks to pre-train the SEQ2SEQ transformer model."
},
{
"sent_id": "fa00b8bac394b48bf950f154c65216-C001-272",
"text": "Pre-training for unsupervised abstractive summarization is also an interesting direction and worth exploration."
}
],
"y": {
"@USE@": {
"gold_contexts": [
[
"fa00b8bac394b48bf950f154c65216-C001-24"
],
[
"fa00b8bac394b48bf950f154c65216-C001-168"
],
[
"fa00b8bac394b48bf950f154c65216-C001-171"
]
],
"cite_sentences": [
"fa00b8bac394b48bf950f154c65216-C001-24",
"fa00b8bac394b48bf950f154c65216-C001-168",
"fa00b8bac394b48bf950f154c65216-C001-171"
]
},
"@SIM@": {
"gold_contexts": [
[
"fa00b8bac394b48bf950f154c65216-C001-24"
]
],
"cite_sentences": [
"fa00b8bac394b48bf950f154c65216-C001-24"
]
},
"@DIF@": {
"gold_contexts": [
[
"fa00b8bac394b48bf950f154c65216-C001-24",
"fa00b8bac394b48bf950f154c65216-C001-25"
]
],
"cite_sentences": [
"fa00b8bac394b48bf950f154c65216-C001-24",
"fa00b8bac394b48bf950f154c65216-C001-25"
]
},
"@BACK@": {
"gold_contexts": [
[
"fa00b8bac394b48bf950f154c65216-C001-43"
],
[
"fa00b8bac394b48bf950f154c65216-C001-231"
],
[
"fa00b8bac394b48bf950f154c65216-C001-234"
]
],
"cite_sentences": [
"fa00b8bac394b48bf950f154c65216-C001-43",
"fa00b8bac394b48bf950f154c65216-C001-231",
"fa00b8bac394b48bf950f154c65216-C001-234"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"fa00b8bac394b48bf950f154c65216-C001-257"
]
],
"cite_sentences": [
"fa00b8bac394b48bf950f154c65216-C001-257"
]
}
}
},
"ABC_dbe1f1bdf7d94824f6f7cd176a4f6d_10": {
"x": [
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-2",
"text": "Inferring implicit discourse relations in natural language text is the most difficult subtask in discourse parsing."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-99",
"text": "**CORPORA AND IMPLEMENTATION**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-3",
"text": "Surface features achieve good performance, but they are not readily applicable to other languages without semantic lexicons."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-4",
"text": "Previous neural models require parses, surface features, or a small label set to work well."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-5",
"text": "Here, we propose neural network models that are based on feedforward and long-short term memory architecture without any surface features."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-6",
"text": "To our surprise, our best configured feedforward architecture outperforms LSTM-based model in most cases despite thorough tuning."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-7",
"text": "Under various fine-grained label sets and a cross-linguistic setting, our feedforward models perform consistently better or at least just as well as systems that require hand-crafted surface features."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-8",
"text": "Our models present the first neural Chinese discourse parser in the style of Chinese Discourse Treebank, showing that our results hold cross-linguistically."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-11",
"text": "The discourse structure of a natural language text has been analyzed and conceptualized under various frameworks (Mann and Thompson, 1988; Lascarides and Asher, 2007; Prasad et al., 2008) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-12",
"text": "The Penn Discourse TreeBank (PDTB) and the Chinese Discourse Treebank (CDTB), currently the largest corpora annotated with discourse structures in English and Chinese respectively, view the discourse structure of a text as a set of discourse relations (Prasad et al., 2008; Zhou and Xue, 2012) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-13",
"text": "Each discourse relation is grounded by a discourse * Work performed while being a student at Brandeis connective taking two text segments as arguments (Prasad et al., 2008) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-14",
"text": "Implicit discourse relations are those where discourse connectives are omitted from the text and yet the discourse relations still hold."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-15",
"text": "While classifying explicit discourse relations is relatively easy, as the discourse connective itself provides a strong cue for the discourse relation (Pitler et al., 2008) , the classification of implicit discourse relations has proved to be notoriously hard and it has remained one of the last missing pieces in an end-to-end discourse parser (Xue et al., 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-16",
"text": "In the absence of explicit discourse connectives, implicit discourse relations have to be inferred from their two arguments."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-17",
"text": "Previous approaches on inferring implicit discourse relations have typically relied on features extracted from their two arguments."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-18",
"text": "These features include word pairs that are the Cartesian products of the word tokens in the two arguments as well as features manually crafted from various lexicons such as verb classes and sentiment lexicons (Pitler et al., 2009; Rutherford and Xue, 2014) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-19",
"text": "These lexicons are used mainly to offset the data sparsity problem created by pairs of word tokens used directly as features."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-20",
"text": "Neural network models are an attractive alternative for this task for at least two reasons."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-21",
"text": "First, they can model the argument of an implicit discourse relation as dense vectors and suffer less from the data sparsity problem that is typical of the traditional feature engineering paradigm."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-22",
"text": "Second, they should be easily extended to other languages as they do not require human-annotated lexicons."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-23",
"text": "However, despite the many nice properties of neural network models, it is not clear how well they will fare with a small dataset, typicalley found in discourse annotation projects."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-24",
"text": "Moreover, it is not straightforward to construct a single vector that properly represents the \"semantics\" of the ar-guments."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-25",
"text": "As a result, neural network models that use dense vectors have been shown to have inferior performance against traditional systems that use manually crafted features, unless the dense vectors are combined with the hand-crafted surface features (Ji and Eisenstein, 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-26",
"text": "In this work, we explore multiple neural architectures in an attempt to find the best distributed representation and neural network architecture suitable for this task in both English and Chinese."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-27",
"text": "We do this by probing the different points on the spectrum of structurality from structureless bag-of-words models to sequential and tree-structured models."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-28",
"text": "We use feedforward, sequential long short-term memory (LSTM), and tree-structured LSTM models to represent these three points on the spectrum."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-29",
"text": "To the best of our knowledge, there is no prior study that investigates the contribution of the different architectures in neural discourse analysis."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-30",
"text": "Our main contributions and findings from this work can be summarized as follows:"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-31",
"text": "\u2022 Our neural discourse model performs comparably with or even outperforms systems with surface features across different fine-grained discourse label sets."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-32",
"text": "\u2022 We investigate the contribution of the linguistic structures in neural discourse modeling and found that high-dimensional word vectors trained on a large corpus can compensate for the lack of structures in the model, given the small amount of annotated data."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-33",
"text": "\u2022 We found that modeling the interaction across arguments via hidden layers is essential to improving the performance of an implicit discourse relation classifier."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-34",
"text": "\u2022 We present the first neural CDTB-style Chinese discourse parser, confirming that our current results and other previous findings conducted on English data also hold crosslinguistically."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-35",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-36",
"text": "**RELATED WORK**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-37",
"text": "The prevailing approach for this task is to use surface features derived from various semantic lexicons (Pitler et al., 2009) , reducing the number of parameters by mapping raw word tokens in the arguments of discourse relations to a limited number of entries in a semantic lexicon such as polarity and verb classes."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-38",
"text": "Along the same vein, Brown cluster assignments have also been used as a general purpose lexicon that requires no human manual annotation (Rutherford and Xue, 2014) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-39",
"text": "However, these solutions still suffer from the data sparsity problem and almost always require extensive feature selection to work well (Park and Cardie, 2012; Lin et al., 2009; Ji and Eisenstein, 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-40",
"text": "The work we report here explores the use of the expressive power of distributed representations to overcome the data sparsity problem found in the traditional feature engineering paradigm."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-41",
"text": "Neural network modeling has attracted much attention in the NLP community recently and has been explored to some extent in the context of this task."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-42",
"text": "Recently, Braud and Denis (2015) tested various word vectors as features for implicit discourse relation classification and show that distributed features achieve the same level of accuracy as one-hot representations in some experimental settings."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-43",
"text": "Ji et al. (2015; 2016) advance the state of the art for this task using recursive and recurrent neural networks."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-44",
"text": "In the work we report here, we systematically explore the use of different neural network architectures and show that when high-dimensional word vectors are used as input, a simple feed-forward architecture can outperform more sophisticated architectures such as sequential and tree-based LSTM networks, given the small amount of data."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-45",
"text": "Recurrent neural networks, especially LSTM networks, have changed the paradigm of deriving distributed features from a sentence (Hochreiter and Schmidhuber, 1997), but they have not been much explored in the realm of discourse parsing."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-46",
"text": "LSTM models have been notably used to encode the meaning of source language sentence in neural machine translation (Cho et al., 2014; Devlin et al., 2014) and recently used to encode the meaning of an entire sentence to be used as features (Kiros et al., 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-47",
"text": "Many neural architectures have been explored and evaluated, but there is no single technique that is decidedly better across all tasks."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-48",
"text": "The LSTM-based models such as Kiros et al. (2015) perform well across tasks but do not outperform some other strong neural baselines."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-49",
"text": "Ji et al. (2016) to deduce how well LSTM fares in fine-grained implicit discourse relation classification."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-50",
"text": "A joint discourse language model might not scale well to finer-grained label set, which is more practical for application."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-51",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-52",
"text": "**MODEL ARCHITECTURES**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-53",
"text": "Following previous work, we assume that the two arguments of an implicit discourse relation are given so that we can focus on predicting the senses of the implicit discourse relations."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-54",
"text": "The input to our model is a pair of text segments called Arg1 and Arg2, and the label is one of the senses defined in the Penn Discourse Treebank as in the example below: Input: Arg1 Senator Pete Domenici calls this effort \"the first gift of democracy\" Arg2 The Poles might do better to view it as a Trojan Horse."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-55",
"text": "Output: Sense Comparison."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-56",
"text": "Contrast In all architectures, each word in the argument is represented as a k-dimensional word vector trained on an unannotated data set."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-57",
"text": "We use various model architectures to transform the semantics represented by the word vectors into distributed continuous-valued features."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-58",
"text": "In the rest of the section, we explain the details of the neural network architectures that we design for the implicit discourse relations classification task."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-59",
"text": "The models are summarized schematically in Figure 1 ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-60",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-61",
"text": "**BAG-OF-WORDS FEEDFORWARD MODEL**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-62",
"text": "This model does not model the structure or word order of a sentence."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-63",
"text": "The features are simply obtained through element-wise pooling functions."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-64",
"text": "Pooling is one of the key techniques in neural network modeling of computer vision (Krizhevsky et al., 2012; LeCun et al., 2010) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-65",
"text": "Max pooling is known to be very effective in vision, but it is unclear what pooling function works well when it comes to pooling word vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-66",
"text": "Summation pooling and mean pooling have been claimed to perform well at composing meaning of a short phrase from individual word vectors (Le and Mikolov, 2014; Blacoe and Lapata, 2012; Mikolov et al., 2013b; Braud and Denis, 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-67",
"text": "The Arg1 vector a 1 and Arg2 vector a 2 are computed by applying element-wise pooling function f on all of the N 1 word vectors in Arg1 w 1 1:N 1 and all of the N 2 word vectors in Arg2 w 2 1:N 2 respectively:"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-68",
"text": "We consider three different pooling functions namely max, summation, and mean pooling functions:"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-69",
"text": "Inter-argument interaction is modeled directly by the hidden layers that take argument vectors as features."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-70",
"text": "Discourse relations cannot be determined based on the two arguments individually."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-71",
"text": "Instead, the sense of the relation can only be determined when the arguments in a discourse relation are analyzed jointly."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-72",
"text": "The first hidden layer h 1 is the non-linear transformation of the weighted linear combination of the argument vectors:"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-73",
"text": "where W 1 and W 2 are d \u00d7 k weight matrices and b h 1 is a d-dimensional bias vector."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-74",
"text": "Further hidden layers h t and the output layer o follow the standard feedforward neural network model."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-75",
"text": "where W ht is a d \u00d7 d weight matrix, b ht is a ddimensional bias vector, and T is the number of hidden layers in the network."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-77",
"text": "**SEQUENTIAL LONG SHORT-TERM MEMORY (LSTM)**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-78",
"text": "A sequential Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) models the semantics of a sequence of words through the use of hidden state vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-79",
"text": "Therefore, the word ordering does affect the resulting hidden state vectors, unlike the bag-of-word model."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-80",
"text": "For each word vector at word position t, we compute the corresponding hidden state vector s t and the memory cell vector c t from the previous step."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-81",
"text": "where * is elementwise multiplication."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-82",
"text": "The argument vectors are the results of applying a pooling function over the hidden state vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-83",
"text": "In addition to the three pooling functions that we describe in the previous subsection, we also consider using only the last hidden state vector, which should theoretically be able to encode the semantics of the entire word sequence."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-84",
"text": "Inter-argument interaction and the output layer are modeled in the same fashion as the bag-of-words model once the argument vector is computed."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-86",
"text": "**TREE LSTM**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-87",
"text": "The principle of compositionality leads us to believe that the semantics of the argument vector should be determined by the syntactic structures and the meanings of the constituents."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-88",
"text": "For a fair comparison with the sequential model, we apply the same formulation of LSTM on the binarized constituent parse tree."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-89",
"text": "The hidden state vector now corresponds to a constituent in the tree."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-90",
"text": "These hidden state vectors are then used in the same fashion as the sequential LSTM."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-91",
"text": "The mathematical formulation is the same as Tai et al. (2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-92",
"text": "This model is similar to the recursive neural networks proposed by Ji and Eisenstein (2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-93",
"text": "Our model differs from their model in several ways."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-94",
"text": "We use the LSTM networks instead of the \"vanilla\" RNN formula and expect better results due to less complication with vanishing and exploding gradients during training."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-95",
"text": "Furthermore, our purpose is to compare the influence of the model structures."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-96",
"text": "Therefore, we must use LSTM cells in both sequential and tree LSTM models for a fair and meaningful comparison."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-97",
"text": "The more indepth comparison of our work and recursive neural network model by Ji and Eisenstein (2015) is provided in the discussion section."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-100",
"text": "The Penn Discourse Treebank (PDTB) We use the PDTB due to its theoretical simplicity in discourse analysis and its reasonably large size."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-101",
"text": "The annotation is done as another layer on the Penn Treebank on Wall Street Journal sections."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-102",
"text": "Each relation consists of two spans of text that are minimally required to infer the relation, and the sense is organized hierarchically."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-103",
"text": "The classification problem can be formulated in various ways based on the hierarchy."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-104",
"text": "Previous work in this task has been done over three schemes of evaluation: top-level 4-way classification (Pitler et al., 2009 ), second-level 11-way classification (Lin et al., 2009; Ji and Eisenstein, 2015) , and modified second-level classification introduced in the CoNLL 2015 Shared Task ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-105",
"text": "We focus on the second-level 11-way classification because the labels are fine-grained enough to be useful for downstream tasks and also because the strongest neural network systems are tuned to this formulation."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-106",
"text": "If an instance is annotated with two labels (\u223c3% of the data), we only use the first label."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-107",
"text": "Partial labels, which constitute \u223c2% of the data, are excluded."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-108",
"text": "Table 4 shows the distribution of labels in the training set (sections 2-21), development set (section 22), and test set (section 23)."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-109",
"text": "Training Weight initialization is uniform random, following the formula recommended by ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-110",
"text": "The cost function is the standard crossentropy loss function, as the hinge loss function (large-margin framework) yields consistently inferior results."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-111",
"text": "We use Adagrad as the optimization algorithm of choice."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-112",
"text": "The learning rates are tuned over a grid search."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-113",
"text": "We monitor the accuracy on the development set to determine convergence and prevent overfitting."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-114",
"text": "L2 regularization and/or dropout do not make a big impact on performance in our case, so we do not use them in the final results."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-115",
"text": "Implementation All of the models are implemented in Theano (Bergstra et al., 2010; Bastien et al., 2012) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-116",
"text": "The gradient computation is done with symbolic differentiation, a functionality provided by Theano."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-117",
"text": "Feedforward models and sequential LSTM models are trained on CPUs on Intel Xeon X5690 3.47GHz, using only a single core per model."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-118",
"text": "A tree LSTM model is trained on a GPU on Intel Xeon CPU E5-2660."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-119",
"text": "All models converge within hours."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-120",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-121",
"text": "**EXPERIMENT ON THE SECOND-LEVEL SENSE IN THE PDTB**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-122",
"text": "We want to test the effectiveness of the interargument interaction and the three models described above on the fine-grained discourse relations in English."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-123",
"text": "The data split and the label set are exactly the same as previous works that use this label set (Lin et al., 2009; Ji and Eisenstein, 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-124",
"text": "Preprocessing All tokenization is taken from the gold standard tokenization in the PTB (Marcus et al., 1993) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-125",
"text": "We use the Berkeley parser to parse all of the data (Petrov et al., 2006 too little data, 50-dimensional WSJ-trained word vectors have previously been shown to be the most effective in this task (Ji and Eisenstein, 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-126",
"text": "Additionally, we also test the off-the-shelf word vectors trained on billions of tokens from Google News data freely available with the word2vec tool."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-127",
"text": "All word vectors are trained on the Skipgram architecture (Mikolov et al., 2013b; Mikolov et al., 2013a) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-128",
"text": "Other models such as GloVe and continuous bag-of-words seem to yield broadly similar results (Pennington et al., 2014) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-129",
"text": "We keep the word vectors fixed, instead of fine-tuning during training."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-130",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-131",
"text": "**RESULTS AND DISCUSSION**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-132",
"text": "The feedforward model performs best overall among all of the neural architectures we explore (Table 2) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-133",
"text": "It outperforms the recursive neural network with bilinear output layer introduced by Ji and Eisenstein (2015) (p < 0.05; bootstrap test) and performs comparably with the surface feature baseline (Lin et al., 2009) , which uses various lexical and syntactic features and extensive feature selection."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-134",
"text": "Tree LSTM achieves inferior accuracy than our best feedforward model."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-135",
"text": "The Figure 3: Inter-argument interaction can be modeled effectively with hidden layers."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-136",
"text": "The results are shown for the feedforward models with summation pooling, but this effect can be observed robustly in all architectures we consider."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-137",
"text": "best configuration of the feedforward model uses 300-dimensional word vectors, one hidden layer, and the summation pooling function to derive argument feature vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-138",
"text": "The model behaves well during training and converges in less than an hour on a CPU."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-139",
"text": "The sequential LSTM model outperforms the feedforward model when word vectors are not high-dimensional and not trained on a large corpus (Figure 4) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-160",
"text": "In previous work, LSTMs are applied to tasks with a lot of labeled data compared to mere 12,930 instances that we have (Vinyals et al., 2015; Chiu and Nichols, 2015; \u0130rsoy and Cardie, 2014) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-140",
"text": "Moving from 50 to 100 dimensions trained on the same dataset, we do not observe much of a difference in performance in either architecture, but the sequential LSTM model beats the feedforward model in both settings."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-141",
"text": "This suggests that only 50 dimensions are needed for the WSJ corpus."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-142",
"text": "However, the trend reverses when we move to 300-dimensional word vectors trained on a much larger corpus."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-143",
"text": "These results suggest an interaction between the lexical information encoded by word vectors and the structural information encoded by the model itself."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-144",
"text": "Hidden layers, especially the first one, make a substantial impact on performance."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-145",
"text": "This effect is observed across all architectures (Figure 3) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-146",
"text": "Strikingly, the improvement can be as high as 8% absolute when used with the feedforward model with small word vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-147",
"text": "We tried up to four hidden layers and found that the additional hidden layers yield diminishing, if not negative, returns."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-148",
"text": "These effects are not an artifact of the training process as we have tuned the models quite extensively, although it might be the case that we do not have sufficient data to fit those extra parameters."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-149",
"text": "Summation pooling is effective for both feedforward and LSTM models (Figure 2) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-150",
"text": "The word vectors we use have been claimed to have additive properties (Mikolov et al., 2013b), and the effectiveness of summation pooling in this experiment supports this claim."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-151",
"text": "Max pooling is only effective for LSTM, probably because the values in the word vector encode the abstract features of each word relative to each other."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-152",
"text": "It can be trivially shown that if all of the vectors are multiplied by -1, then the results from max pooling will be totally different, but the word similarities remain the same."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-153",
"text": "The memory cells and the state vectors in the LSTM models transform the original word vectors to work well with the max pooling operation, but the feedforward net cannot transform the word vectors to work well with max pooling, as it is not allowed to change the word vectors themselves."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-154",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-155",
"text": "**DISCUSSION**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-156",
"text": "Why does the feedforward model outperform the LSTM models?"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-157",
"text": "Sequential and tree LSTM models might work better if we were given a larger amount of data."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-158",
"text": "We observe that LSTM models outperform the feedforward model when word vectors are smaller, so it is unlikely that we trained the LSTMs incorrectly."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-159",
"text": "It is more likely that we do not have enough annotated data to train a more powerful model such as LSTM."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-161",
"text": "Another explanation comes from the fact that the contextual information encoded in the word vectors can compensate for the lack of structure in the model in this task."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-162",
"text": "Word vectors are already trained to encode words in their linguistic context, especially information from word order."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-163",
"text": "Our discussion would not be complete without explaining our results in relation to the recursive neural network model proposed by Ji and Eisenstein (2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-164",
"text": "Why do sequential LSTM models outperform recursive neural networks or tree LSTM models?"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-165",
"text": "Although this first came as a surprise to us, the results are consistent with recent work that uses sequential LSTMs to encode syntactic information."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-166",
"text": "For example, Vinyals et al. (2015) use sequential LSTM to encode the features for syntactic parse output."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-167",
"text": "Tree LSTM seems to show improvement when there is a need to model long-distance dependencies in the data (Tai et al., 2015; Li et al., 2015)."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-168",
"text": "Furthermore, the benefits of tree LSTM are not readily apparent for a model that discards the syntactic categories in the intermediate nodes and makes no distinction between heads and their dependents, which are at the core of syntactic representations."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-169",
"text": "Another point of contrast between our work and Ji and Eisenstein's (2015) is the modeling choice for inter-argument interaction."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-170",
"text": "Our experimental results show that the hidden layers are an important contributor to the performance for all of our models."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-171",
"text": "We choose linear inter-argument interaction instead of bilinear interaction, and this decision gives us at least two advantages."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-172",
"text": "First, linear interaction allows us to stack up hidden layers without exponential growth in the number of parameters."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-173",
"text": "Secondly, using linear interaction allows us to use high-dimensional word vectors, which we found to be another important contributor to performance."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-174",
"text": "The recursive model by Ji and Eisenstein (2015) is limited to 50 units due to the bilinear layer."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-175",
"text": "Our choice of linear inter-argument interaction and high-dimensional word vectors turns out to be crucial to building a competitive neural network model for classifying implicit discourse relations."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-176",
"text": "**EXTENDING THE RESULTS ACROSS LABEL SETS AND LANGUAGES**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-177",
"text": "Do our feedforward models perform well without surface features across different label sets and languages as well? We want to extend our results to another label set and language by evaluating our models on the non-explicit discourse relation data used in the English and Chinese CoNLL 2016 Shared Task."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-178",
"text": "We will have more confidence in our model if it works well across label sets."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-179",
"text": "It is also important that our model works cross-linguistically because other languages might not have resources such as semantic lexicons or parsers, required by some previously used features."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-180",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-181",
"text": "**ENGLISH DISCOURSE RELATIONS**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-182",
"text": "We follow the experimental setting used in CoNLL 2015-2016 Shared Task as we want to compare our results against previous systems."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-183",
"text": "This setting differs from the previous experiment in a few ways."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-184",
"text": "Entity relations (EntRel) and alternative lexicalization relations (AltLex) are included in this setting."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-185",
"text": "The label set is modified by the shared task organizers into 15 different senses including EntRel as another sense (Xue et al., 2015) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-186",
"text": "We use the 300-dimensional word vectors used in the previous experiment and tune the number of hidden layers and hidden units on the development set."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-187",
"text": "The best results from last year's shared task are used as a strong baseline."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-188",
"text": "It uses only surface features and achieves state-of-the-art performance under this label set (Wang and Lan, 2015)."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-189",
"text": "These features are similar to the ones used by Lin et al. (2009). For Chinese, we evaluate on the Chinese Discourse Treebank (Zhou and Xue, 2015)."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-190",
"text": "The sense set consists of 10 different senses, which are not organized in a hierarchy, unlike the PDTB."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-191",
"text": "We use the version of the data provided to the CoNLL 2016 Shared Task participants."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-192",
"text": "This version has a total of 16,946 discourse relation instances in the combined training and development sets."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-193",
"text": "The test set was not yet available at the time of submission, so the system is evaluated based on the average accuracy over 7-fold cross-validation on the combined training and development sets."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-194",
"text": "There is no previously published baseline for Chinese."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-195",
"text": "To establish baseline comparison, we use MaxEnt models loaded with the feature sets previously shown to be effective for English, namely dependency rule pairs, production rule pairs (Lin et al., 2009) , Brown cluster pairs (Rutherford and Xue, 2014) , and word pairs (Marcu and Echihabi, 2002) ."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-196",
"text": "We use information gain criteria to select the best subset of each feature set, which is crucial in feature-based discourse parsing."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-197",
"text": "Chinese word vectors are induced with the CBOW and Skipgram architectures in word2vec (Mikolov et al., 2013a) on the Chinese Gigaword corpus (Graff and Chen, 2005) using default settings."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-198",
"text": "The number of dimensions that we try are 50, 100, 150, 200, 250, and 300."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-199",
"text": "We induce 1,000 and 3,000 Brown clusters on the Gigaword corpus."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-200",
"text": "Table 4 shows the results for the models best tuned with respect to the number of hidden units, the number of hidden layers, and the type of word vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-201",
"text": "The feedforward variant of our model significantly outperforms the strong baselines in both English and Chinese (p < 0.05; bootstrap test)."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-202",
"text": "This suggests that our approach is robust against different label sets, and our findings are valid across languages."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-203",
"text": "Our Chinese model outperforms all of the feature sets known to work well in English despite using only word vectors."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-204",
"text": "The choice of neural architecture used for inducing Chinese word vectors turns out to be crucial."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-205",
"text": "Chinese word vectors from the Skipgram model perform consistently better than those from the CBOW model (Figure 5)."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-206",
"text": "These two types of word vectors do not show much difference in the English tasks."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-207",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-208",
"text": "**RESULTS**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-210",
"text": "**CONCLUSIONS AND FUTURE WORK**"
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-211",
"text": "We report a series of experiments that systematically probe the effectiveness of various neural network architectures for the task of implicit discourse relation classification."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-212",
"text": "Given the small amount of annotated data, we found that a feedforward variant of our model combined with hidden layers and high-dimensional word vectors outperforms more complicated LSTM models."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-213",
"text": "Our model performs better or competitively against models that use manually crafted surface features, and it is the first neural CDTB-style Chinese discourse parser."
},
{
"sent_id": "dbe1f1bdf7d94824f6f7cd176a4f6d-C001-214",
"text": "We will make our code and models publicly available."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-25"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-39"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-104"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-125"
]
],
"cite_sentences": [
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-25",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-39",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-104",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-125"
]
},
"@SIM@": {
"gold_contexts": [
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-88",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-89",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-90",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-91",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-92"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-123"
]
],
"cite_sentences": [
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-92",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-123"
]
},
"@UNSURE@": {
"gold_contexts": [
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-97"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-163"
]
],
"cite_sentences": [
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-97",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-163"
]
},
"@USE@": {
"gold_contexts": [
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-123"
]
],
"cite_sentences": [
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-123"
]
},
"@DIF@": {
"gold_contexts": [
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-132",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-133"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-169"
],
[
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-173",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-174"
]
],
"cite_sentences": [
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-133",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-169",
"dbe1f1bdf7d94824f6f7cd176a4f6d-C001-174"
]
}
}
},
"ABC_385ce03aee1e3d3de193de09fa1278_10": {
"x": [
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-2",
"text": "Text adventure games, in which players must make sense of the world through text descriptions and declare actions through natural language, provide a stepping stone toward grounding action in language."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-3",
"text": "Prior work has demonstrated that using a knowledge graph as a state representation and question-answering to pre-train a deep Q-network facilitates faster control policy learning."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-24",
"text": "**BACKGROUND AND RELATED WORK**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-4",
"text": "In this paper, we explore the use of knowledge graphs as a representation for domain knowledge transfer for training text-adventure playing reinforcement learning agents."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-5",
"text": "Our methods are tested across multiple computer generated and human authored games, varying in domain and complexity, and demonstrate that our transfer learning methods let us learn a higher-quality control policy faster."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-6",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-7",
"text": "**INTRODUCTION**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-8",
"text": "Text adventure games, in which players must make sense of the world through text descriptions and declare actions through natural language, can provide a stepping stone toward more realworld environments where agents must communicate to understand the state of the world and affect change in the world."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-9",
"text": "Despite the steadily increasing body of research on text-adventure games (Bordes et al., 2010; He et al., 2016; Narasimhan et al., 2015; Fulda et al., 2017; Haroush et al., 2018; Tao et al., 2018; Ammanabrolu and Riedl, 2019) and the ubiquity of deep reinforcement learning applications (Parisotto et al., 2016; Zambaldi et al., 2019), teaching an agent to play text-adventure games remains a challenging task."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-10",
"text": "Learning a control policy for a text-adventure game requires a significant amount of exploration, resulting in training runs that take hundreds of thousands of simulations (Narasimhan et al., 2015; Ammanabrolu and Riedl, 2019) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-11",
"text": "One reason that text-adventure games require so much exploration is that most deep reinforcement learning algorithms are trained on a task without a real prior."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-12",
"text": "In essence, the agent must learn everything about the game from only its interactions with the environment."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-13",
"text": "Yet, text-adventure games make ample use of commonsense knowledge (e.g., an axe can be used to cut wood) and genre themes (e.g., in a horror or fantasy game, a coffin is likely to contain a vampire or other undead monster)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-14",
"text": "This is in addition to the challenges innate to text-adventure games themselves (games are puzzles), which results in inefficient training."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-15",
"text": "Ammanabrolu and Riedl (2019) developed a reinforcement learning agent that modeled the text environment as a knowledge graph and achieved state-of-the-art results on simple text-adventure games provided by the TextWorld environment."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-16",
"text": "They observed that a simple form of transfer from very similar games greatly improved policy training time."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-17",
"text": "However, games beyond the toy TextWorld environments are beyond the reach of state-of-the-art techniques."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-18",
"text": "In this paper, we explore the use of knowledge graphs and associated neural embeddings as a medium for domain transfer to improve training effectiveness on new text-adventure games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-19",
"text": "Specifically, we explore transfer learning at multiple levels and across different dimensions."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-20",
"text": "We first look at the effects of playing a text-adventure game given a strong prior in the form of a knowledge graph extracted from generalized textual walk-throughs of interactive fiction as well as those made specifically for a given game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-21",
"text": "Next, we explore the transfer of control policies in deep Q-learning (DQN) by pre-training portions of a deep Q-network using question-answering and by DQN-to-DQN parameter transfer between games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-22",
"text": "We evaluate these techniques on two different sets of human authored and computer generated games, demonstrating that our transfer learning methods enable us to learn a higher-quality control policy faster."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-23",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-25",
"text": "Text-adventure games, in which an agent must interact with the world entirely through natural language, provide us with two challenges that have proven difficult for deep reinforcement learning to solve (Narasimhan et al., 2015; Haroush et al., 2018; Ammanabrolu and Riedl, 2019) : (1) The agent must act based only on potentially incomplete textual descriptions of the world around it."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-26",
"text": "The world is thus partially observable, as the agent does not have access to the state of the world at any stage."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-27",
"text": "(2) The action space is combinatorially large, a consequence of the agent having to declare commands in natural language."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-28",
"text": "These two problems together have kept commercial text adventure games out of the reach of existing deep reinforcement learning methods, especially given the fact that most of these methods attempt to train on a particular game from scratch."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-29",
"text": "Text-adventure games can be treated as partially observable Markov decision processes (POMDPs)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-30",
"text": "This can be represented as a 7-tuple (S, T, A, \u2126, O, R, \u03b3): the set of environment states, conditional transition probabilities between states, words used to compose text commands, observations, conditional observation probabilities, the reward function, and the discount factor, respectively."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-31",
"text": "Multiple recent works have explored the challenges associated with these games (Bordes et al., 2010; He et al., 2016; Narasimhan et al., 2015; Fulda et al., 2017; Haroush et al., 2018; Tao et al., 2018; Ammanabrolu and Riedl, 2019) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-32",
"text": "Narasimhan et al. (2015) introduce the LSTM-DQN, which learns to score the action verbs and corresponding objects separately and then combine them into a single action."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-33",
"text": "He et al. (2016) propose the Deep Reinforcement Relevance Network that consists of separate networks to encode state and action information, with a final Q-value for a state-action pair that is computed between a pairwise interaction function between these."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-34",
"text": "Haroush et al. (2018) present the Action Elimination Network (AEN), which restricts actions in a state to the top-k most likely ones, using the emulator's feedback."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-35",
"text": "Hausknecht et al. (2019) design an agent that uses multiple modules to identify a general set of game play rules for text games across various domains."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-36",
"text": "None of these works study how to transfer policies between different text-adventure games in any depth and so there exists a gap between the two bodies of work."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-37",
"text": "Transferring policies across different text-adventure games requires implicitly learning a mapping between the games' state and action spaces."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-38",
"text": "The more different the domain of the two games, the harder this task becomes."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-39",
"text": "Previous work (Ammanabrolu and Riedl, 2019) introduced the use of knowledge graphs and question-answering pre-training to aid in the problems of partial observability and a combinatorial action space."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-40",
"text": "This work made use of a system called TextWorld that uses grammars to generate a series of similar (but not exactly the same) games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-41",
"text": "An oracle was used to play perfect games and the traces were used to pre-train portions of the agent's network responsible for encoding the observations, graph, and actions."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-42",
"text": "Their results show that while this form of pre-training improves the quality of the policy at convergence, it does not show a significant improvement in the training time required to reach convergence."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-43",
"text": "Further, it is generally unrealistic to have a corpus of very similar games to draw from."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-44",
"text": "We build on this work, and explore modifications of this algorithm that would enable more efficient transfer in text-adventure games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-45",
"text": "Work in transfer in reinforcement learning has explored the idea of transferring skills (Konidaris and Barto, 2007; Konidaris et al., 2012) or transferring value functions/policies (Liu and Stone, 2006) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-46",
"text": "Other approaches attempt transfer in model-based reinforcement learning (Taylor et al., 2008; Nguyen et al., 2012; Gasic et al., 2013; Wang et al., 2015; Joshi and Chowdhary, 2018) , though traditional approaches here rely heavily on hand crafting state-action mappings across domains."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-47",
"text": "Narasimhan et al. (2017) learn to play games by predicting mappings across domains using both deep Q-networks and value iteration networks, finding that grounding the game state in natural language descriptions of the game itself aids significantly in transferring useful knowledge between domains."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-48",
"text": "In transfer for deep reinforcement learning, Parisotto et al. (2016) propose the Actor-Mimic network which learns from expert policies for a source task using policy distillation and then initializes the network for a target task using these parameters."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-49",
"text": "Yin and Pan (2017) also use policy distillation, using task-specific features as inputs to a multi-task policy network, and use a hierarchical experience sampling method to train this multi-task network."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-50",
"text": "Similarly, Rusu et al. (2016) attempt to transfer parameters by using frozen parameters trained on source tasks to help learn a new set of parameters on target tasks."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-51",
"text": "Rajendran et al. (2017) attempt something similar but use attention networks to transfer expert policies between tasks."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-52",
"text": "These works, however, do not study the requirements for enabling efficient transfer for tasks rooted in natural language, nor do they explore the use of knowledge graphs as a state representation."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-53",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-54",
"text": "**KNOWLEDGE GRAPHS FOR DQNS**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-55",
"text": "A knowledge graph is a directed graph formed by a set of semantic, or RDF, triples of the form (subject, relation, object), for example, (vampires, are, undead)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-56",
"text": "We follow the open-world assumption that what is not in our knowledge graph can either be true or false."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-57",
"text": "Ammanabrolu and Riedl (2019) introduced the Knowledge Graph DQN (KG-DQN) and touched on some aspects of transfer learning, showing that pre-training portions of the deep Q-network using a question-answering system on perfect playthroughs of a game increases the quality of the learned control policy for a generated text-adventure game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-58",
"text": "We build on this work and use KG-DQN to explore transfer with both knowledge graphs and network parameters."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-59",
"text": "Specifically, we seek to transfer skills and knowledge (a) from static text documents describing game play and (b) from playing one text-adventure game to a second complete game in the same genre (e.g., horror games)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-60",
"text": "The rest of this section describes KG-DQN in detail and summarizes our modifications."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-61",
"text": "For each step that the agent takes, it automatically extracts a set of RDF triples from the received observation through the use of OpenIE (Angeli et al., 2015), in addition to a few rules to account for the regularities of text-adventure games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-62",
"text": "The graph itself is more or less a map of the world, with information about objects' affordances and attributes linked to the rooms in which they are placed."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-63",
"text": "The graph also distinguishes between items in the agent's possession and items in its immediate surrounding environment."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-64",
"text": "We make minor modifications to the rules used in Ammanabrolu and Riedl (2019) to better construct such a graph in general interactive fiction environments."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-65",
"text": "The agent also has access to all actions accepted by the game's parser, following Narasimhan et al. (2015) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-66",
"text": "For general interactive fiction environments, we develop our own method to extract this information."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-67",
"text": "This is done by extracting a set of templates accepted by the parser, with the objects or noun phrases in the actions replaced with an OBJ tag."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-68",
"text": "An example of such a template is \"place OBJ in OBJ\"."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-69",
"text": "These OBJ tags are then filled in by looking at all possible objects in the given vocabulary for the game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-70",
"text": "This action space is of the order A = O(|V| \u00d7 |O|^2), where |V| is the number of action verbs and |O| is the number of distinct objects in the world that the agent can interact with."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-71",
"text": "As this is too large a space for a RL agent to effectively explore, the knowledge graph is used to prune this space by ranking actions based on their presence in the current knowledge graph and the relations between the objects in the graph as in (Ammanabrolu and Riedl, 2019) The architecture for the deep Q-network consists of two separate neural networks-encoding state and action separately-with the final Q-value for a state-action pair being the result of a pairwise interaction function between the two (Figure 1 )."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-72",
"text": "We train with a standard DQN training loop; the policy is determined by the Q-value of a particular state-action pair, which is updated using the Bellman equation (Sutton and Barto, 2018) :"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-73",
"text": "( 1) where \u03b3 refers to the discount factor and r t+1 is the observed reward."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-74",
"text": "The whole system is trained using prioritized experience replay Lin (1993) , a modified version of -greedy learning, and a temporal difference loss that is computed as:"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-75",
"text": "where A k+1 represents the action set at step k + 1 and s t , a t refer to the encoded state and action representations respectively."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-76",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-77",
"text": "**KNOWLEDGE GRAPH SEEDING**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-78",
"text": "In this section we consider the problem of transferring a knowledge graph from a static text resource to a DQN-which we refer to as seeding."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-79",
"text": "KG-DQN uses a knowledge graph as a state representation and also to prune the action space."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-80",
"text": "This graph is built up over time, through the course of the agent's exploration."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-81",
"text": "When the agent first starts the game, however, this graph is empty and does not help much in the action pruning process."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-82",
"text": "The agent thus wastes a large number of steps near the beginning of each game exploring ineffectively."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-83",
"text": "The intuition behind seeding the knowledge graph from another source is to give the agent a prior on which actions have a higher utility and thereby enabling more effective exploration."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-84",
"text": "Textadventure games typically belong to a particular genre of storytelling-e.g."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-85",
"text": ", horror, sci-fi, or soap opera-and an agent is at a distinct disadvantage if it doesn't have any genre knowledge."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-86",
"text": "Thus, the goal of seeding is to give the agent a strong prior."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-87",
"text": "This seed knowledge graph is extracted from online general text-adventure guides as well as game/genre specific guides when available."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-88",
"text": "2 The graph is extracted from this the guide using a subset of the rules described in Section 3 used to extract information from the game observations, with the remainder of the RDF triples coming from OpenIE."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-89",
"text": "There is no map of rooms in the environment that can be built, but it is possible to 2 An example of a guide we use is found here http:// www.microheaven.com/IFGuide/step3.html extract information regarding affordances of frequently occurring objects as well as common actions that can be performed across a wide range of text-adventure games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-90",
"text": "This extracted graph is thus potentially disjoint, containing only this generalizable information, in contrast to the graph extracted during the rest of the exploration process."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-91",
"text": "An example of a graph used to seed KG-DQN is given in Fig. 2 ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-92",
"text": "The KG-DQN is initialized with this knowledge graph."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-93",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-94",
"text": "**TASK SPECIFIC TRANSFER**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-95",
"text": "The overarching goal of transfer learning in textadventure games is to be able to train an agent on one game and use this training on to improve the learning capabilities of another."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-96",
"text": "There is growing body of work on improving training times on target tasks by transferring network parameters trained on source tasks (Rusu et al., 2016; Yin and Pan, 2017; Rajendran et al., 2017) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-97",
"text": "Of particular note is the work by Rusu et al. (2016) , where they train a policy on a source task and then use this to help learn a new set of parameters on a target task."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-98",
"text": "In this approach, decisions made during the training of the target task are jointly made using the frozen parameters of the transferred policy network as well as the current policy network."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-99",
"text": "Our system first trains a question-answering system (Chen et al., 2017) using traces given by an oracle, as in Section 4."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-100",
"text": "For commercial textadventure games, these traces take the form of state-action pairs generated using perfect walkthrough descriptions of the game found online as described in Section 4."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-101",
"text": "We use the parameters of the questionanswering system to pre-train portions of the deep Q-network for a different game within in the same domain."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-102",
"text": "The portions that are pre-trained are the same parts of the architecture as in Ammanabrolu and Riedl (2019) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-103",
"text": "This game is referred to as the source task."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-104",
"text": "The seeding of the knowledge graph is not strictly necessary but given that state-of-theart DRL agents cannot complete real games, this makes the agent more effective at the source task."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-105",
"text": "We then transfer the knowledge and skills acquired from playing the source task to another game from the same genre-the target task."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-106",
"text": "The parameters of the deep Q-network trained on the source game are used to initialize a new deep Qnetwork for the target task."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-107",
"text": "All the weights indicated in the architecture of KG-DQN as shown in Fig. 1 are transferred."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-108",
"text": "Unlike Rusu et al. (2016), we do not freeze the parameters of the deep Qnetwork trained on the source task nor use the two networks to jointly make decisions but instead just use it to initialize the parameters of the target task deep Q-network."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-109",
"text": "This is done to account for the fact that although graph embeddings can be transferred between games, the actual graph extracted from a game is non-transferable due to differences in structure between the games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-110",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-111",
"text": "**EXPERIMENTS**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-112",
"text": "We test our system on two separate sets of games in different domains using the Jericho and TextWorld frameworks (Hausknecht, 2018; ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-113",
"text": "The first set of games is \"slice of life\" themed and contains games that involve mundane tasks usually set in textual descriptions of normal houses."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-114",
"text": "The second set of games is \"horror\" themed and contains noticeably more difficult games with a relatively larger vocabulary size and action set, non-standard fantasy names, etc."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-115",
"text": "We choose these domains because of the availability of games in popular online gaming communities, the degree of vocabulary overlap within each theme, and overall structure of games in each theme."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-116",
"text": "Specifically, there must be at least three games in each domain: at least one game to train the question-answering system on, and two more to train the parameters of the source and target task deep Q-networks."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-117",
"text": "A summary of the statistics for the games is given in Table 1 ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-118",
"text": "Vocabulary overlap is calculated by measuring the percentage of overlap between a game's vocabulary and the domain's vocabulary, i.e. the union of the vocabularies for all the games we use within the domain."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-119",
"text": "We observe that in both of these domains, the complexity of the game increases steadily from the game used for the question-answering system to the target and then source task games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-120",
"text": "We perform ablation tests within each domain, mainly testing the effects of transfer from seeding, oracle-based question-answering, and sourceto-target parameter transfer."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-121",
"text": "Additionally, there are a couple of extra dimensions of ablations that we study, specific to each of the domains and explained below."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-122",
"text": "All experiments are run three times using different random seeds."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-123",
"text": "For all the experiments we report metrics known to be important for transfer learning tasks (Taylor and Stone, 2009; Narasimhan et al., 2017) : average reward collected in the first 50 episodes (init. reward), average reward collected for 50 episodes after convergence (final reward), and number of steps taken to finish the game for 50 episodes after convergence (steps)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-124",
"text": "For the metrics tested after convergence, we set = 0.1 following both Narasimhan et al. (2015) and Ammanabrolu and Riedl (2019) ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-125",
"text": "We use similar hyperparameters to those reported in (Ammanabrolu and Riedl, 2019) for training the KG-DQN with action pruning, with the main difference being that we use 100 dimensional word embeddings instead of 50 dimensions for the horror genre."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-126",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-127",
"text": "**SLICE OF LIFE EXPERIMENTS**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-128",
"text": "TextWorld uses a grammar to generate similar games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-129",
"text": "Following Ammanabrolu and Riedl (2019), we use TextWorld's \"home\" theme to generate the games for the question-answering system."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-130",
"text": "TextWorld is a framework that uses a grammar to randomly generate game worlds and quests."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-131",
"text": "This framework also gives us information such as instructions on how to finish the quest, and a list of actions that can be performed at each step based on the current world state."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-132",
"text": "We do not let our agent access this additional solution information or admissible actions list."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-133",
"text": "Given the relatively small quest length for TextWorld games-games can be completed in as little as 5 steps-we generate 50 such games and partition them into train and test sets in a 4:1 ratio."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-134",
"text": "The traces are generated on the training set, and the question-answering system is evaluated on the test set."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-135",
"text": "We then pick a random game from the test set to train our source task deep Q-network for this domain."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-136",
"text": "For this training, we use the reward function provided by TextWorld: +1 for each action taken that moves the agent closer to finishing the quest; -1 for each action taken that extends the minimum number of steps needed to finish the quest from the current stage; 0 for all other situations."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-137",
"text": "We choose the game, 9:05 3 as our target task game due to similarities in structure in addition to the vocabulary overlap."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-138",
"text": "Note that there are multiple possible endings to this game and we pick the simplest one for the purpose of training our agent."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-139",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-140",
"text": "**OUTSIDE THE REAL ESTATE OFFICE**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-141",
"text": "A grim little cul-de-sac, tucked away in a corner of the claustrophobic tangle of narrow, twisting avenues that largely constitute the older portion of Anchorhead."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-142",
"text": "Like most of the streets in this city, it is ancient, shadowy, and leads essentially nowhere."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-143",
"text": "The lane ends here at the real estate agent's office, which lies to the east, and winds its way back toward the center of town to the west."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-144",
"text": "A narrow, garbage-choked alley opens to the southeast."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-145",
"text": ">go southeast"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-146",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-147",
"text": "**ALLEY**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-148",
"text": "This narrow aperture between two buildings is nearly blocked with piles of rotting cardboard boxes and overstuffed garbage cans."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-149",
"text": "Ugly, halfcrumbling brick walls to either side totter oppressively over you."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-150",
"text": "The alley ends here at a tall, wooden fence."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-151",
"text": "High up on the wall of the northern building there is a narrow, transom-style window."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-165",
"text": "'re in Figure 4 : Partial unseeded knowledge graph example given observations and actions in the game Anchorhead."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-166",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-167",
"text": "**HORROR EXPERIMENTS**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-168",
"text": "For the horror domain, we choose Lurking Horror 4 to train the question-answering system on."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-169",
"text": "The source and target task games are chosen as Afflicted 5 and Anchorhead 6 respectively."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-170",
"text": "However, due to the size and complexity of these two games some modifications to the games are required for the agent to be able to effectively solve them."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-171",
"text": "We partition each of these games and make them smaller by reducing the final goal of the game to an intermediate checkpoint leading to it."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-172",
"text": "This checkpoints were identified manually using walkthroughs of the game; each game has a natural intermediate goal."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-173",
"text": "For example, Anchorhead is segmented into 3 chapters in the form of objectives spread across 3 days, of which we use only the first chapter."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-174",
"text": "The exact details of the games after partitioning is described in Table 1 ."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-175",
"text": "For Lurking Horror, we report numbers relevant for the oracle walkthrough."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-176",
"text": "We then pre-prune the action space and use only the actions that are relevant for the sections of the game that we have partitioned out."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-177",
"text": "The majority of the environment is still available for the agent to explore but the game ends upon completion of the chosen intermediate checkpoint."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-178",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-179",
"text": "**REWARD AUGMENTATION**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-180",
"text": "The combined state-action space for a commercial text-adventure game is quite large and the corresponding reward function is very sparse in comparison."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-181",
"text": "The default, implied reward signal is to receive positive value upon completion of the game, and no reward value elsewhere."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-182",
"text": "This is problematic from an experimentation perspective as text-adventure games are too complex for even state-of-the-art deep reinforcement learning agents to complete."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-183",
"text": "Even using transfer learning methods, a sparse reward signal usually results in ineffective exploration by the agent."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-184",
"text": "To make experimentation feasible, we augment the reward to give the agent a dense reward signal."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-185",
"text": "Specifically, we use an oracle to generate state-action traces (identical to how as when train- ing the question-answering system)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-186",
"text": "An oracle is an agent that is capable of playing and finishing a game perfectly in the least number of steps possible."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-187",
"text": "The state-action pairs generated using perfect walkthroughs of the game are then used as checkpoints and used to give the agent additional reward."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-188",
"text": "If the agent encounters any of these stateaction pairs when training, i.e. performs the right action given a corresponding state, it receives a proportional reward in addition to the standard reward built into the game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-189",
"text": "This reward is scaled based on the game and is designed to be less than the smallest reward given by the original reward function to prevent it from overpowering the builtin reward."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-190",
"text": "We refer to agents using this technique as having \"dense\" reward and \"sparse\" reward otherwise."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-191",
"text": "The agent otherwise receives no information from the oracle about how to win the game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-192",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-193",
"text": "**RESULTS/DISCUSSION**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-194",
"text": "The structure of the experiments are such that the for each of the domains, the target task game is more complex that the source task game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-195",
"text": "The slice of life games are also generally less complex than the horror games; they have a simpler vocabulary and a more linear quest structure."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-196",
"text": "Additionally, given the nature of interactive fiction games, it is nearly impossible-even for human playersto achieve completion in the minimum number of steps (as given by the steps to completion in Table 1); each of these games are puzzle based and require extensive exploration and interaction with various objects in the environment to complete."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-197",
"text": "Table 2 and Table 3 show results for the slice of life and horror domains, respectively."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-198",
"text": "In both do- mains seeding and QA pre-training improve performance by similar amounts from the baseline on both the source and target task games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-199",
"text": "A series of t-tests comparing the results of the pre-training and graph seeding with the baseline KG-DQN show that all results are significant with p < 0.05."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-200",
"text": "Both the pre-training and graph seeding perform similar functions in enabling the agent to explore more effectively while picking high utility actions."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-201",
"text": "Even when untuned, i.e. evaluating the agent on the target task after having only trained on the source task, the agent shows better performance than training on the target task from scratch using the sparse reward."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-202",
"text": "As expected, we see a further gain in performance when the dense reward function is used for both of these domains as well."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-203",
"text": "In the horror domain, the agent fails to converge to a state where it is capable of finishing the game without the dense reward function due to the horror games being more complex."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-204",
"text": "When an agent is trained using on just the target task horror game, Anchorhead, it does not converge to completion and only gets as far as achieving a reward of approximately 7 (max."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-205",
"text": "observed reward from the best model is 41)."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-206",
"text": "This corresponds to a point in the game where the player is required to use a term in an action that the player has never observed before, \"look up Verlac\" when in front of a certain file cabinet-\"Verlac\" being the unknown entity."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-207",
"text": "Without seeding or QA pretraining, the agent is unable to cut down the action space enough to effectively explore and find the solution to progress further."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-208",
"text": "The relative effectiveness of the gains in initial reward due to seeding appears to depend on the game and the corresponding static text document."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-209",
"text": "In all situations except Anchohead, seeding provides comparable gains in initial reward as compared to QA -there is no statistical difference between the two when performing similar t-tests."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-210",
"text": "When the full system is used-i.e."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-211",
"text": "we seed the knowledge graph, pre-train QA, then train the source task game, then the target task game using the augmented reward function-we see a significant gain in performance, up to an 80% gain in terms of completion steps in some cases."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-212",
"text": "The bottleneck at reward 7 is still difficult to pass, however, as seen in Fig. 6 , in which we can see that the agent spends a relatively long time around this reward level unless the full transfer technique is used."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-213",
"text": "We further see in Figures 5, 6 that transferring knowledge results in the agent learning this higher quality policy much faster."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-214",
"text": "In fact, we note that training a full system is more efficient than just training the agent on a single task, i.e. training a QA system then a source task game for 50 episodes then transferring and training a seeded target task game for 50 episodes is more effective than just training the target task game by itself for even 150+ episodes."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-215",
"text": "----------------------------------"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-216",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-217",
"text": "We have demonstrated that using knowledge graphs as a state representation enables efficient transfer between deep reinforcement learning agents designed to play text-adventure games, reducing training times and increasing the quality of the learned control policy."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-218",
"text": "Our results show that we are able to extract a graph from a general static text resource and use that to give the agent knowledge regarding domain specific vocabulary, object affordances, etc."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-219",
"text": "Additionally, we demonstrate that we can effectively transfer knowledge using deep Q-network parameter weights, either by pretraining portions of the network using a questionanswering system or by transferring parameters from a source to a target game."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-220",
"text": "Our agent trains faster overall, including the number of episodes required to pre-train and train on a source task, and performs up to 80% better on convergence than an agent not utilizing these techniques."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-221",
"text": "We conclude that knowledge graphs enable transfer in deep reinforcement learning agents by providing the agent with a more explicit-and interpretable-mapping between the state and action spaces of different games."
},
{
"sent_id": "385ce03aee1e3d3de193de09fa1278-C001-222",
"text": "This mapping helps overcome the challenges twin challenges of partial observability and combinatorially large action spaces inherent in all text-adventure games by allowing the agent to better explore the stateaction space."
}
],
"y": {
"@BACK@": {
"gold_contexts": [
[
"385ce03aee1e3d3de193de09fa1278-C001-9"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-10"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-25"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-31"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-39"
]
],
"cite_sentences": [
"385ce03aee1e3d3de193de09fa1278-C001-9",
"385ce03aee1e3d3de193de09fa1278-C001-10",
"385ce03aee1e3d3de193de09fa1278-C001-25",
"385ce03aee1e3d3de193de09fa1278-C001-31",
"385ce03aee1e3d3de193de09fa1278-C001-39"
]
},
"@EXT@": {
"gold_contexts": [
[
"385ce03aee1e3d3de193de09fa1278-C001-64"
]
],
"cite_sentences": [
"385ce03aee1e3d3de193de09fa1278-C001-64"
]
},
"@USE@": {
"gold_contexts": [
[
"385ce03aee1e3d3de193de09fa1278-C001-71"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-102"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-124"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-129"
]
],
"cite_sentences": [
"385ce03aee1e3d3de193de09fa1278-C001-71",
"385ce03aee1e3d3de193de09fa1278-C001-102",
"385ce03aee1e3d3de193de09fa1278-C001-124",
"385ce03aee1e3d3de193de09fa1278-C001-129"
]
},
"@SIM@": {
"gold_contexts": [
[
"385ce03aee1e3d3de193de09fa1278-C001-71"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-102"
],
[
"385ce03aee1e3d3de193de09fa1278-C001-125"
]
],
"cite_sentences": [
"385ce03aee1e3d3de193de09fa1278-C001-71",
"385ce03aee1e3d3de193de09fa1278-C001-102",
"385ce03aee1e3d3de193de09fa1278-C001-125"
]
},
"@DIF@": {
"gold_contexts": [
[
"385ce03aee1e3d3de193de09fa1278-C001-125"
]
],
"cite_sentences": [
"385ce03aee1e3d3de193de09fa1278-C001-125"
]
}
}
},
"ABC_60cc075e5351a756de8f9919d5a84e_10": {
"x": [
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-117",
"text": "Section 4.2 discusses the use of various settings on the supertagger."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-46",
"text": "Figure 1 gives an example sentence supertagged with the correct CCG lexical categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-1",
"text": "**ABSTRACT**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-2",
"text": "This paper describes the role of supertagging in a wide-coverage CCG parser which uses a log-linear model to select an analysis."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-3",
"text": "The supertagger reduces the derivation space over which model estimation is performed, reducing the space required for discriminative training."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-4",
"text": "It also dramatically increases the speed of the parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-5",
"text": "We show that large increases in speed can be obtained by tightly integrating the supertagger with the CCG grammar and parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-6",
"text": "This is the first work we are aware of to successfully integrate a supertagger with a full parser which uses an automatically extracted grammar."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-7",
"text": "We also further reduce the derivation space using constraints on category combination."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-8",
"text": "The result is an accurate wide-coverage CCG parser which is an order of magnitude faster than comparable systems for other linguistically motivated formalisms."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-9",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-10",
"text": "**INTRODUCTION**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-11",
"text": "Lexicalised grammar formalisms such as Lexicalized Tree Adjoining Grammar (LTAG) and Combinatory Categorial Grammar (CCG) assign one or more syntactic structures to each word in a sentence which are then manipulated by the parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-12",
"text": "Supertagging was introduced for LTAG as a way of increasing parsing efficiency by reducing the number of structures assigned to each word (Bangalore and Joshi, 1999) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-13",
"text": "Supertagging has more recently been applied to CCG (Clark, 2002)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-14",
"text": "Supertagging accuracy is relatively high for manually constructed LTAGs (Bangalore and Joshi, 1999) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-15",
"text": "However, for LTAGs extracted automatically from the Penn Treebank, performance is much lower (Chen et al., 1999; Chen et al., 2002) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-16",
"text": "In fact, performance for such grammars is below that needed for successful integration into a full parser (Sarkar et al., 2000) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-17",
"text": "In this paper we demonstrate that CCG supertagging accuracy is not only sufficient for accurate and robust parsing using an automatically extracted grammar, but also offers several practical advantages."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-18",
"text": "Our wide-coverage CCG parser uses a log-linear model to select an analysis."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-19",
"text": "The model parameters are estimated using a discriminative method, that is, one which requires all incorrect parses for a sentence as well as the correct parse."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-20",
"text": "Since an automatically extracted CCG grammar can produce an extremely large number of parses, the use of a supertagger is crucial in limiting the total number of parses for the training data to a computationally manageable number."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-21",
"text": "The supertagger is also crucial for increasing the speed of the parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-22",
"text": "We show that spectacular increases in speed can be obtained, without affecting accuracy or coverage, by tightly integrating the supertagger with the CCG grammar and parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-23",
"text": "To achieve maximum speed, the supertagger initially assigns only a small number of CCG categories to each word, and the parser only requests more categories from the supertagger if it cannot provide an analysis."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-24",
"text": "We also demonstrate how extra constraints on the category combinations, and the application of beam search using the parsing model, can further increase parsing speed."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-25",
"text": "This is the first work we are aware of to successfully integrate a supertagger with a full parser which uses a lexicalised grammar automatically extracted from the Penn Treebank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-26",
"text": "We also report significantly higher parsing speeds on newspaper text than any previously reported for a full wide-coverage parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-27",
"text": "Our results confirm that wide-coverage CCG parsing is feasible for many large-scale NLP tasks."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-28",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-29",
"text": "**CCG SUPERTAGGING**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-30",
"text": "Parsing using CCG can be viewed as a two-stage process: first assign lexical categories to the words in the sentence, and then combine the categories together using CCG's combinatory rules."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-31",
"text": "1 The first stage can be accomplished by simply assigning to each word all categories from the word's entry in the lexicon (Hockenmaier, 2003) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-32",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-33",
"text": "**1**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-34",
"text": "See Steedman (2000) for an introduction to CCG, and see and Hockenmaier (2003) for an introduction to wide-coverage parsing using CCG."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-35",
"text": "An alternative is to use a statistical tagging approach to assign one or more categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-36",
"text": "A statistical model can be used to determine the most likely categories given the word's context."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-37",
"text": "The advantage of this supertagging approach is that the number of categories assigned to each word can be reduced, with a correspondingly massive reduction in the number of derivations."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-38",
"text": "Bangalore and Joshi (1999) use a standard Markov model tagger to assign LTAG elementary trees to words."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-39",
"text": "Here we use the Maximum Entropy models described in ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-40",
"text": "An advantage of the Maximum Entropy approach is that it is easy to encode a wide range of potentially useful information as features; for example, Clark (2002) has shown that POS tags provide useful information for supertagging."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-41",
"text": "The next section describes the set of lexical categories used by our supertagger and parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-42",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-43",
"text": "**THE LEXICAL CATEGORY SET**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-44",
"text": "The set of lexical categories is obtained from CCGbank (Hockenmaier, 2003), a corpus of CCG normal-form derivations derived semi-automatically from the Penn Treebank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-45",
"text": "Following Clark (2002) , we apply a frequency cutoff to the training set, only using those categories which appear at least 10 times in sections 2-21."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-47",
"text": "Table 1 gives the number of different category types and shows the coverage on training (seen) and development (unseen) data (section 00 from CCGbank)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-48",
"text": "The table also gives statistics for the complete set containing every lexical category type in CCGbank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-49",
"text": "2 These figures show that using a frequency cutoff can significantly reduce the size of the category set with only a small loss in coverage."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-50",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-51",
"text": "**2**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-52",
"text": "The numbers differ slightly from those reported in Clark (2002) since a newer version of CCGbank is being used here."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-53",
"text": "Clark (2002) compares the size of grammars extracted from CCGbank with automatically extracted LTAGs."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-54",
"text": "The grammars of Chen and Vijay-Shanker (2000) contain between 2,000 and 9,000 tree frames, depending on the parameters used in the extraction process, significantly more elementary structures than the number of lexical categories derived from CCGbank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-55",
"text": "We hypothesise this is a key factor in the higher accuracy for supertagging using a CCG grammar compared with an automatically extracted LTAG."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-56",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-57",
"text": "**THE TAGGING MODEL**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-58",
"text": "The supertagger uses probabilities p(y|x) where y is a lexical category and x is a context."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-59",
"text": "The conditional probabilities have the following log-linear form:"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-60",
"text": "where f i is a feature, \u03bb i is the corresponding weight, and Z(x) is a normalisation constant."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-61",
"text": "The context is a 5-word window surrounding the target word."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-62",
"text": "Features are defined for each word in the window and for the POS tag of each word."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-63",
"text": "describes the model and explains how Generalised Iterative Scaling, together with a Gaussian prior for smoothing, can be used to set the weights."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-64",
"text": "The supertagger in Curran and finds the single most probable category sequence given the sentence, and uses additional features defined in terms of the previously assigned categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-65",
"text": "The per-word accuracy is between 91 and 92% on unseen data in CCGbank; however, Clark (2002) shows this is not high enough for integration into a parser since the large number of incorrect categories results in a significant loss in coverage."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-66",
"text": "Clark (2002) shows how the models in (1) can be used to define a multi-tagger which can assign more than one category to a word."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-67",
"text": "For each word in the sentence, the multi-tagger assigns all those categories whose probability according to (1) is within some factor, \u03b2, of the highest probability category for the word."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-68",
"text": "We follow Clark (2002) in ignoring the features based on the previously assigned categories; therefore every tagging decision is local and the Viterbi algorithm is not required."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-69",
"text": "This simple approach has the advantage of being very efficient, and we find that it is accurate enough to enable highly accurate parsing."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-70",
"text": "However, a method which used the forward-backward algorithm to sum over all possible sequences, or some other method which took into account category sequence information, may well improve the results."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-71",
"text": "For words seen at least k times in the training data, the tagger can only assign categories appearing in the word's entry in the tag dictionary."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-72",
"text": "Each entry in the tag dictionary is a list of all the categories seen with that word in the training data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-73",
"text": "For words seen less than k times, we use an alternative dictionary based on the word's POS tag: the tagger can only assign categories that have been seen with the POS tag in the training data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-74",
"text": "A value of k = 20 was used in this work, and sections 2-21 of CCGbank were used as training data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-75",
"text": "Table 2 gives the per-word accuracy (acc) on section 00 for various values of \u03b2, together with the average number of categories per word."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-76",
"text": "The sent acc column gives the percentage of sentences whose words are all supertagged correctly."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-77",
"text": "The figures for \u03b2 = 0.01 k=100 correspond to a value of 100 for the tag dictionary parameter k. The set of categories assigned to a word is considered correct if it contains the correct category."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-78",
"text": "The table gives results for gold standard POS tags and, in the final 2 columns, for POS tags automatically assigned by the Curran and tagger."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-79",
"text": "The drop in accuracy is expected given the importance of POS tags as features."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-80",
"text": "The figures for \u03b2 = 0 are obtained by assigning all categories to a word from the word's entry in the tag dictionary."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-81",
"text": "For words which appear less than 20 times in the training data, the dictionary based on the word's POS tag is used."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-82",
"text": "The table demonstrates the significant reduction in the average number of categories that can be achieved through the use of a supertagger."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-83",
"text": "To give one example, the number of categories in the tag dictionary's entry for the word 'is' is 45 (only considering categories which have appeared at least 10 times in the training data)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-84",
"text": "However, in the sentence 'Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.', the supertagger correctly assigns 1 category to 'is' for \u03b2 = 0.1, and 3 categories for \u03b2 = 0.01."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-85",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-86",
"text": "**THE PARSER**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-87",
"text": "The parser is described in detail in Clark and Curran (2004) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-88",
"text": "It takes POS tagged sentences as input with each word assigned a set of lexical categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-89",
"text": "A packed chart is used to efficiently represent all of the possible analyses for a sentence, and the CKY chart parsing algorithm described in Steedman (2000) is used to build the chart."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-90",
"text": "Clark and Curran (2004) evaluate a number of log-linear parsing models for CCG."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-91",
"text": "In this paper we use the normal-form model, which defines probabilities with the conditional log-linear form in (1), where y is a derivation and x is a sentence."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-92",
"text": "Features are defined in terms of the local trees in the derivation, including lexical head information and word-word dependencies."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-93",
"text": "The normal-form derivations in CCGbank provide the gold standard training data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-163",
"text": "In this way the parser interacts much more closely with the supertagger."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-209",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-94",
"text": "The feature set we use is from the best performing normal-form model in Clark and Curran (2004) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-95",
"text": "For a given sentence the output of the parser is a dependency structure corresponding to the most probable derivation, which can be found using the Viterbi algorithm."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-96",
"text": "The dependency relations are defined in terms of the argument slots of CCG lexical categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-97",
"text": "and Clark and Curran (2004) give a detailed description of the dependency structures."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-98",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-99",
"text": "**MODEL ESTIMATION**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-100",
"text": "In Clark and Curran (2004) we describe a discriminative method for estimating the parameters of a log-linear parsing model."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-101",
"text": "The estimation method maximises the following objective function:"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-102",
"text": "The data consists of sentences S_1, ..., S_m, together with gold standard normal-form derivations,"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-103",
"text": "is the log-likelihood of model \u039b, and G(\u039b) is a Gaussian prior term used to avoid overfitting (n is the number of features; \u03bb i is the weight for feature f i ; and \u03c3 is a parameter of the Gaussian)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-104",
"text": "The objective function is optimised using L-BFGS (Nocedal and Wright, 1999) , an iterative algorithm from the numerical optimisation literature."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-105",
"text": "The algorithm requires the gradient of the objective function, and the value of the objective function, at each iteration."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-106",
"text": "Calculation of these values requires all derivations for each sentence in the training data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-107",
"text": "In Clark and Curran (2004) we describe efficient methods for performing the calculations using packed charts."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-108",
"text": "However, a very large amount of memory is still needed to store the packed charts for the complete training data even though the representation is very compact; in we report a memory usage of 30 GB."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-109",
"text": "To handle this we have developed a parallel implementation of the estimation algorithm which runs on a Beowulf cluster."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-110",
"text": "The need for large high-performance computing resources is a disadvantage of our earlier approach."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-111",
"text": "In the next section we show how use of the supertagger, combined with normal-form constraints on the derivations, can significantly reduce the memory requirements for the model estimation."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-112",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-113",
"text": "**GENERATING PARSER TRAINING DATA**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-114",
"text": "Since the training data contains the correct lexical categories, we ensure the correct category is assigned to each word when generating the packed charts for model estimation."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-115",
"text": "Whilst training the parser, the supertagger can be thought of as supplying a number of plausible but incorrect categories for each word; these, together with the correct categories, determine the parts of the parse space that are used in the estimation process."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-116",
"text": "We would like to keep the packed charts as small as possible, but not lose accuracy in the resulting parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-210",
"text": "**CONCLUSIONS**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-118",
"text": "The next section describes how normal-form constraints can further reduce the derivation space."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-119",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-120",
"text": "**NORMAL-FORM CONSTRAINTS**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-121",
"text": "As well as the supertagger, we use two additional strategies for reducing the derivation space."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-122",
"text": "The first, following Hockenmaier (2003) , is to only allow categories to combine if the combination has been seen in sections 2-21 of CCGbank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-123",
"text": "For example, NP/NP could combine with NP/NP according to CCG's combinatory rules (by forward composition), but since this particular combination does not appear in CCGbank the parser does not allow it."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-124",
"text": "The second strategy is to use Eisner's normal-form constraints (Eisner, 1996), which prevent any constituent which is the result of a forward (backward) composition from serving as the primary functor in another forward (backward) composition or a forward (backward) application. (Table 3: Space requirements for model training data.)"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-125",
"text": "Eisner only deals with a grammar without type-raising, and so the constraints do not guarantee a normalform parse when using a grammar extracted from CCGbank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-126",
"text": "However, the constraints are still useful in restricting the derivation space."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-127",
"text": "As far as we are aware, this is the first demonstration of the utility of such constraints for a wide-coverage CCG parser."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-128",
"text": "Table 3 shows the effect of different supertagger settings, and the normal-form constraints, on the size of the packed charts used for model estimation."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-129",
"text": "The disk usage is the space taken on disk by the charts, and the memory usage is the space taken in memory during the estimation process."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-130",
"text": "The training sentences are parsed using a number of nodes from a 64-node Beowulf cluster."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-131",
"text": "3 The time taken to parse the training sentences depends on the supertagging and parsing constraints, and the number of nodes used, but is typically around 30 minutes."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-132",
"text": "The first row of the table corresponds to using the least restrictive \u03b2 value of 0.01, and reverting to \u03b2 = 0.05, and finally \u03b2 = 0.1, if the chart size exceeds some threshold."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-133",
"text": "The threshold was set at 300,000 nodes in the chart."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-134",
"text": "Packed charts are created for approximately 94% of the sentences in sections 2-21 of CCGbank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-135",
"text": "The coverage is not 100% because, for some sentences, the parser cannot provide an analysis, and some charts exceed the node limit even at the \u03b2 = 0.1 level."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-136",
"text": "This strategy was used in our earlier work and, as the table shows, results in very large charts."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-137",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-138",
"text": "**RESULTS (SPACE REQUIREMENTS)**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-139",
"text": "Note that, even with this relaxed setting on the supertagger, the number of categories assigned to each word is only around 3 on average."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-140",
"text": "This suggests that it is only through use of the supertagger that we are able to estimate a log-linear parsing model on all of the training data at all, since without it the memory requirements would be far too great, even for the entire 64-node cluster."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-141",
"text": "4 The second row shows the reduction in size if the parser is only allowed to combine categories which have combined in the training data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-142",
"text": "This significantly reduces the number of categories created using the composition rules, and also prevents the creation of unlikely categories using rule combinations not seen in CCGbank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-143",
"text": "The results show that the memory and disk usage are reduced by approximately 25% using these constraints."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-144",
"text": "The third row shows a further reduction in size when using the Eisner normal-form constraints."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-145",
"text": "Even with the CCGbank rule constraints, the parser still builds many non-normal-form derivations, since CCGbank does contain cases of composition and type-raising."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-146",
"text": "(These are used to analyse some coordination and extraction cases, for example.) The combination of the two types of normal-form constraints reduces the memory requirements by 48% over the original approach."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-147",
"text": "In Clark and Curran (2004) we show that the parsing model resulting from training data generated in this way produces state-of-the-art CCG dependency recovery: 84.6 F-score over labelled dependencies."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-148",
"text": "The final row corresponds to a more restrictive setting on the supertagger, in which a value of \u03b2 = 0.05 is used initially and \u03b2 = 0.1 is used if the node limit is exceeded."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-149",
"text": "The two types of normal-form constraints are also used."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-150",
"text": "In Clark and Curran (2004) we show that using this more restrictive setting has a small negative impact on the accuracy of the resulting parser (about 0.6 F-score over labelled dependencies)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-151",
"text": "However, the memory requirement for training the model is now only 4 GB, a reduction of 87% compared with the original approach."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-152",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-153",
"text": "**PARSING UNSEEN DATA**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-154",
"text": "The previous section showed how to combine the supertagger and parser for the purpose of creating training data, assuming the correct category for each word is known."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-155",
"text": "In this section we describe our approach to tightly integrating the supertagger and parser for parsing unseen data."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-156",
"text": "Our previous approach to parsing unseen data was to use the least restrictive setting of the supertagger which still allows a reasonable compromise between speed and accuracy."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-157",
"text": "Our philosophy was to give the parser the greatest possibility of finding the correct parse, by giving it as many categories as possible, while still retaining reasonable efficiency."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-158",
"text": "4 Another possible solution would be to use sampling methods, e.g. Osborne (2000) ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-159",
"text": "The problem with this approach is that, for some sentences, the number of categories in the chart still gets extremely large and so parsing is unacceptably slow."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-160",
"text": "Hence we applied a limit to the number of categories in the chart, as in the previous section, and reverted to a more restrictive setting of the supertagger if the limit was exceeded."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-161",
"text": "We first used a value of \u03b2 = 0.01, and then reverted to \u03b2 = 0.05, and finally \u03b2 = 0.1."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-162",
"text": "In this paper we take the opposite approach: we start with a very restrictive setting of the supertagger, and only assign more categories if the parser cannot find an analysis spanning the sentence."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-164",
"text": "In effect, the parser is using the grammar to decide if the categories provided by the supertagger are acceptable, and if not the parser requests more categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-165",
"text": "The parser uses the 5 levels given in Table 2 , starting with \u03b2 = 0.1 and moving through the levels to \u03b2 = 0.01 k=100 ."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-166",
"text": "The advantage of this approach is that parsing speeds are much higher."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-167",
"text": "We also show that our new approach slightly increases parsing accuracy over the previous method."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-168",
"text": "This suggests that, given our current parsing model, it is better to rely largely on the supertagger to provide the correct categories rather than use the parsing model to select the correct categories from a very large derivation space."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-169",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-170",
"text": "**RESULTS (PARSE TIMES)**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-171",
"text": "The results in this section are all using the best performing normal-form model in Clark and Curran (2004), which corresponds to row 3 in Table 3."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-172",
"text": "All experiments were run on a 2.8 GHz Intel Xeon P4 with 2 GB RAM."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-173",
"text": "Table 5: Supertagger \u03b2 levels used on section 00"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-175",
"text": "For all of the figures reported on section 23, unless stated otherwise, the parser is able to provide an analysis for 98.5% of the sentences."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-176",
"text": "The parse times and speeds include the failed sentences, but do not include the time taken by the supertagger; however, the supertagger is extremely efficient, and takes less than 6 seconds to supertag section 23, most of which consists of load time for the Maximum Entropy model."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-177",
"text": "The first three rows correspond to the strategy used in our earlier work, which starts with the least restrictive setting of the supertagger."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-178",
"text": "The first value of \u03b2 is 0.01; if the parser cannot find a spanning analysis, this is changed to \u03b2 = 0.01 (k = 100); if the node limit is exceeded (set at 1,000,000 for these experiments), \u03b2 is changed to 0.05."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-179",
"text": "If the node limit is still exceeded, \u03b2 is changed to 0.075, and finally 0.1."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-180",
"text": "The second row has the CCGbank rule restriction applied, and the third row the Eisner normal-form restrictions."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-181",
"text": "The next three rows correspond to our new strategy of starting with the most restrictive setting of the supertagger (\u03b2 = 0.1), and moving through the settings if the parser cannot find a spanning analysis."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-182",
"text": "The table shows that the normal-form constraints have a significant impact on the speed, reducing the parse times for the old strategy by 72%, and reducing the times for the new strategy by 84%."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-183",
"text": "The new strategy also has a spectacular impact on the speed compared with the old strategy, reducing the times by 83% without the normal-form constraints and 90% with the constraints."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-184",
"text": "The 94% coverage row corresponds to using only the first two supertagging levels; the parser ignores the sentence if it cannot get an analysis at the \u03b2 = 0.05 level."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-185",
"text": "The percentage of sentences without an analysis is now 6%, but the parser is extremely fast, processing almost 50 sentences a second."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-186",
"text": "This configuration of the system would be useful for obtaining data for lexical knowledge acquisition, for example, for which large amounts of data are required."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-187",
"text": "The oracle row shows the parser speed when it is provided with only the correct lexical categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-188",
"text": "The parser is extremely fast, and in Clark and Curran (2004) we show that the F-score for labelled dependencies is almost 98%."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-189",
"text": "This demonstrates the large amount of information in the lexical categories, and the potential for improving parser accuracy and efficiency by improving the supertagger."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-190",
"text": "Finally, the first parser beam row corresponds to the parser using a beam search to further reduce the derivation space."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-191",
"text": "The beam search works by pruning categories from the chart: a category can only be part of a derivation if its beam score is within some factor, \u03b1, of the highest scoring category for that cell in the chart."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-192",
"text": "Here we simply use the exponential of the inside score of a category as the beam score; the inside score for a category c is the sum, over all sub-derivations dominated by c, of the weights of the features in those sub-derivations (see Clark and Curran (2004))."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-195",
"text": "The value of \u03b1 that we use here reduces the accuracy of the parser on section 00 by a small amount (0.3% labelled F-score), but has a significant impact on parser speed, reducing the parse times by a further 33%."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-196",
"text": "The final parser beam row combines the beam search with the fast, reduced coverage configuration of the parser, producing speeds of over 50 sentences per second."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-197",
"text": "Table 5 gives the percentage of sentences which are parsed at each supertagger level, for both the new and old parsing strategies."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-198",
"text": "The results show that, for the old approach, most of the sentences are parsed using the least restrictive setting of the supertagger (\u03b2 = 0.01); conversely, for the new approach, most of the sentences are parsed using the most restrictive setting (\u03b2 = 0.1)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-199",
"text": "As well as investigating parser efficiency, we have also evaluated the accuracy of the parser on section 00 of CCGbank, using both parsing strategies together with the normal-form constraints."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-200",
"text": "The new strategy increases the F-score over labelled dependencies by approximately 0.5%, leading to the figures reported in Clark and Curran (2004)."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-201",
"text": "----------------------------------"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-202",
"text": "**COMPARISON WITH OTHER WORK**"
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-203",
"text": "The only other work we are aware of to investigate the impact of supertagging on parsing efficiency is the work of Sarkar et al. (2000) for LTAG."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-204",
"text": "Sarkar et al. did find that LTAG supertagging increased parsing speed, but at a significant cost in coverage: only 1,324 sentences out of a test set of 2,250 received a parse."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-205",
"text": "The parse times reported are also not as good as those reported here: the time taken to parse the 2,250 test sentences was over 5 hours."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-206",
"text": "Kaplan et al. (2004) report high parsing speeds for a deep parsing system which uses an LFG grammar: 1.9 sentences per second for 560 sentences from section 23 of the Penn Treebank."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-207",
"text": "They also report speeds for the publicly available Collins parser (Collins, 1999): 2.8 sentences per second for the same set."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-208",
"text": "The best speeds we have reported for the CCG parser are an order of magnitude faster."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-211",
"text": "This paper has shown that by tightly integrating a supertagger with a CCG parser, very fast parse times can be achieved for Penn Treebank WSJ text."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-212",
"text": "As far as we are aware, the times reported here are an order of magnitude faster than any reported for comparable systems using linguistically motivated grammar formalisms."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-213",
"text": "The techniques we have presented in this paper increase the speed of the parser by a factor of 77."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-214",
"text": "This makes the parser suitable for large-scale NLP tasks."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-215",
"text": "The results also suggest that further improvements can be obtained by improving the supertagger, which should be possible given the simple tagging approach currently being used."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-216",
"text": "The novel parsing strategy of allowing the grammar to decide if the supertagging is likely to be correct suggests a number of interesting possibilities."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-217",
"text": "In particular, we would like to investigate only repairing those areas of the chart that are most likely to contain errors, rather than parsing the sentence from scratch using a new set of lexical categories."
},
{
"sent_id": "60cc075e5351a756de8f9919d5a84e-C001-218",
"text": "This could further increase parsing efficiency."
}
],
"y": {
"@UNSURE@": {
"gold_contexts": [
[
"60cc075e5351a756de8f9919d5a84e-C001-87"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-97"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-192"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-200"
]
],
"cite_sentences": [
"60cc075e5351a756de8f9919d5a84e-C001-87",
"60cc075e5351a756de8f9919d5a84e-C001-97",
"60cc075e5351a756de8f9919d5a84e-C001-192",
"60cc075e5351a756de8f9919d5a84e-C001-200"
]
},
"@BACK@": {
"gold_contexts": [
[
"60cc075e5351a756de8f9919d5a84e-C001-90"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-100"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-147"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-188"
]
],
"cite_sentences": [
"60cc075e5351a756de8f9919d5a84e-C001-90",
"60cc075e5351a756de8f9919d5a84e-C001-100",
"60cc075e5351a756de8f9919d5a84e-C001-147",
"60cc075e5351a756de8f9919d5a84e-C001-188"
]
},
"@USE@": {
"gold_contexts": [
[
"60cc075e5351a756de8f9919d5a84e-C001-94"
],
[
"60cc075e5351a756de8f9919d5a84e-C001-171"
]
],
"cite_sentences": [
"60cc075e5351a756de8f9919d5a84e-C001-94",
"60cc075e5351a756de8f9919d5a84e-C001-171"
]
},
"@MOT@": {
"gold_contexts": [
[
"60cc075e5351a756de8f9919d5a84e-C001-107",
"60cc075e5351a756de8f9919d5a84e-C001-108",
"60cc075e5351a756de8f9919d5a84e-C001-109"
]
],
"cite_sentences": [
"60cc075e5351a756de8f9919d5a84e-C001-107"
]
},
"@DIF@": {
"gold_contexts": [
[
"60cc075e5351a756de8f9919d5a84e-C001-150",
"60cc075e5351a756de8f9919d5a84e-C001-151"
]
],
"cite_sentences": [
"60cc075e5351a756de8f9919d5a84e-C001-150"
]
}
}
}
}